
The blog Digital Scholarship in the Humanities has an interesting essay on this subject:

But how reliable are these electronic texts? Can researchers feel comfortable citing them and using them for text analysis? In my view, the quality of an electronic text and its appropriateness for use in scholarship depend on 6 factors:

* Quality of the scanning: Is the complete page captured? Is the image skewed or distorted? Is the image of sufficient resolution?
* Quality of the OCR/text conversion: Is full text provided? What method was used to produce the text: double-keying or OCR? How accurate is the text? Are the texts marked up in TEI (Text Encoding Initiative)? Are words joined across line breaks? Are running heads preserved?
* Quality of the metadata: Is the bibliographic information accurate? Is it clear what edition you are looking at? If there are multiple volumes, do you know which volume you are getting and how to locate the other volume(s)?
* Terms of use: What are you legally able to do with the digitized work? Can you download the full-text and use tools to analyze it? Is the content freely and openly available, or do you have to pay for use?
* Convenience: Can you easily download the text and store it in your own collection? How much work do you have to do to convert the text into a format appropriate for use with text analysis tools? How hard is it to find the electronic text in the first place? Is there a Zotero translator for the collection?
* Reputation: Is the digital archive well-regarded in the scholarly community? If you cited the archive in your bibliography, would fellow researchers question your decision? Does the archive provide clear information about its process for selecting, digitizing, and preserving texts?

I focused my evaluation on the main collections that I plumbed for the primary source works in my dissertation bibliography: Google Books (GB), Open Content Alliance (OCA), Early American Fiction (EAF), Project Gutenberg (PG), and Making of America (MOA). I found the OCA works in the Internet Archive (they are marked as belonging to the “American Libraries” or “Canadian Libraries” collections). I apologize in advance for the length of this post, but I want to dig into the details. ...



A book in progress by

Siva Vaidhyanathan

This blog, the result of a collaboration between me and the Institute for the Future of the Book, is dedicated to exploring the process of writing a critical interpretation of the actions and intentions behind the cultural behemoth that is Google, Inc. The book will answer three key questions: What does the world look like through the lens of Google? How is Google's ubiquity affecting the production and dissemination of knowledge? And how has the corporation altered the rules and practices that govern other companies, institutions, and states?




Other books by Siva:


Rewiring the Nation: The Place of Technology in American Studies (Johns Hopkins University Press, 2007)

The Anarchist in the Library (Basic Books, 2004)

Copyrights and Copywrongs: The Rise of Intellectual Property and How it Threatens Creativity (New York University Press, 2001)

