

Economist Paul Courant was the provost of the University of Michigan when it decided to be the first and boldest partner of Google for the library scanning project that has become part of Google Book Search.

Paul recently became the University Librarian at Michigan.

On his new blog, Au Courant, he addresses criticisms raised by Brewster Kahle, me, and others.

November 4, 2007: On being in bed with Google. Filed in Mass digitization, Google, Michigan, Libraries.

One of the things that surprises me most about reactions to the Google Library Project is that smart people whom I respect seem to think that the only reason that a university library would be involved with Google is because, in some combination, its leadership is stupid, evil, or at best intellectually lazy. To the contrary, although I may be proved wrong, I believe that the University of Michigan (and the other partner libraries) and Google are changing the world for the better. Four years from now, all seven million volumes in the University of Michigan Libraries will have been digitized – the largest such library digitization project in history. Google Book Search and our own MBooks collection already provide full-text access to well over a hundred thousand public domain works, and make it possible to search for keywords and phrases within hundreds of thousands more in-copyright materials. This access is altering the way that we do research. At least as important, the project is itself an experiment in the provision and use of digitized print collections in large research libraries. I do not see how we can discover the best ways to use such collections without experiments at this scale. In sum, I believe that our library is doing exactly what it should do in the best interests of scholarship and our users, now and in the future.

So I’m puzzled when people ask, “How could serious libraries be doing this? How could they abdicate their responsibilities as custodians of the world’s knowledge by offering their collections up as a sacrifice on the altar of corporate power? Why don’t they join the virtuous ranks of the Open Content Alliance partners, who pay thousands of dollars to digitize books at a rate of tens of thousands of volumes a year?” It seems like those who ask such questions have little appreciation of what Michigan and the other Google partners are actually up to.

Google is on pace to scan over 7 million volumes from U-M libraries in six years at no cost to the University. As part of our arrangement with Google, they give us copies of all the digital files, and we can keep them forever. Our only financial outlay is for storage and the cost of providing library services to our users. Anyone who searches U-M’s library catalog, Mirlyn, can access the scanned files via our MBooks interface. That’s right, anyone. (Copyright law constrains what we can display in full text, and what we can offer only for searching, but we share as much as we can consistent with prudent interpretations of the law.) For an example of an MBook, take a look at The Acquisitive Society by R. H. Tawney.

In a recent New York Times article about mass digitization projects, Brewster Kahle was quoted as saying: “Scanning the great libraries is a wonderful idea, but if only one corporation controls access to this digital collection, we’ll have handed too much control to a private entity.”

I agree with him. I’m an economist with a particular interest in public goods, which is how I came to be involved with libraries in the first place. Libraries have a long and honorable history of preserving information and making it accessible. Moreover, even at their best, for-profit institutions cannot be expected to serve general public interests when those interests run counter to those of their shareholders. So I would be distressed if a single corporation controlled access to the collections of the great academic libraries, just as I find it troubling, on a smaller scale, that a handful of publishers control access to much of the current scientific literature.

But Google has no such control. After Google scans a book, they return the book to the library (like any other user), and they give us a copy of the digital file. Google is not the only entity controlling access to the collection – the University of Michigan and other partner libraries control access as well. Except we don’t think of it as controlling access so much as providing it.

Since 2005, Siva Vaidhyanathan has been making and refining the argument that libraries should be digitizing their collections independently, without corporate financing or participation, and that those who don’t are failing to uphold their responsibility to the public. “Libraries should not be relinquishing their core duties to private corporations for the sake of expediency.”

“Expediency” is a bit of a dirty word. Vaidhyanathan’s phrase suggests that good people don’t do things simply because they are “expedient.” But I view large-scale digitization as expeditious. We have a generation of students who will not find valuable scholarly works unless they can find them electronically. At the rate that OCA is digitizing things (and I say the more the merrier and the faster the better) that generation will be dandling great-grandchildren on its knees before these great collections can be found electronically. At Michigan, the entire collection of bound print will be searchable, by anyone in the world, about when children born today start kindergarten.

Google brings to us extraordinary technical and computing power and tremendous financial resources. The libraries bring an understanding of our collections and our users, and a profound commitment to public access. We are not relinquishing our duties in the name of expediency; we are working with a capable partner to create a far more useful resource than we could create on our own. (Would I prefer that a charitable foundation would support this work on the same schedule as Google, and make everything available to everyone, subject only to copyright restrictions? You bet. I would prefer it even more if that foundation would buy out all of the rights holders for all out of print works. Can someone tell me the name of the foundation, please? In the meantime, it seems to me that being in bed with Google is way better than sleeping alone.)

It’s true that the digitized files from Google’s scans are often far from perfect. Historian Robert Townsend, Paul Duguid, and others have raised technical questions about the quality of Google’s scans, and their appropriateness for preservation. Those are important questions, and there is a great deal of work to be done, both by Google and by the libraries, before we consistently achieve the level of quality and bibliographic reliability that are essential to successful scholarly practice. I will discuss some of the specific steps we are taking to address quality in a future post, but for now I will just say that the solution of these problems will require the serious engagement of academic libraries, and that the visibility of the problems is essential to their solution. Mass digitization on the scale of the Google library project was unimaginable five years ago, and it comes as no surprise to me that we are learning a lot as we go along. We are learning in the tradition of serious academic work, by putting our ideas and our resources in the public eye, where they can be seen, and criticized, and improved.

I am very glad that Paul has entered this conversation. His commitment to revolutionizing the role of the public university in the information ecosystem is inspiring.

I am, however, troubled by this claim: "We have a generation of students who will not find valuable scholarly works unless they can find them electronically."

Is this true? It's certainly no more true of my students than it was of my peers in the 1980s. Where is the evidence?

Sadly, Paul does not actually address the real-world consequences of the Google project:

• He dismisses serious search problems as temporary, yet fails to confront the problem that Google cannot and will not explain the factors and standards that put one book above another in search results.

• As users discover poorly-scanned files on the Google index, how can they alert Google to the problem? Why does nothing in the contract between Michigan and Google include quality-control standards or methods?

• How do we know this index will last for decades? What image file system is Google using and what ensures its preservation?

• How is the "library copy," that electronic file that Michigan and others receive as payment for allowing Google to exploit their treasures, NOT an audacious infringement of copyright? It violates both the copyright holder's right to copy and right to distribute. Doesn't a university library have an obligation to explain this?

• What about user confidentiality? Why have universities failed to take a stand on this issue?

I look forward to responses from Paul and others. I have been waiting two years for them, of course. And all I get is the silence created by non-disclosure agreements.

BTW, should public university librarians be signing non-disclosure agreements about their core services?


Comments (10)

Paul doesn't make it clear whether Michigan is receiving the OCR (optical character recognition) results from Google, or just images of the pages: he just says "digital files." This is a crucial distinction. (See my recent post on why this is important.) If Michigan did not negotiate to get the OCR output, including structural hints such as font size or word bounding boxes, then they made a mistake, and ceded a tremendous amount to Google.
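
To make the distinction in the comment above concrete, here is a minimal sketch, purely for illustration. It assumes word-level OCR delivered in an hOCR-style HTML file, where each recognized word is a span whose title attribute carries a bounding box (the Tesseract convention; nothing in this thread says this is the format Google actually delivers). If a library receives only page images or bare text, this word-level structure has to be recreated from scratch.

    from html.parser import HTMLParser

    class WordBoxParser(HTMLParser):
        """Collect (word, bounding box) pairs from an hOCR-style page."""

        def __init__(self):
            super().__init__()
            self.words = []   # list of (text, (x0, y0, x1, y1))
            self._box = None  # bbox of the word span we are currently inside

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "span" and "ocrx_word" in attrs.get("class", ""):
                # The title attribute looks like: "bbox 393 532 441 557; x_wconf 95"
                for part in attrs.get("title", "").split(";"):
                    part = part.strip()
                    if part.startswith("bbox "):
                        self._box = tuple(int(n) for n in part.split()[1:5])

        def handle_data(self, data):
            if self._box is not None and data.strip():
                self.words.append((data.strip(), self._box))

        def handle_endtag(self, tag):
            if tag == "span":
                self._box = None

    # Hypothetical one-word example of an hOCR word span:
    sample = '<span class="ocrx_word" title="bbox 393 532 441 557; x_wconf 95">library</span>'
    parser = WordBoxParser()
    parser.feed(sample)
    print(parser.words)  # [('library', (393, 532, 441, 557))]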

Karen Coyle on November 6, 2007 9:44 AM:

"Why does nothing in the contract between Michigan and Google include quality-control standards or methods?"

Section 2.4 of UMich contract:
"U of M will engage in ongoing review ... of the resulting digital files, and shall inform Google of files that do not meet benchmarking guidelines... Should U of M encounter a persistent failure by Google to meet these guidelines... U of M may stop new work until this failure can be rectified."

"Paul doesn't make it clear whether Michigan is receiving the OCR (optical character recognition) results from Google, or just images of the pages."

section 2.5 of UMich contract:
"... U of M Digital Copy will consist of a set of image and OCR files..."

When you retrieve a book in their catalog, you have the option of looking at the image of the page or the text.

The University of California also negotiated to receive a copy of the "map" that connects a keyword in the OCR file to the actual place on the page, so that searched terms can be highlighted.
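
A minimal sketch of how such a coordinate map could be used, with invented data (the actual structure of the file UC receives is not described here): each OCR word is paired with its rectangle on the page image, so a viewer can look up the rectangles that match a search term and draw highlights over the scan.

    def boxes_for_term(word_map, term):
        """Return the page-image rectangles whose OCR text matches the term."""
        term = term.lower()
        return [box for word, box in word_map
                if word.lower().strip('.,;:"()') == term]

    # Hypothetical word map: (OCR text, (x0, y0, x1, y1)) on the page image.
    word_map = [
        ("The", (100, 200, 140, 225)),
        ("Acquisitive", (150, 200, 290, 225)),
        ("Society", (300, 200, 390, 225)),
    ]

    for x0, y0, x1, y1 in boxes_for_term(word_map, "society"):
        # A page viewer would draw this rectangle over the scanned image,
        # e.g. as an absolutely positioned overlay on top of the page.
        print(f"highlight rectangle: ({x0}, {y0}) to ({x1}, {y1})")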

Many of the contracts are available at:
http://books.google.com/googlebooks/partners.html

Paul Courant on November 6, 2007 1:56 PM:

I am traveling, and can only produce brief answers to Siva's questions here. Later this week I'll get to most of the issues in more detail on my own blog at paulcourant.net.

Let me start by reminding everyone that I do not speak for Google, nor am I engaged in generalized cheerleading on Google's behalf. Rather, I am arguing that the University of Michigan Library is doing a Good Thing in its digitization project with Google.

The bullets below are Siva's, the paragraphs without bullets are mine.

• He dismisses serious search problems as temporary, yet fails to confront the problem that Google cannot and will not explain the factors and standards that put one book above another in search results.

Actually, I don't mention search at all in my post. Nor (see above) do I speak for Google.

• As users discover poorly-scanned files on the Google index, how can they alert Google to the problem? Why does nothing in the contract between Michigan and Google include quality-control standards or methods?

Please see Michigan's agreement with Google, clause 2.4, the relevant part of which reads: "U of M will engage in ongoing review (through sampling) of the resulting digital files, and shall inform Google of files that do not meet benchmarking guidelines or do not comply with the agreed-upon format. Should U of M encounter a persistent failure by Google to meet these guidelines or supply the agreed-upon format, U of M may stop new work until this failure can be rectified." The agreement is online at: http://www.lib.umich.edu/mdp/umgooglecooperativeagreement.html

• How do we know this index will last for decades? What image file system is Google using and what ensures its preservation?

I believe that in my post I said that the UM library (like other partner libraries) is also storing and preserving the files that Google scans. Maybe Google won’t last for decades, but the libraries will, and the libraries are pretty serious about preservation.

• How is the "library copy," that electronic file that Michigan and others receive as payment for allowing Google to exploit their treasures, NOT an audacious infringement of copyright? It violates both the copyright holder's right to copy and right to distribute. Doesn't a university library have an obligation to explain this?

It's hard to get past the first premise of this set of questions. One literal answer would be to say that there is no such electronic file, because Google is not obtaining anything by means of exploitation.

I must say that I am troubled that the author of a very sensible book about copyright is so enthusiastic about trashing Google that he is willing to give up on the uses, notably scholarly uses, that are permitted in the higher-numbered sections of the Copyright Act. As my institution's copyright lawyer says: "FAIR USE, it's the law." And my institution believes that when we have Google digitize our holdings we do so under the law and in order to make uses that are not only lawful, but that are completely consistent with the undergirding purpose of copyright law.

Siva is much younger than I am, so he may be willing to wait decades before finding out how scholarship and society can benefit from digitized and searchable collections from some of the world's great libraries. For myself, I'd like to unleash my colleagues and our students on this remarkable resource while I'm still around to see what happens.

Finally, re Ryan Shaw's post, yes, we receive the OCR.

Siva Vaidhyanathan on November 6, 2007 2:17 PM:

Thank you, Paul.

When I raised quality issues, I was asking how users (people like me and you) might alert Google about bad scans and images. And the clause you cite in the contract only pertains to a persistent failure to meet some vague standards of quality. What about particular scans?

I had presumed you were raising search problems when you cited Paul Duguid's work. I am sorry I confused that with the issue of poor scan quality.

Just to push the copyright issue a bit farther:

How, exactly, is the process of having a corporation scan entire copyright works, then give such scanned files back to a library as payment for access to the copyright works, a fair use?

Can you cite one precedent, or a clause of Sec. 107, that makes that a fair use?

That's just one part of how the Google fair use defense falls apart.

BTW, having Google scan your works is exactly the problem! It moves the issue from Sec. 108 to Sec. 107 -- from the stable to the vague. But that's a longer discussion.

If you are interested in my copyright analysis of the Google scanning issue, please see my article in the UC Davis Law Review:

http://lawreview.law.ucdavis.edu/articles/Vol40/vol40_no3.html

Peter Brantley on November 6, 2007 5:01 PM:

Fwiw, my understanding is that Michigan does not receive OCR with text coordinates. The Univ. of California does receive this data, perhaps uniquely; I believe of all the public contracts, these data are only mentioned in UC's.

bowerbird on November 6, 2007 8:42 PM:

> Google cannot and will not
> explain the factors and standards
> that put one book above another
> in search results.

so what? who cares? no big deal.
we'll make our own search engine
to pore over the books, thank you.


> As users discover
> poorly-scanned files
> on the Google index,
> how can they alert Google
> to the problem?

there's a form on the bottom of
each page to report feedback...


> Why does nothing in the contract
> between Michigan and Google
> include quality-control
> standards or methods?

look. you will find something there.

the _real_ question here is how can
google -- _and_ the o.c.a. -- keep on
scanning book after book after book
without seeming to learn _anything_
about how to execute quality-control?

ok, maybe for the first 2 million books,
it was a "learning experience" for 'em
-- although i coulda helped 'em a lot --
but by _now_ you'd think they'd get it.

but nope. the books continue to be bad.
not wonderful, like we could hope for.
not even good. bad. actually, _awful_.
sorry. go back and do them all again...


> How do we know this index
> will last for decades?

because duct tape works wonders...
(that is, what a stupid question.)


> What image file system is Google
> using and what ensures its preservation?

constant migration. just like the birds.

-bowerbird

p.s. ryan, just go look at their site.
it's a university. do your homework.

Ryan Shaw on November 8, 2007 7:07 PM:

The contracts do mention OCR, but as I suspected they do not specify what the OCR output should consist of, because the libraries were thinking only of access to the digital files (i.e. people reading them), not computing on those files (i.e. machines processing them). Apparently (according to Peter) only UC had the foresight to think about that (and you can be sure that Google was thinking about it). So I stand by my assertion that the libraries that did not negotiate for the full OCR output made a mistake, and ceded a tremendous amount to Google.

But Ryan, UC had the advantage of seeing some of the other contracts that had gone before and seeing what some of the original partners had received at the time they struck their deal. One would hope they'd know to be more specific than the first partners.

siva said:
> When I raised quality issues,
> I was asking how users
> (people like me and you)
> might alert Google about
> bad scans and images.

oh please, siva. do your homework.
google has had a "provide feedback"
link on each page from the start,
for people to report bad pages...

it's true you now must click on the
"basic html mode" link to get to it,
but don't act like it's impossible.

and ryan, you are backpedaling.
and doing none too well at it,
because that _coordinate_data_
is next to worthless in reality,
for quite a number of reasons...

(so peter, for you to trumpet it,
as if it actually _were_ useful,
is rather embarrassing to you.)

meanwhile, paul (courant), the
_real_ issues you need to address
have little -- perhaps nothing --
to do with google. they are about
what's happening in your own shop.

ready to discuss those issues?

-bowerbird

Wayne Martin on November 26, 2007 8:22 PM:

About reporting bad scans: there is a feedback link associated with each book, at least those that can be downloaded. The feedback page is designed to allow a user to specify problems with a scan, as well as provide space to paste a link that points to the badly scanned page(s).


