

Official Google Blog:

We knew the web was big...
7/25/2008 10:12:00 AM

We've known it for a long time: the web is big. The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. Recently, even our search engineers stopped in awe about just how big the web is these days -- when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

How do we find all those pages? We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.
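
The crawl described above (start from well-connected pages, follow links outward, then discard exact duplicates) can be sketched in a few lines. This is only an illustration under stated assumptions, not Google's crawler: the `TOY_WEB` dictionary, the example URLs, and the `crawl` function are hypothetical, and duplicates are detected here by hashing page content, which is one simple way to catch "multiple URLs with exactly the same content."

```python
from collections import deque
import hashlib

# Hypothetical toy web (an assumption for illustration):
# URL -> (page content, outgoing links).
TOY_WEB = {
    "a.example/": ("home", ["a.example/about", "a.example/index.html"]),
    "a.example/index.html": ("home", ["a.example/about"]),  # same content as "a.example/"
    "a.example/about": ("about us", ["b.example/"]),
    "b.example/": ("another site", ["a.example/"]),
}

def crawl(seeds, web):
    """Follow links breadth-first from the seed URLs.

    Returns (seen_urls, unique_pages): every URL discovered, and the
    first URL found for each distinct page content.
    """
    seen_urls = set(seeds)   # every URL we have queued so far
    seen_content = set()     # hashes of page content already recorded
    unique_pages = []
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # dead link in the toy data
        content, links = web[url]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen_content:
            seen_content.add(digest)
            unique_pages.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return seen_urls, unique_pages

urls, pages = crawl(["a.example/"], TOY_WEB)
```

In this toy data the crawl discovers four URLs but only three distinct pages, because `a.example/index.html` repeats the content of `a.example/`. That is the gap between "individual links" and "unique web pages" that the post describes.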

So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite -- for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We're not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what's a useful page, and there is no exact answer.
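
The calendar example is easy to reproduce. The sketch below assumes a hypothetical URL scheme (`cal.example/YYYY-MM-DD`) in which every page links to the next day; the `max_pages` budget is an assumption standing in for whatever rule a real crawler uses to decide that further pages are not worth fetching.

```python
from datetime import date, timedelta

def calendar_page(day):
    """Return (url, next_url) for a hypothetical calendar page."""
    url = f"cal.example/{day.isoformat()}"
    next_url = f"cal.example/{(day + timedelta(days=1)).isoformat()}"
    return url, next_url

def follow_next_links(start, max_pages):
    """Follow the "next day" link until the page budget is used up.

    Without the budget this loop would never end, because every page
    produces another "new" page.
    """
    visited = []
    day = start
    while len(visited) < max_pages:
        url, _next_url = calendar_page(day)
        visited.append(url)
        day += timedelta(days=1)  # equivalent to following next_url
    return visited

visited = follow_next_links(date(2008, 7, 25), max_pages=5)
```

Raising `max_pages` only produces more pages of the same kind, which is why the count of "pages on the web" depends on where one chooses to stop.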

We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers. But we're proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world's data.

To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections.
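
For readers who have not seen what "computing PageRank" on a link graph involves, here is a minimal sketch of the textbook power-iteration form of the algorithm. It is not Google's production system: the four-page `TOY_LINKS` graph, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of the total rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked to by A, B, and D.
TOY_LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(TOY_LINKS)
```

On this toy graph the page with the most inbound links, `C`, ends up with the highest score, and `D`, which nothing links to, ends up with the lowest. Each pass touches every link once, which gives a sense of why re-processing a graph of a trillion URLs several times per day requires distributed infrastructure rather than a single workstation.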

As you can see, our distributed infrastructure allows applications to efficiently traverse a link graph with many trillions of connections, or quickly sort petabytes of data, just to prepare to answer the most important question: your next Google search.

This is the sort of description and explanation that Google should have offered years ago. During my last visit to Mountain View, Google folks were very clear with me that they are pushing for better transparency in all areas of their business. Of course, not everything could or should be public. But the basic contours of what Google does should certainly be out there for the rest of us to consider. It's in the company's best interest to squelch conspiracy theories and encourage trust among users.

Expect more of this in the near future.

Saul Hansell of the NYTimes sees this move and more as an important first step in a better, more ethical Web environment:

... Google, which gathers a lot of data and to some is increasingly scary, has decided to shine a spotlight on one way it is using all that information. According to a blog post on Wednesday, Google will start explaining how it customizes the search results it displays. Google uses its best guess about where you are and sometimes the history of what you searched for in an attempt to provide more relevant results.

Now a small note in the upper-right-hand corner of the results pages will give some clue that this is happening. In one example, the note reads “Customized for the San Francisco metro area.” The text may also have a link to a page that has additional information. In the example of this sort of page, Google showed the Internet Protocol address it used to determine that the search came from San Francisco. It also identified the previous search terms it was taking into account.

In another nice twist, that page has a link to the search results that would have been shown if Google didn’t take into account the information about the user. That way people can make some choices about how much information to give Google.

Google doesn’t allow users to tell it to disregard their location in its search. But Google won’t use its records of what you searched for in the past if you opt out of its Web History feature, which keeps a record of queries that a user can review.

All this is a great step toward helping people understand what Google is doing with their information. But it also raises questions too. What are the other places Google is using information about users? This disclosure relates to search results, not the advertising on search pages. Google has also started using data about users for advertising, and its acquisition of DoubleClick makes many in the advertising world think it will start using even more data.

It would be great if Google, and other Web companies, offered similar disclosure for data-based advertising. I’d like to be able to see what data was used in deciding to show an ad to me and who will get what information if I click on it. Yes, some of this may be seen as “proprietary” information, but to my mind a company that wants to use my “proprietary” history of Web surfing needs to come clean about what it is doing with that data.

When we can flip on that light switch, for Google and the rest of the online data grabbers, then we will be able to see just how scary the Internet data monster really is. ...

UPDATE: Google has announced:


... You'll see these new messages whenever your search results have been customized based on one or more of the following types of information:

* Location. By default, we identify your approximate city location based on your computer's IP address and use it to customize your search results. If you'd like Google to use a different location, you can sign into or create a Google Account and provide a city or street address. Your specific location will be used not only for customizing search results, but also to improve your experience in Google Maps and other Google products.

* Recent searches. We take into account whether a particular query followed on the heels of another query. Because recent search activity provides such valuable context for understanding the meaning behind your searches, we use it to customize your results whenever possible, regardless of whether you're signed in or signed out. In order to customize your results and show you the customization details, we keep the most recent query on your browser for a limited time. After that, the information is removed from your browser and disappears immediately if you close your browser.

* Web History. If you're signed in and have Web History enabled, we customize your search results based on what you've searched for in the past on Google, and what web sites you've visited. One important note about Web History: it belongs to you and you have complete control over it. You can remove specific items or pause the service at any time. And if there's a particular search that you'd rather not have personalized based on your Web History, you can also just temporarily sign out of your Google Account.

This new feature doesn't change anything at all about how you search on Google and the results you get; it just gives you more of a behind-the-scenes look at how we customize your search experience. We consider this to be an important step in our commitment to transparency, and we hope you find it informative and useful.
...


A book in progress by

Siva Vaidhyanathan


This blog, the result of a collaboration between myself and the Institute for the Future of the Book, is dedicated to exploring the process of writing a critical interpretation of the actions and intentions behind the cultural behemoth that is Google, Inc. The book will answer three key questions: What does the world look like through the lens of Google? How is Google's ubiquity affecting the production and dissemination of knowledge? And how has the corporation altered the rules and practices that govern other companies, institutions, and states?

» Send links, questions and ideas:
siva [at] googlizationofeverything [dot] com

» To reach me for a press query, please write to SIVAMEDIA ut POBOX dut COM

» To reach me for a speaking invitation, please write to SIVASPEAK ut POBOX dut COM

