


BlogSpot: Can Search Engines Count?

Dr Paul Rayson

About the author

Dr Paul Rayson is the director of the Lancaster University Centre for Computer Corpus Research on Language (UCREL). His recent projects cover protecting children in online social networks (Isis), spatial humanities, studying language change in 20th-century British English, and extending a semantic annotation tool for research on metaphor. He has published over 80 papers in corpus-based natural language processing, is Production Editor of the Corpora journal and co-editor of the Routledge series of frequency dictionaries. He co-organised the first four international Corpus Linguistics conferences (Lancaster, 2001-3 & Birmingham, 2005-7).

In the last ten years, researchers who analyse natural language (English or otherwise) have begun to download and collect massive numbers of web pages. Dr Paul Rayson discusses one of the possible problems with collecting natural language data this way, that of unreliable word counts from web data:

Prior to the emergence of the web, the main ways to collect enough examples of natural language were to transcribe it yourself, to scan and OCR printed texts, or to find a friendly publisher who could provide machine-readable versions of publications.

Many different types of research can be carried out using this web data. In corpus linguistics, researchers collect text from the web to study online varieties of English or other languages: for example, to investigate whether Twitter or blogs have a distinctive style, to find and describe examples of linguistic features that do not occur frequently enough in smaller collections built by hand, or to study language samples where other collections are not readily available.

In computational linguistics, researchers use the text to build language models for training Natural Language Processing tools, which can then automatically analyse language at a variety of levels, e.g. grammar or semantics. Extremely large collections of language are used by dictionary publishers as the basis for updating dictionary entries and examples; the Oxford English Corpus, for instance, contains over two billion words of real language examples. Text analytics companies are also mining web data for opinions about products, services or even political parties.

Having said that, there are certain areas where building a large enough corpus even from the web is not feasible, and it is tempting to use estimated result counts derived from search engines instead of downloading all the data and counting words yourself. This led me to wonder how good these estimates were for research purposes.

Here's a simple experiment:

  1. Choose a word and then type it into the search box of your favourite search engines.
  2. Compare the estimated result counts and see if they match up. These are usually shown at the top of the search engine results page, as on Google.
  3. Click through a few times to see if the counts stay the same once you've reached the 10th page of results.
  4. Come back tomorrow at the same time to see if the numbers are still the same.
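The steps above can also be scripted. Below is a minimal sketch in Python of how you might fetch a results page and pull out the estimated count automatically. The URL template, the "results" phrasing and the regular expression are my assumptions rather than anything guaranteed by the search engines; result pages change frequently, and automated scraping may breach an engine's terms of service, so an official search API is the safer route for real experiments.

    import re
    from urllib.parse import quote
    from urllib.request import Request, urlopen

    # Hypothetical helper: fetch a results page and extract the estimated count.
    # The URL template and the "N results" phrasing are assumptions and will
    # vary by engine, locale and page layout.
    def estimated_count(query, url_template="https://www.bing.com/search?q={}"):
        req = Request(url_template.format(quote(query)),
                      headers={"User-Agent": "Mozilla/5.0"})
        html = urlopen(req).read().decode("utf-8", errors="ignore")
        match = re.search(r"([\d,]{4,})\s+results", html)  # e.g. "251,000,000 results"
        return int(match.group(1).replace(",", "")) if match else None

    for term in ("Lancaster", "Manchester"):
        print(term, estimated_count(term))

Repeating the same queries the next day, as in the last step, then only needs the same script run again from a scheduled job.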

For example, on 24th September 2012, I searched for "Lancaster" on Google and it estimated 34,600,000 results. Bing told me there were 251,000,000 hits. After I'd clicked through 10 pages of Google results, it told me that its estimate was now 196,000,000 results. This might make us think that the estimated result counts are unreliable, or at the very least that there are many different ways of estimating the numbers.

You can also compare results within the same search engine to see if the initial estimated count makes any sense. For example, if you search for "Manchester" on Bing, it returns an estimated 591,000,000 hits. With Manchester being a much larger city, you might expect many more hits than for "Lancaster", yet the Lancaster figure is just under half the Manchester one. Of course, these results are not just for the place names but also for the names of drinks, pubs, people and football teams.
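To make that comparison concrete, here is the same check as a couple of lines of Python, using only the figures quoted above:

    # Bing estimates from September 2012, as quoted in the text.
    counts = {"Lancaster": 251_000_000, "Manchester": 591_000_000}
    ratio = counts["Manchester"] / counts["Lancaster"]
    print(f"Manchester/Lancaster ratio: {ratio:.2f}")  # roughly 2.35

Whether a factor of roughly 2.4 is plausible for two cities of such different sizes is exactly the kind of question the raw estimates cannot answer on their own.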

This possible inaccuracy may not seem like much of a problem, since general search engine users mostly care about finding a useful website on the first page of hits. But imagine you are using these estimated counts to build a computer program to learn facts about the world, collect opinions about a product or analyse language in some way. Then you'd want the result counts to be as accurate as possible. Commercial search engines employ a range of techniques to estimate the counts, so it is important that researchers understand the implications and how to minimise this instability.

Two undergraduate students from the School of Computing and Communications at Lancaster University (Oliver Charles and Ian Auty) recently carried out projects to explore the stability of these search engine result counts. They built software to check the estimated result counts from three search engines for a few thousand different words every day for six months, and plotted the results to look for trends and problems. For example, the graph below shows how erratic Bing's estimated result counts were during January 2012, when most results for a group of words dropped from around 100,000,000 to fewer than 100.

[Graph: Bing's estimated result counts for a group of words, January 2012]
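The general approach can be sketched as follows. This is my own illustration rather than the students' actual software; it reuses the hypothetical estimated_count() helper from the earlier sketch, and wordlist.txt is assumed to contain one word per line.

    import csv
    import datetime
    import time

    def log_counts_once(words, path="counts.csv"):
        """Append today's estimated count for each word to a CSV file."""
        today = datetime.date.today().isoformat()
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            for word in words:
                writer.writerow([today, word, estimated_count(word)])
                time.sleep(1)  # space out the requests politely

    words = [w.strip() for w in open("wordlist.txt") if w.strip()]
    log_counts_once(words)  # run once a day, e.g. from a scheduled job

Run daily over six months, the resulting file gives one time series per word, which can then be plotted to spot sudden jumps like the one in the graph above.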

We also devised a set of guidelines on how future projects can ensure they are using accurate frequency count data from search engines. The experiments formed the final-year undergraduate projects for Oliver and Ian, but were also presented in April 2012 at the 'Web as Corpus' workshop at the World Wide Web conference in Lyon, France (WWW2012). To view the report and read the guidelines, please go to sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf#page=23.

Wed 26 September 2012