BlogSpot: Can Search Engines Count?
In the last ten years, researchers who analyse natural language (English or otherwise) have begun to download and collect massive numbers of web pages. Dr Paul Rayson discusses one possible problem with collecting natural language data this way, namely unreliable word counts from web data:
Prior to the emergence of the web, the main way to collect enough examples of natural language was to transcribe it yourself, scan and OCR printed texts, or find a friendly publisher who could provide machine-readable versions of publications.
Many different types of research can be carried out using this web data. In corpus linguistics, researchers collect text from the web to study online varieties of English or other languages (for example, whether Twitter or blogs have a specific style), to find and describe examples of linguistic features that may not occur frequently enough in smaller collections built by hand, or to study language samples where other collections are not readily available.
In computational linguistics, researchers use the text to build language models for training Natural Language Processing tools, which can then automatically analyse language at a variety of levels, e.g. grammar or semantics. Extremely large collections of language are used by dictionary publishers as the basis for updating dictionary entries and examples, e.g. the Oxford English Corpus contains over two billion words of real language examples. Text analytics companies are also mining web data for opinions about products, services or even political parties.
Having said that, there are certain areas where building a large enough corpus even from the web is not feasible, and it is tempting to use estimated result counts derived from search engines instead of downloading all the data and counting words yourself. This led me to wonder how good these estimates were for research purposes.
Here's a simple experiment.
- Choose a word and then type it into the search box of your favourite search engines.
- Compare the estimated result counts and see if they match up. These are usually shown at the top of the search engine results page.
- Click through a few times in order to see if the counts stay the same once you've reached the 10th page of results.
- Come back tomorrow at the same time to see if the numbers are still the same then.
For example, on 24th September 2012, I searched for "Lancaster" on Google and it estimated 34,600,000 results. Bing told me there were 251,000,000 hits. After I'd clicked through 10 pages of Google results, it told me that its estimate was 196,000,000 results. This might make us think that the estimated result counts are unreliable, or at the very least, there are many different ways of estimating the numbers.
You can also compare results within the same search engine to see if the initial estimated count makes any sense. For example, if you search for "Manchester" on Bing, it returns an estimated 591,000,000 hits. Manchester is a much larger city, so we might expect its count to exceed that for "Lancaster" by a far wider margin, yet "Lancaster" had just under half this number. Of course, these results are not just for place names but also names of drinks, pubs, people and football teams.
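To put the figures quoted above side by side, here is a minimal Python sketch. It uses only the four estimated counts given in this post; the dictionary layout and the `ratio` helper are illustrative assumptions and do not call any search engine API.

```python
# Estimated result counts quoted in this post (searches made on 24 September 2012).
# The keys are illustrative labels, not identifiers from any search engine API.
counts = {
    ("Lancaster", "Google, first page"): 34_600_000,
    ("Lancaster", "Google, tenth page"): 196_000_000,
    ("Lancaster", "Bing"): 251_000_000,
    ("Manchester", "Bing"): 591_000_000,
}


def ratio(a, b):
    """Return the estimated count for key a divided by the estimated count for key b."""
    return counts[a] / counts[b]


comparisons = [
    # Same word, different search engines.
    (("Lancaster", "Bing"), ("Lancaster", "Google, first page")),
    # Same word, same engine, after clicking through to the tenth page.
    (("Lancaster", "Google, tenth page"), ("Lancaster", "Google, first page")),
    # Different words, same engine.
    (("Lancaster", "Bing"), ("Manchester", "Bing")),
]

for a, b in comparisons:
    print(f"{a[0]} ({a[1]}) / {b[0]} ({b[1]}): {ratio(a, b):.2f}")
```

The ratios make the disagreement concrete: estimates for the same word differ by several times over, not by a few percent.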
This possible inaccuracy may not seem much of a problem, since general search engine users mostly care about finding a useful website on the first page of hits. But imagine you are using these estimated counts to build a computer program to learn facts about the world, collect opinions about a product or analyse language in some way. Then you'd want these result counts to be as accurate as possible. Commercial search engines employ a range of techniques to estimate the counts, and so it is important that researchers understand the implications and how to minimize this instability.
Two undergraduate students from the School of Computing and Communications at Lancaster University, Oliver Charles and Ian Auty, recently carried out projects to explore the stability of these search engine result counts. They built software to check the estimated result counts from three search engines for a few thousand different words every day for six months and plotted the results to see if there were any trends and problems. For example, one plot shows how erratic Bing's estimated result counts were during January 2012, when most results for a group of words dropped from around 100,000,000 to fewer than 100.
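A check for this kind of instability can be sketched in a few lines of Python. This is not the students' software: the function name `flag_unstable_days`, the `max_factor` threshold and the daily values below are all hypothetical, chosen only to mimic the drop from around 100,000,000 to under 100 described above. The sketch flags any day whose count differs from the median of the series by more than a chosen factor.

```python
from statistics import median


def flag_unstable_days(daily_counts, max_factor=10.0):
    """Return the indices of days whose count is more than max_factor
    times larger or smaller than the median of the whole series."""
    baseline = median(daily_counts)
    flagged = []
    for day, count in enumerate(daily_counts):
        # A zero (or negative) count or baseline cannot be compared as a
        # ratio, so such days are flagged outright.
        if count <= 0 or baseline <= 0:
            flagged.append(day)
            continue
        factor = max(count, baseline) / min(count, baseline)
        if factor > max_factor:
            flagged.append(day)
    return flagged


# Hypothetical daily estimates for one word (invented for illustration,
# not measurements from the project).
series = [100_000_000, 101_000_000, 99_500_000, 87, 92, 100_200_000, 99_800_000]

print(flag_unstable_days(series))
```

The median is used as the baseline because a handful of collapsed days barely moves it, whereas they would drag a mean down. If most of the series had collapsed, the median would follow the collapsed values and the healthy days would be flagged instead.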
We also devised a set of guidelines on how future projects can ensure they are using accurate frequency count data from search engines. The experiments formed the final-year undergraduate projects for Oliver and Ian, but were also presented in April 2012 at the 'Web as corpus' workshop at the World Wide Web conference in Lyon, France (WWW2012). To view the report and read the guidelines, please go to sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf#page=23.
Wed 26 September 2012
Associated Links
- 7th Web as corpus workshop (WAC-7)
- Can Google count? Estimating search engine result consistency - Paper by Paul Rayson, Oliver Charles and Ian Auty (page 23)
- Dr Paul Rayson - Paul Rayson's research home page
- UCREL research centre at Lancaster
- World Wide Web 2012 conference