Introduction to Google search
Introduction to Google Search
Abstract
Search engines play an important role in the way people browse the internet. They are required to be fast, and produce good quality results for the user. However, the size of the internet poses a particular problem, as trawling through millions of sites and matching them to a query is a time and resource consuming process.
Google’s architecture aims to provide a solution to this problem, by performing efficient crawls of the internet, and indexing documents in such a way that they are easy to retrieve later on. Google also uses what they call a ‘cluster architecture’ which features geographically separated ‘clusters’ of many computers. Each query is handled by one cluster, which contains replicated servers that can process a query in parallel by dividing the index into smaller, easier to manage ‘index shards’, effectively speeding up the processing, and also providing a level of fault tolerance.
This solution is also cost effective, as Google uses cheaper computers than typical server-class machines, but opts to use more in order to do its processing in parallel effectively and also increasing the reliability of the system (the failure of one machine is much less significant when there are a large number of them), resulting in a higher price/performance ratio.
Introduction
Google search is currently the most frequently utilised search engine on the internet, originally developed by Sergey Brin and Lawrence Page in the late 1990s. With a growing number of web resources becoming available search engines have played a more integral role in the way users experience the internet.
The Google search engine was created in order to address several problems existing systems faced, one of which was the returning of low-quality results. These engines could also be exploited by advertisers in order to make their site appear at the top of a search list while not necessarily being a ‘high-quality’ site. Google achieved this by developing their own page ranking algorithm, which aimed to measure the importance or quality of a particular page. Google also aimed to be a scalable and reliable service to meet the needs of an expanding Web, through the use of several means including the efficient use of storage space for page indexes, optimized data structures for efficient access of these indexes and a cluster architecture for serving queries.
Crawling and Indexing the Web
<imagePlaceholder - figure 1>
It is important that a search engine knows about everything on the Web, therefore efficient crawling of the Web is a necessity. Google uses a distributed crawling system which obtains URLs from a dedicated URL server. Several crawlers run at once, each keeping about 300 connections open at any given time. Since DNS lookup can be quite costly, each crawler keeps its own DNS cache so that constant DNS lookups do not need to be performed every time a document is crawled.
Pages obtained are then sent to the store server which compress and put them into a repository. Indexing of the pages is a rather complex process, involving the indexer and sorter (figure 1). Essentially, the documents are parsed, word occurrences in the document are converted into wordIDs using the lexicon (an in-memory hash table) and translated into a ‘hit list’ (which contains information about the word occurrence including features such as position, size, capitalisation etc.), and a forward index (doc -> wordID) is generated by the indexer and stored in the forward barrels. The sorter then generates a backward index (wordID -> doc) and stores them in inverted barrels.
The forward index contains a list of occurrences of word IDs in a given document with their hit lists, while the inverted index contains a list of document IDs that contain a given word and its hit lists.
Google's Cluster Architecture
Since much computation is needed per query, and often many queries require servicing at any given moment in time, the Google search engine requires immense computational power. Such a system comes with many issues, which include cost, reliability and speed.
In order to support the high level traffic of a search engine, Google’s cluster architecture consists of a very large number of commodity class PCs (as opposed to a fewer number of server-class PCs) organised into geographically distributed clusters of several thousand PCs. This is done to be both cost-effective and superior in performance to an implementation that uses fewer but more expensive high end servers. Lower end PCs can be used, because Google provides their reliability through software, rather than hardware. Reliability is achieved by replicating their services across a large number of machines, with failures being automatically detected and handled by the system. Such a setup allows for situations such as power outages or other localised disasters, as there will be machines located elsewhere to take their place.
An architecture such as Google’s is appropriate for applications such as search engines, or services requiring large amounts of computation, but uses stateless servers, as it is then easy to distribute requests over multiple servers. The main disadvantages of such a system appear to be issues of power and cooling. Since there are many more machines and they are of lower quality, such an architecture also means more repair costs and administration. However in Google’s case, there are only a few applications a server can be running, making maintenance much simpler and cheaper.
Serving a Query
<imagePlaceholder - figure 2>
When a query is sent, Google’s load balancing system selects the cluster which the user’s request will be processed by. This is done by examining the available capacity at each cluster, and taking into account the user’s physical location in order to minimise the time it takes to process the user’s request. The query is then sent to the selected cluster and processed locally to the chosen cluster.
Each cluster contains a load balancer that keeps track of available Google Web servers, which co-ordinate the query execution with the index servers (as shown in figure 2). The index server(s) must search the entire list of documents containing every word in the query (which was stored earlier as an inverted index when the system last crawled and indexed the Web) and compute a rank score for every document that does match. This process is time consuming, due to the large amounts of data involved. Google manages to deal with this by parallelising. The index is broken down into smaller ‘index shards’ and each shard is sent to a different machine for processing. Should a machine be unavailable, the shard can simply be redirected to another. The number of shards can also be increased, should the index become larger, which is essential if the system hopes to scale well with the growth of the Web.
The next step is obtaining the ordered list of documents and title and associated keywords in context. This is managed in a similar way to looking up the indexes, by splitting the documents into smaller, easier to manage shards and distributed around the cluster for processing.
Conclusions
Google’s architecture is reliable. It achieves reliability by using a multitude of computers, using software techniques as opposed to hardware techniques to be more cost effective, and allow for more individual component failures. Its cluster architecture also ensures the system is protected against localised disasters due to its geographic distribution around the world.
Google is also fast. Speed is achieved through the use of its replication mechanism and ability to process a query in parallel amongst a large number of computers. Google’s adaptive load balancing mechanism for cluster selection also ensures the best performance possible is achieved.
On top of Google’s architecture, it also has its page ranking algorithm, which places it at the front of its competition, and has changed the way many people experience the Web.
References
- Brin, S. & Page, L. The Anatomy of a Search Engine. Retrieved September 13, 2008, from http://infolab.stanford.edu/~backrub/google.html
- Barroso, L. A. & Dean J. & Hölzle U. (2003). Web Search for a Planet: The Google Cluster Architecture. Retrieved September 12, 2008 from http://research.google.com/archive/googlecluster-ieee.pdf