How does Google work?

How does Google work? This is an investigation that tries to answer this very broad question. In this document we will look specifically at Google's architecture and the sequence of events that takes place when a query is entered into the search box on the Google web page. We will also try to answer how Google offers such a highly available service in which the servers effectively never go down, how it achieves very fast response times, and how it copes with the enormous number of users accessing the service at any particular time. Google is a highly distributed system in which all of the servers are multithreaded, and this is very relevant to SOFTENG325 because describing the key characteristics, benefits and challenges of distributed software systems are the main learning outcomes of the course. Google's servers are also replicated many times over, and the assignment of requests to individual servers is done by load-balancing algorithms. This, as with any other server farm, improves Google's scalability and availability. All of the concepts mentioned above are part of the SOFTENG325 course, and investigating how Google works in the context of these concepts is invaluable to our understanding of the course material.

First of all we will look at Google's architecture, and before that, at how Google serves a query that the user types into the search box on Google's web page. When the user types a query, the host browser performs a DNS (Domain Name System) lookup to map www.google.com to a particular IP address. To provide this highly available service, Google maintains clusters of thousands of servers at different geographical locations around the world. This arrangement of machines makes the Google service resilient to catastrophes in which a few servers in a cluster, or even a whole cluster, might fail. We have learnt in the lectures that DNS load balancing spreads load across different application hosts that share a common domain name. Similarly, a load-balancing system is used to assign a particular cluster to the user's query. This load-balancing algorithm takes into account factors such as the user's physical location relative to each cluster and the available capacity of the individual clusters. Once a cluster has been assigned, an HTTP request is sent to that cluster, and a local hardware-based load balancer assigns the request to a particular Google Web Server (GWS), strictly based on the amount of load on each of the GWSs in the cluster. The query is then processed by the assigned GWS and the results are returned to the client as an HTML page.
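To make this concrete, here is a minimal Python sketch of the kind of cluster selection a geography-aware load balancer might perform. The cluster names, coordinates, spare-capacity figures and scoring rule below are invented for illustration and are not Google's actual algorithm.

 # A minimal sketch of cluster selection by proximity and spare capacity.
 # All cluster data here is hypothetical.
 import math

 # (name, latitude, longitude, fraction of capacity still free)
 CLUSTERS = [
     ("us-east", 39.0, -77.5, 0.20),
     ("europe-west", 50.1, 8.7, 0.65),
     ("asia-east", 35.7, 139.7, 0.40),
 ]

 def distance(lat1, lon1, lat2, lon2):
     """Rough great-circle distance in kilometres (haversine formula)."""
     r = 6371.0
     p1, p2 = math.radians(lat1), math.radians(lat2)
     dp = math.radians(lat2 - lat1)
     dl = math.radians(lon2 - lon1)
     a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
     return 2 * r * math.asin(math.sqrt(a))

 def pick_cluster(user_lat, user_lon):
     """Score every cluster by proximity and spare capacity, return the best."""
     def score(cluster):
         name, lat, lon, free = cluster
         # Closer is better; more spare capacity is better.
         return distance(user_lat, user_lon, lat, lon) * (1.0 - free)
     return min(CLUSTERS, key=score)[0]

 # A user in Auckland (approx. -36.8, 174.8) would be routed like this:
 print(pick_cluster(-36.8, 174.8))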

When we enter a word in the Google search bar we quickly see the results to our query without thinking about the back-end processes that caused those results to be displayed. These processes can be subdivided into two broad phases. In the first phase, index servers map each word in the query to a list of documents. This mapping is done using an inverted index that maps each word in the query to a number of relevant documents. The list of occurrences of a query word across the documents is called the "hit list". The indexing system stores these hits in barrels, which creates a partially sorted forward index. The forward index is further processed to create the inverted index that is used for the mapping. The indexer also parses all the web pages and extracts the links they contain, which are stored, along with other important information about the pages, in an anchors file. The URL resolver then reads the anchors file, converts the relative URLs into absolute URLs and then into "docids". Each document is then given a priority score, with higher-scoring documents appearing at the top of the search results page and lower-scoring ones further down. The amount of data searched through during the whole process is thousands of terabytes, and to speed up the process, the mapping to relevant documents and the processing to get the final ranked list are parallelised.
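A toy example helps illustrate the relationship between the forward index, the inverted index and hit lists. The documents and docids below are made up, and real hit lists also record font, capitalisation and other details.

 # A minimal sketch of a forward index and the inverted index derived from it.
 from collections import defaultdict

 documents = {
     1: "distributed systems use replication for availability",
     2: "google uses replication and load balancing",
     3: "load balancing improves availability",
 }

 # Forward index: docid -> list of (position, word) pairs in that document.
 forward_index = {
     docid: list(enumerate(text.split())) for docid, text in documents.items()
 }

 # Inverted index: word -> hit list of (docid, position) pairs.
 inverted_index = defaultdict(list)
 for docid, hits in forward_index.items():
     for position, word in hits:
         inverted_index[word].append((docid, position))

 # Looking up a query word returns its hit list of relevant documents.
 print(inverted_index["replication"])   # [(1, 3), (2, 2)]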

The index is divided into smaller parts called "index shards", each of which maps to a subset of documents from the original index. In simple terms, to process a query it is first divided up into its constituent words and every word is dealt with separately. The mapping of all the words to the relevant documents is done in parallel by dividing the index into a number of shards and doing the mapping and processing of documents for each shard concurrently on different machines. As we have learnt in 325, replicating data across different machines enhances its availability. In the same way, to ensure uninterrupted service, every shard has replicas containing an exact copy of the documents in the original shard. While a query is being processed, if one of the shards goes down for some reason, the processing is assigned to one of its replicas while the original shard is fixed. When one of these replicas is down, the system's capacity is reduced by the fraction of requests that the replica could have served. This way all parts of the index remain available all the time. The end output of this first phase of query execution is an ordered list of document identifiers, or "docids".
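The following sketch shows the idea of querying replicated index shards in parallel and falling back to a replica when one copy is unavailable. The shard contents and the simulated failure are invented for illustration.

 # A minimal sketch of parallel shard lookup with replica fallback.
 from concurrent.futures import ThreadPoolExecutor

 # Each shard covers a subset of documents; each shard has identical replicas.
 shards = [
     {"replicas": [{"google": [1, 4]}, {"google": [1, 4]}]},
     {"replicas": [None, {"google": [7]}]},  # primary replica is "down"
 ]

 def search_shard(shard, word):
     """Try each replica in turn; skip replicas that are unavailable."""
     for replica in shard["replicas"]:
         if replica is not None:
             return replica.get(word, [])
     return []  # every replica of this shard is down

 def search(word):
     """Query all shards concurrently and merge their partial docid lists."""
     with ThreadPoolExecutor() as pool:
         partial = list(pool.map(lambda s: search_shard(s, word), shards))
     return sorted(docid for hits in partial for docid in hits)

 print(search("google"))   # [1, 4, 7]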

The second phase involves using the docids to find the title, summary and uniform resource locator (URL) of these documents. Document servers form an important part of this phase: they fetch the documents from disk and produce a title and a brief summary that gives the user some idea of how the keyword has been used in the document. Note that if two different keywords map to the same document, the summaries generated will be different, because (as noted before) the summary shows the context in which the keyword was used in the document. Like the first phase, this one is also highly parallelised: all the documents are divided into smaller shards, with a number of server replicas for each shard, and the assignment of requests to a particular shard depends on a load-balancing algorithm. In addition to these two phases, the GWS also sends the query to a spell-checking system and to an advertising system, which generates advertisements relevant to the query. After all this, a final HTML page is generated with all the query results, advertisements and spelling suggestions if applicable.
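A small sketch of query-dependent snippet generation shows why two different keywords that map to the same document get different summaries. The fixed-size word window used here is an assumption for illustration, not Google's actual snippet algorithm.

 # A minimal sketch: the summary is the text surrounding the query word.
 def snippet(document_text, query_word, window=4):
     """Return a few words of context around the first hit of query_word."""
     words = document_text.split()
     lowered = [w.lower() for w in words]
     if query_word.lower() not in lowered:
         return ""
     hit = lowered.index(query_word.lower())
     start = max(0, hit - window)
     return "... " + " ".join(words[start:hit + window + 1]) + " ..."

 doc = "Google clusters use load balancing and replication to stay available"
 print(snippet(doc, "replication"))   # context around "replication"
 print(snippet(doc, "balancing"))     # a different summary for the same document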

Figure 1 Google's Architecture for handling queries

Google crawls the World Wide Web at regular intervals, to update the information it has about web pages and to take into account the new web pages created since the last time it crawled the web. Since the World Wide Web is made up of billions of web pages, crawling takes an extremely long time, so to finish it in the least time possible Google uses a distributed crawling system. A single URL server contains the list of all the URLs that have been crawled before. Of course this server is also replicated a number of times, like all the other important servers in the Google architecture. As a starting point, all the crawlers (collectively termed Googlebot) get their list of web pages to crawl from the URL server. Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser; in fact, Googlebot can request thousands of different pages simultaneously. If links to other web pages are found while crawling, they are added to a "visit soon" queue of web pages to be crawled. Although all this might sound very simple, there are a number of challenges. First of all, the URLs in the "visit soon" queue must be compared against the URLs in Google's URL index to prevent the crawlers from visiting the same page again. Google must also determine how often to crawl a particular web page, as crawling an unchanged web page is a waste of resources, so an algorithm is used to determine how often each page should be crawled. Web pages whose content changes frequently are usually crawled more often than more static web pages.
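The core of such a crawl can be sketched as a loop over a "visit soon" queue guarded by a set of already-crawled URLs. The seed URLs and the fetch_links stand-in below are hypothetical; a real Googlebot fetches pages over the network and runs many such loops in parallel.

 # A minimal sketch of a crawl loop with a "visit soon" queue and a URL index.
 from collections import deque

 def fetch_links(url):
     """Stand-in for Googlebot fetching a page and extracting its links."""
     fake_web = {
         "http://a.example": ["http://b.example", "http://c.example"],
         "http://b.example": ["http://a.example"],
         "http://c.example": [],
     }
     return fake_web.get(url, [])

 def crawl(seed_urls):
     visit_soon = deque(seed_urls)      # pages waiting to be crawled
     crawled = set()                    # URL index: pages already visited
     while visit_soon:
         url = visit_soon.popleft()
         if url in crawled:
             continue                   # avoid re-crawling the same page
         crawled.add(url)
         for link in fetch_links(url):
             if link not in crawled:
                 visit_soon.append(link)
     return crawled

 print(sorted(crawl(["http://a.example"])))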

Now moving on to how Google produces such high-quality results. This is all related to the PageRank algorithm invented by the Google founders Sergey Brin and Lawrence Page. PageRank can be thought of as a model of the probability of a user visiting a particular web page, so web pages with higher PageRanks have a higher probability of being visited than those with lower PageRanks. The equation used to calculate the PageRank of a particular web page is as follows:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

This is the equation published by the founders of Google in their original research paper, written while they were at Stanford University in 1998. Google has probably made changes to it over the years, but does not reveal them. The equation gives us a basic idea of the type of algorithm Google uses to determine which page should appear at the top of your search results and which should appear further down. Imagining a page A with pages T1 to Tn having hyperlinks to it, PR(A) refers to the PageRank of page A. d is called the damping factor and is a constant value of about 0.85. C(Tn) refers to the total number of links to other pages on page Tn, one of the n pages providing hyperlinks to A. In other words, the PageRank calculation takes into account the damping factor, the number of other pages linking to that particular page, and the total number of links on each of those pages. Together, all these values give us the PageRank of page A, which is conventionally reported on a scale of 0 to 10.
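The published formula can be evaluated iteratively: start with an arbitrary PageRank for every page and repeat the calculation until the values settle. The following sketch does this for a tiny invented link graph; it implements the 1998 formula only and ignores the many other signals Google is believed to use.

 # A minimal sketch of the published PageRank iteration on an invented graph.
 def pagerank(links, d=0.85, iterations=50):
     """links maps each page to the list of pages it links to."""
     pr = {page: 1.0 for page in links}
     for _ in range(iterations):
         new_pr = {}
         for page in links:
             # Pages T that link to this page each contribute PR(T)/C(T).
             incoming = sum(
                 pr[t] / len(links[t]) for t in links if page in links[t]
             )
             new_pr[page] = (1 - d) + d * incoming
         pr = new_pr
     return pr

 # A -> B, A -> C, B -> C, C -> A
 graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
 print(pagerank(graph))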

From the information so far it is quite obvious that Google uses a lot of parallel processing in returning the result of a user's query. By parallelising the search over many server machines, Google reduces the amount of time required to return a result. As the individual shards do not need to communicate with each other, the speedup is almost linear: the computation is divided across a large number of CPUs and the results are combined in a very fast and inexpensive merging step. At the same time, the replication approach used by Google results in better throughput and availability. Information held by one machine is replicated across many others; not only does this allow many different requests that require the same information or documents to be processed at the same time, it also makes the Google system resilient to faults in one or many of its server machines. If at any time one of the servers goes down, there is always a backup server to take its place. In theory this does affect Google's retrieval speed, as there is one less server processing parts of the request in parallel, but we normally never notice it because there are so many other servers working that the effect on speed is negligible. In the same way, the throughput is theoretically reduced, but in practice we never feel it.

Another reason for Google's fast response times is the data structures used, which have been designed to avoid disk seeks wherever possible. One of the main data structures is the repository, which contains the full HTML of every web page. Each HTML file is compressed using zlib. The reason zlib is used instead of alternatives, in particular bzip (which offers a higher compression ratio of about 4 to 1 instead of 3 to 1), is that zlib offers much higher data access speed. Another data structure is the document index, a fixed-width ISAM (Index Sequential Access Mode) index ordered by docid. Each index entry contains the current document status, a pointer to the actual document in the repository, a document checksum and other information. If the document has been crawled, the entry also contains a pointer to a variable-width file called docinfo which holds the document's URL and title; otherwise it points only to a URL list containing just the URL. When a web browser requests www.google.com, or any of the subsequent search result pages, the HTML files are sent by Google to the client compressed with the gzip algorithm; on receiving the gzip file, the web browser decompresses it and displays it as a web page. This design means data can be fetched in one disk seek on the host and the HTML page transfers faster over the internet. As mentioned earlier, hit lists are also an important data structure: a hit list records the occurrences of a particular word in a specific document, including details such as font, capitalisation and the position of the word in that document. A hand-optimised compact encoding is used to encode the position, font and capitalisation. Other important data structures are the forward and inverted indexes, where the inverted index is created by further sorting of the forward index by a sorter.

Figure 2 Forward and Inverted indexes
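The repository design can be illustrated with a small sketch comparing zlib and bzip2 compression of a page's HTML (Python's bz2 module implements bzip2). The page content below is made up and the exact ratios will vary, but it shows the speed-versus-ratio trade-off described above.

 # A minimal sketch of a repository entry: a page's HTML compressed with zlib,
 # compared against bzip2, which compresses more but is slower to access.
 import bz2
 import zlib

 html = b"<html><body>" + b"How does Google work? " * 200 + b"</body></html>"

 zlib_blob = zlib.compress(html)
 bz2_blob = bz2.compress(html)

 print("original:", len(html), "bytes")
 print("zlib:    ", len(zlib_blob), "bytes")
 print("bzip2:   ", len(bz2_blob), "bytes")

 # Reading a page back out of the repository is a single decompression step.
 assert zlib.decompress(zlib_blob) == html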

The main limitation of the Google search engine that we noticed during our investigation is that it does not crawl and index every web page every time a crawl is initiated. Some web pages are crawled more often than others, and if the data on a page changes, the change will not appear in search results until the next crawl. This is especially evident for web pages that do not have a high PageRank and are therefore not crawled as often. So there will be times when the user enters a query and the information returned by Google is not up to date. The probability of this happening is fairly low, but given the sheer size of the web it definitely does happen, and out-of-date information is returned. This points to an improvement that could be made to the algorithm that determines how frequently a web page is crawled. If some mechanism could be devised whereby Googlebot is notified when a change is made to a web page, or a new page is added to the World Wide Web, it would be a great improvement. It would make the search engine more up to date, avoid wasting resources on revisiting unchanged pages, and make optimum use of resources by crawling only when a page changes or a new page appears.

Some of the techniques used by Google, such as replication, are also used by other websites to improve their service. In the lectures we talked about how very busy websites like www.cnn.com use replication to have more than one server serving requests from different clients; in the same way, there are a number of Google clusters around the world, each containing many replicated servers serving requests. Just as with CNN, BBC or any other busy website, load balancers are used in Google's case, although Google's load balancers probably take more factors into account before sending a request to a particular server. Just as we were told that dealing with mutable data is one of the main problems created by replication, we are sure Google also has researchers working on finding the best way to update the data in all its clusters worldwide every time the World Wide Web is crawled. Google's technique of distributed crawling and parallel processing of the shards is quite similar to the techniques used in parallel computing, where multiple related instructions are carried out simultaneously on different computers.

In the end it can be said that Google's architecture, high availability, fast response times and high throughput have all contributed to making it the most visited website on the internet. However, if we were to choose one main factor behind its success, it would have to be the high-quality search results produced by the PageRank algorithm invented by its founders. If it were not for the PageRank algorithm, Google might not even have come into existence, or at least not be as widespread as it is now!

Bibliography

Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.

Barroso, L. A., Dean, J. and Holzle, U. (2003). Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 23(2), 22-28.

Austin, D. How Google Finds Your Needle in the Web's Haystack. Retrieved on September 10, 2008 from http://www.ams.org/featurecolumn/archive/pagerank.html

Blackman, N. How Google Works. Retrieved on September 10, 2008 from http://www.googleguide.com/google_works.html