How does Google work? (I)

This is the first in a series of articles unveiling some 'secrets' of this spectacular platform of services running on the Web. 'Secrets' in quotes, because most of the information we will discuss was provided by Google itself and is available to anyone who wants it. We will obviously try to use the simplest possible language, technically accessible to most of us. We believe that what we have to report is quite interesting. We chose Google because this corporation, which started in a university lab room at Stanford, has become powerful to the point of being a provider of telecommunications, advertising, entertainment and computing services and, above all, has begun to assume a standardizing/dynamizing position in fields of research that are absolutely emerging today, such as Big Data, which we will discuss later.

I wanted to start by telling you a little story. In 2004/2005, as a mid-level technician and freelance Web programmer in ASP/ASP.Net/C#/SQL Server, I attended the conference at UCAN that marked the launch of the now-defunct Angolan Free Software Association (ASL).
The conference was full of people with first-class credentials, such as Dr Pedro Teta, Eng Dimonekene Ditutala and MSc Mateus Padoca Calado. However, what really caught my attention was the statement by Dr Aires Veloso (an experienced programming teacher), who stressed that a serious system had to be built with a real framework/language, such as .Net or Java.
Those words stayed in my mind, especially because I knew nothing about parallel computing. Only years later did I realize that what he said made perfect sense. That era was dominated by dynamic development languages like ColdFusion Modules (CFM), ASP and a certain PHP, lol. These were, at the time, languages that were not object-oriented and had no support for parallel computing.

So it was only natural that a search and advertising service like Google, which was growing at full speed, would be 'tempted' to choose platforms like ASP.Net and Java. But no way!!! The genius of Sergey Brin and Lawrence Page 'allowed them' to take another path: they chose to use a language that until then was unknown on the Web: Python (a language I had the opportunity to learn as a trainee at a government institution).
In fact, Sergey and Larry did not choose it by chance. The language supported object orientation, was strongly typed (with interesting data types, such as lists and dictionaries) and, above all, it is free software. They had 'only' the challenge of porting it to the Web, because until then there was no record of this language being used on the Web in any massive way; it lived on the desktop and, even there, much more in applications for administering *NIX systems.

This challenge was overcome using the Common Gateway Interface (CGI) on top of the Apache HTTP server; that is, through this interface it was possible for non-native code to be executed by Apache. Some of you may remember that before 2000, and a little after, people programmed for the Web even in C and C++ (in my opinion something terrible), precisely because Apache's CGI allowed this versatility. It was somewhat strange, and numerous security issues involving memory overruns (buffer overflows) over HTTP were revealed. But that is outside our context.
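For readers who never saw one, here is a minimal sketch of a Python CGI script (the parameter name is invented, and the standard-library cgi module used here has been deprecated in recent Python versions): Apache runs the script as an external process and relays whatever it prints to stdout back to the HTTP client.

```python
#!/usr/bin/env python
# A minimal classic CGI script (illustrative sketch, not production code).
import cgi

form = cgi.FieldStorage()              # parses QUERY_STRING / POST data
name = form.getvalue("name", "world")  # real code would escape this value

# CGI protocol: print the headers first, then a blank line, then the body.
print("Content-Type: text/html")
print()
print("<html><body><h1>Hello, %s!</h1></body></html>" % name)
```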

This decision, far from being a display of NERD style on the part of Sergey and Larry, was rather a long-term strategy that would influence Google's whole infrastructure growth policy: the infrastructure would be supported by low-priced products of high operational performance, which would allow it to scale easily at low cost. If Google had developed its infrastructure on frameworks such as ASP.Net or Java, it would have run the risk of becoming too dependent on patents and technologies from companies that it knew would, sooner or later, be its business adversaries.

The problem of growth

With the increasing amount of information produced on the Web, Google's indexing services grew in size. From 1999 to 2009, that is, in 10 years, Google went from indexing 70 million documents to many billions; the average number of queries processed per day increased about 1000 times; and each document came to carry about three times more information in the indexing services.

Managing such a large amount of information (indexes and documents) was from the very beginning a challenge for Google's staff.
As every winning project has a strong foundation, Google is no exception to the rule. Its strength also lies in the excellent design of its architecture: simple, but above all intelligent.
At first Google adopted the following architecture (1997):

Google's architecture in 1997

Despite its apparent simplicity, it was an already complex architecture, possessing elements that are in use to this day. As we can see, there are two groups of servers (clusters) that play an important role behind the search (query) requests received by the server we will call the frontend: the indexing servers and the documentation servers.
The indexing servers hold some information about pages, such as an inverted index of a URL, like 'com.wordpress.snnangola'. The documentation servers store all the possible documents of the Web and order them randomly into documentation fragments, which we will talk about below. Each document has properties such as a unique ID known as the docid, a set of keywords that will match it to a possible search, and a score assigned by the PageRank algorithm.

When a search request is sent to the frontend server, it forwards it to the indexing server, which maps each word of the request to a list of relevant documents. The relevance (score), or degree of importance, of a document is determined by PageRank. This relevance determines the order in which documents are returned in the response to the user's request (a better PageRank comes out first).
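A toy sketch of this structure, in the language Google itself chose, may help. Everything here is invented for illustration: the terms, the docids and the scores, which merely stand in for PageRank values.

```python
# Toy inverted index: each term maps to a posting list of
# (docid, score) pairs; "score" stands in for the PageRank value.
inverted_index = {
    "capital": [(42, 0.91), (7, 0.30)],
    "angola":  [(42, 0.91), (99, 0.62)],
}

def lookup(term):
    """Return the documents matching a term, best PageRank first."""
    postings = inverted_index.get(term, [])
    return sorted(postings, key=lambda p: p[1], reverse=True)

print(lookup("angola"))   # [(42, 0.91), (99, 0.62)]
```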

This apparent simplicity, however, hides a high complexity. As we have already said, with the growth in the number of documents, users and mobility on the Web, the number of searches per second increased brutally, which could dangerously raise the latency (the delay, or RTT) of the responses to users' requests; after all, there would be more requests competing among themselves over who would be served first. Luckily, Google also had a solution to this problem, using parallel computing. How?

From the figure above, we observe that both the index servers and the document servers are divided into fragments (shards). What really matters here is the indexing fragments, because they deal directly with users' requests via the frontend server, so the indexing servers would, in theory, be more subject to stress. Well, this is not a problem, because each fragment holds the indices for a random subset of the documents in the total index. This technique is known as document-based index partitioning.

This provides numerous advantages: each fragment can process each request independently, it improves network traffic performance, and it facilitates the management and maintenance of information per document.
But it also has its cons: since each fragment holds only part of the index, each search word needs to be looked up in each of the N fragments.
To minimize these effects, the requests made to a fragment are distributed across a pool of servers. Each fragment is likewise distributed across a pool of servers in the indexing cluster. The sketch below illustrates the idea.
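Here is a minimal sketch of document-based partitioning, with invented data and a naive hash of the docid standing in for whatever assignment scheme Google actually used: each shard indexes only its own subset of documents, so a query must be fanned out to every shard and the partial results merged.

```python
# Document-partitioned inverted index: each shard holds a full
# term -> [(docid, score)] mapping, but only for its own documents.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def add_document(docid, terms, score):
    shard = shards[docid % NUM_SHARDS]   # naive assignment by docid
    for term in terms:
        shard.setdefault(term, []).append((docid, score))

def search(term):
    # The con mentioned above: every shard must process every request.
    results = []
    for shard in shards:
        results.extend(shard.get(term, []))
    return sorted(results, key=lambda p: p[1], reverse=True)

add_document(42, ["capital", "angola"], 0.91)
add_document(99, ["angola"], 0.62)
print(search("angola"))   # partial results from all shards, merged
```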

The search process can then be summarized as follows (a toy sketch of the whole flow comes right after the list):

  1.  The user types one or more words (for example: 'capital of angola');
  2.  The frontend server receives the request and sends it to one of the fragments located in one of the pools of the indexing cluster;
  3.  The fragment matches the word(s) to an ordered list of documents composed of docid, score, etc.;
  4.  One of the fragments in one of the pools of the documentation cluster receives the message from the indexing side and, given the docid and the word(s), generates a set of documents composed of title and excerpt. Of course, with the docid it is already easier to locate the entire document on disk.
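Putting the four steps together in one hypothetical sketch (the index, documents, titles and excerpts are all invented, and real ranking is far more sophisticated):

```python
INDEX = {                                  # term -> [(docid, pagerank)]
    "capital": [(42, 0.91), (7, 0.30)],
    "angola":  [(42, 0.91), (99, 0.62)],
}
DOCS = {                                   # docid -> (title, excerpt)
    7:  ("Capital city", "A capital is the seat of government..."),
    42: ("Luanda", "Luanda is the capital of Angola..."),
    99: ("Angola", "Angola is a country in southern Africa..."),
}

def handle_query(query):
    # Steps 1-2: the frontend splits the query; words absent from the
    # index (stopwords like "of") are simply ignored here.
    words = [w for w in query.lower().split() if w in INDEX]
    if not words:
        return []
    # Step 3: the indexing fragment intersects the posting lists and
    # orders the surviving docids by their best score (PageRank).
    common = set.intersection(*(set(d for d, _ in INDEX[w]) for w in words))
    scores = {}
    for w in words:
        for docid, score in INDEX[w]:
            if docid in common:
                scores[docid] = max(scores.get(docid, 0.0), score)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Step 4: the documentation fragment resolves docids to title + excerpt.
    return [(docid,) + DOCS[docid] for docid in ranked]

print(handle_query("capital of angola"))   # Luanda comes out first
```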

However, this raises another question: if one of the indexing/documentation clusters goes down, for whatever reason, does the search die? The answer lies in the distributed computing strategy that Google adopted from the beginning.

Distributed Computing Strategy

Google has always adopted a distributed computing strategy. This can even be perceived from the architecture already described in the figure above, which we will now update, from another perspective, in the figure below:

Google's 1997 architecture supporting caching and AdSense

We can easily see the changes that have taken place:

  1. The introduction of replication
  2. The introduction of caching

The Portuguese researcher Jorge Cardoso, in the book 'Programming Distributed Systems in Java' (FCA Publishing), wrote that 'the principle of locality admits that communication between computers follows two distinct patterns. First, a computer is more likely to communicate with computers that are closer than with computers that are further away. Second, a computer is likely to communicate with the same set of computers repeatedly'.

If you paid attention, you will have noticed that the first pattern corresponds to the replication technique, and the second to the caching technique. In the figure above we can clearly see both concepts at work. Right next to the frontend server we notice the presence of a set of caching servers, which we do not know for sure whether they belong to a cluster. This is very useful when users repeat searches whose results have not changed. In that case, the request does not need to reach the indexing cluster; it goes to the supposed caching cluster, which forwards the previously retrieved documents to the user. A sketch of the idea follows.
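A minimal sketch of that idea, assuming an in-memory dictionary as the cache and an invented search_index_cluster function standing in for the real fan-out to the indexing shards:

```python
# Query cache sitting in front of the indexing cluster (all invented).
cache = {}

def search_index_cluster(query):
    # Placeholder for the real, expensive fan-out to the index shards.
    return ["results for: " + query]

def cached_search(query):
    if query in cache:                       # hit: the cluster is spared
        return cache[query]
    results = search_index_cluster(query)    # miss: do the full lookup
    cache[query] = results                   # remember for repeat queries
    return results

print(cached_search("capital of angola"))   # miss -> goes to the cluster
print(cached_search("capital of angola"))   # hit  -> served from cache
```

A real cache would also have to expire entries, which is exactly the 'have the results changed?' question raised above.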

From the figure above we also note the existence of vertical replication in the indexing and documentation fragments. This is very important, in that it explains why we rarely see a Google search fail or return an error. Even if a fragment goes down because of a logical or physical error on one of the cluster's servers, thanks to replication the task automatically moves to another fragment belonging to another pool of servers, as sketched below. This is a marvel: we believe it is difficult for a whole pool to fail, but even if it does, there will always be another pool to replace it.
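A toy failover sketch under those assumptions (server names, failure rate and behaviour are all invented): each replica in the pool is tried in turn, so a single failure just moves the request along.

```python
import random

pool = ["replica-1", "replica-2", "replica-3"]   # one shard's server pool

def query_replica(server, term):
    if random.random() < 0.3:                    # simulate a server failure
        raise ConnectionError(server + " is down")
    return "results for '%s' from %s" % (term, server)

def query_shard(term):
    for server in random.sample(pool, len(pool)):  # also spreads the load
        try:
            return query_replica(server, term)
        except ConnectionError:
            continue                 # fail over to the next replica
    raise RuntimeError("whole pool failed (unlikely with replication)")

print(query_shard("angola"))
```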

By now we can see why Google responds so quickly to our searches. Yes, there are other techniques, and some of them will be discussed later; but this is partly because Google keeps images of the entire Web replicated in its documentation clusters, for everyone.

In theory, the clusters may hold many pages that no longer exist on the servers where they were hosted. The Google system is smart enough to verify this and, where possible, update itself in the fastest way. But sometimes this did not happen so quickly, and for a very simple reason: the clusters began to expand very rapidly, on a global scale.
Sometimes we do a search from Angola, and the Web results come from Ireland, the video (YouTube) comes from Brazil and the AdSense from the North Pole (exaggerating!!!), and yet there is virtually no latency. How is all this possible? If a documentation cluster in the USA updates its documents, the clusters all over the world need to do the same automatically, with as little delay as possible, otherwise high revenues are lost.
The next article will tell you about it.
