Thoughts, Knowledge and Experience: August 2011

Google Platform

Current hardware

Servers are commodity-class x86 PCs running customized versions of Linux. The goal is to purchase CPU generations that offer the best performance per dollar, not absolute performance. How this is measured is unclear, but it is likely to incorporate the running costs of the entire server, and CPU power consumption could be a significant factor.
Servers as of 2009 were custom-made, open-top machines containing two processors (each with an unknown number of cores or interconnected processing units), a considerable amount of RAM spread over 8 DIMM slots housing double-height DIMMs, and two SATA hard drives connected through a standard ATX-sized power supply. According to a CNET article published on April 1, each server has a novel 12-volt battery to reduce costs and improve power efficiency.

Estimates of the power required for over 450,000 servers range upwards of 20 megawatts, which costs on the order of US$2 million per month in electricity charges. The combined processing power of these servers might reach from 20 to 100 petaflops.
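
As a rough sanity check, the figures above can be run through some simple arithmetic. The sketch below assumes a constant 20 MW draw and a 30-day month (both are assumptions, not figures from the estimates) and derives the implied power per server and electricity price.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumptions (not from the estimates): constant draw, 30-day month.
SERVERS = 450_000
POWER_MW = 20
MONTHLY_COST_USD = 2_000_000
HOURS_PER_MONTH = 24 * 30


def watts_per_server(power_mw, servers):
    """Average electrical power per server, in watts."""
    return power_mw * 1_000_000 / servers


def implied_price_per_kwh(monthly_cost_usd, power_mw, hours):
    """Electricity price implied by the monthly bill, in US$ per kWh."""
    energy_kwh = power_mw * 1_000 * hours  # MW -> kW, then kW * h
    return monthly_cost_usd / energy_kwh


print(round(watts_per_server(POWER_MW, SERVERS), 1), "W per server")
print(round(implied_price_per_kwh(MONTHLY_COST_USD, POWER_MW, HOURS_PER_MONTH), 3),
      "US$ per kWh")
```

Under these assumptions the numbers work out to roughly 44 W per server and about 14 US cents per kWh, so the quoted power and cost estimates are at least mutually plausible.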

Specifications:

· In 2002: upwards of 15,000 servers ranging from 533 MHz Intel Celeron to dual 1.4 GHz Intel Pentium III (as of 2003).

· One or more 80 GB hard disks per server (2003)

· 2–4 GB of memory per machine (2004)

· A 2005 estimate by Paul Strassmann put the number at 200,000 servers, while unspecified sources claimed it was upwards of 450,000 in 2006.

· ~ 16 GB RAM, 2 TB disk space per machine (2009)

The exact size and whereabouts of the data centers Google uses are unknown, and official figures remain intentionally vague. According to a very old estimate (from 2000, while Google was in its infancy and had one product), Google's server farm consisted of 6,000 processors and 12,000 common IDE disks (2 per machine, and one processor per machine) at four sites: two in Silicon Valley, California and two in Virginia. Each site had an OC-48 (2488 Mbit/s) internet connection and an OC-12 (622 Mbit/s) connection to other Google sites. The connections are eventually routed down to 4 × 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two Ethernet switches.[citation needed]

Network topology

When a client computer attempts to connect to Google, several DNS servers resolve www.google.com into multiple IP addresses via a round-robin policy. This acts as the first level of load balancing and directs the client to different Google clusters. A Google cluster has thousands of servers, and once the client has connected to the cluster, additional load balancing is done to send the queries to the least loaded web server. This makes Google one of the largest and most complex content delivery networks.
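
The two levels of load balancing described above can be illustrated with a small sketch. This is not Google's implementation: the class names, the server names and the 192.0.2.x addresses (a range reserved for documentation) are all hypothetical, and "load" is simply a count of requests handed out.

```python
from itertools import cycle


class RoundRobinResolver:
    """First level: hand out cluster addresses in rotation, like round-robin DNS."""

    def __init__(self, addresses):
        self._cycle = cycle(list(addresses))

    def resolve(self):
        return next(self._cycle)


class Cluster:
    """Second level: send each query to the least loaded web server."""

    def __init__(self, servers):
        self.load = {name: 0 for name in servers}

    def pick_server(self):
        server = min(self.load, key=self.load.get)  # ties go to the first server
        self.load[server] += 1
        return server


# Hypothetical clusters keyed by the address the resolver returns.
clusters = {
    "192.0.2.1": Cluster(["web-a", "web-b"]),
    "192.0.2.2": Cluster(["web-c", "web-d"]),
}
resolver = RoundRobinResolver(clusters)


def route():
    """Resolve a cluster address, then choose a web server inside that cluster."""
    address = resolver.resolve()
    return address, clusters[address].pick_server()


for _ in range(4):
    print(route())
```

Successive calls alternate between the two clusters, and within each cluster the requests are spread across the servers that have handled the fewest so far.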

Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side), while new servers are 2U rackmount systems. Each rack has a switch. Servers are connected via a 100 Mbit/s Ethernet link to the local switch. Switches are connected to a core gigabit switch using one or two gigabit uplinks.[citation needed]

Software

Most of the software stack that Google uses on their servers was developed in-house. It is believed that C++, Java, and Python are favored over other programming languages. For example, the back-end of Gmail is written in Java and the back-end of Google Search is written in C++.
Google has acknowledged that Python has played an important role from the beginning, and that it continues to do so as the system grows and evolves.

The software that runs the Google infrastructure includes:

· Google Web Server — Custom Linux-based Web server that Google uses for its online services; according to Google, this is not based on Apache.

· Storage systems:

o Google File System and its successor, Colossus

o BigTable — structured storage built upon GFS/Colossus

o Spanner — planet-scale structured storage system, next generation of BigTable stack

· Chubby lock service

· Borg — job scheduling and monitoring system

· MapReduce and Sawzall programming language

· Indexing/search systems:

o TeraGoogle — Google's large search index (launched in early 2006), designed by Anna Patterson of Cuil fame.

o Caffeine (Percolator) — continuous indexing system (launched in 2010).

Google has developed several abstractions which it uses for storing most of its data:

· Protocol buffers — "Google's lingua franca for data", a binary serialization format which is widely used within the company.

· SSTable (Sorted Strings Table) — a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. It is also used as one of the building blocks of BigTable.

· RecordIO — a sequence of variable sized records.
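
To make the SSTable description more concrete, here is a minimal in-memory sketch of an ordered, immutable map from byte-string keys to byte-string values. It only illustrates the interface implied by the description: the class and method names are hypothetical, and the persistent on-disk format is left out entirely.

```python
from bisect import bisect_left


class SSTable:
    """Immutable, ordered map from byte-string keys to byte-string values."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())      # sort once, by key
        self._keys = tuple(k for k, _ in pairs)  # tuples: no updates after creation
        self._values = tuple(v for _, v in pairs)

    def get(self, key, default=None):
        """Point lookup by binary search over the sorted keys."""
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1


table = SSTable({b"com.example/b": b"2", b"com.example/a": b"1", b"org.example/": b"3"})
print(table.get(b"com.example/a"))
print(list(table.scan(b"com.", b"com/")))
```

Because the keys are kept sorted, both point lookups and range scans over a key prefix are cheap, which is one reason a structure like this is a convenient building block for a larger store.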

Software development practices

Most operations are read-only. When an update is required, queries are redirected to other servers, so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different servers in parallel, thus reducing latency.
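
A minimal sketch of this fan-out pattern follows. It assumes each backend holds a slice of the data and that the same query is run against every slice; the function names and the list-based "backends" are hypothetical stand-ins for real machines.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(query, backends, search_fn):
    """Run search_fn(backend, query) on every backend in parallel and merge the hits."""
    workers = max(1, len(backends))  # ThreadPoolExecutor needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda backend: search_fn(backend, query), backends)
        merged = []
        for hits in partials:  # pool.map yields results in backend order
            merged.extend(hits)
    return merged


def substring_search(backend, query):
    """Toy sub-query: scan one backend's documents for the query string."""
    return [doc for doc in backend if query in doc]


backends = [["apple pie", "banana"], ["apple tart"], ["cherry"]]
print(fan_out("apple", backends, substring_search))
```

With the sub-queries running concurrently, the total time is governed by the slowest backend rather than by the sum of all of them.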

To lessen the effects of unavoidable hardware failure, software is designed to be fault tolerant.
Thus, when a system goes down, data is still available on other servers, which increases reliability.
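
One simple way to picture this is a read that falls back to another replica when a server is down. The sketch below is only an illustration under that assumption; the `Replica` class and `ServerDown` exception are hypothetical names.

```python
class ServerDown(Exception):
    """Raised when a replica cannot be reached."""


class Replica:
    """One copy of the data, living on one (possibly failed) server."""

    def __init__(self, name, data, up=True):
        self.name = name
        self.data = data
        self.up = up

    def read(self, key):
        if not self.up:
            raise ServerDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful read."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ServerDown:
            continue  # this copy is unavailable, try the next one
    raise ServerDown("all replicas unavailable")


data = {"doc-1": "hello"}
replicas = [Replica("server-a", data, up=False), Replica("server-b", data)]
print(read_with_failover(replicas, "doc-1"))
```

The read succeeds even though the first server is down, because the same data is held by a second server.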

Search infrastructure

Index

Like most search engines, Google indexes documents by building a data structure known as an inverted index. Such an index makes it possible to obtain the list of documents that contain a given query word. The index is very large due to the number of documents stored in the servers.
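
For readers unfamiliar with the term, a toy inverted index can be built in a few lines. This is a generic textbook sketch, not Google's data structure; the whitespace tokenization and function names are simplifying assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def lookup(index, word):
    """Return the document IDs for one query word (empty list if unseen)."""
    return index.get(word.lower(), [])


documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking",
}
index = build_inverted_index(documents)
print(lookup(index, "quick"))
print(lookup(index, "the"))
```

The lookup never scans document text: it goes straight from the word to its list of document IDs, which is what makes the structure suitable for answering queries quickly.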

The index is partitioned by document IDs into many pieces called shards. Each shard is replicated onto multiple servers. Initially, the index was served from hard disk drives, as is done in traditional information retrieval (IR) systems. Google dealt with the increasing volume of queries by increasing the number of replicas of each shard, and thus the number of servers. Soon they found that they had enough servers to keep a copy of the whole index in main memory (although with low replication or no replication at all), and in early 2001 Google switched to an in-memory index system. This switch "radically changed many design parameters" of their search system, and brought a big increase in throughput and a big decrease in query latency.
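
The partitioning scheme can be sketched as follows. The modulo rule for assigning a document to a shard and the random choice of replica are assumptions made for illustration; the description above only says that the index is partitioned by document ID and that shards are replicated.

```python
import random
from collections import defaultdict


def shard_of(doc_id, num_shards):
    """Hypothetical placement rule: assign a document to a shard by its ID."""
    return doc_id % num_shards


def build_shards(documents, num_shards):
    """Build one small inverted index per shard, partitioned by document ID."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(documents):
        for word in set(documents[doc_id].lower().split()):
            shards[shard_of(doc_id, num_shards)][word].append(doc_id)
    return [dict(shard) for shard in shards]


def search(replicated_shards, word):
    """Ask one replica of every shard for the word, then merge the document IDs."""
    hits = []
    for replicas in replicated_shards:
        shard = random.choice(replicas)  # any replica of a shard can answer
        hits.extend(shard.get(word.lower(), []))
    return sorted(hits)


documents = {0: "red apple", 1: "green apple", 2: "red car", 3: "blue sky"}
shards = build_shards(documents, num_shards=2)
replicated = [[shard, dict(shard)] for shard in shards]  # two copies of each shard
print(search(replicated, "apple"))
print(search(replicated, "red"))
```

Since every shard holds only part of the documents, a query has to visit all shards, while adding replicas of a shard increases the number of queries that can be served at once.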

In June 2010 Google rolled out a next-generation indexing and serving system called "Caffeine", which can continuously crawl and update the search index. Previously, Google updated its search index in batches using a series of MapReduce jobs. The index was separated into several layers, some of which were updated faster than the others, and the main layer might not be updated for as long as two weeks. With Caffeine the entire index is updated incrementally on a continuous basis. Later Google revealed a distributed data processing system called "Percolator", which is said to be the basis of the Caffeine indexing system.
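
The difference between batch and incremental updating can be shown on a toy index. The sketch below is not Percolator or Caffeine; it is a hypothetical in-memory class that only demonstrates the idea of touching just the postings affected by one changed document instead of rebuilding everything.

```python
class IncrementalIndex:
    """Toy inverted index that is updated one document at a time."""

    def __init__(self):
        self.postings = {}   # word -> set of document IDs
        self.doc_words = {}  # document ID -> words currently indexed for it

    def update(self, doc_id, text):
        """Index a new or re-crawled document without rebuilding the whole index."""
        new_words = set(text.lower().split())
        old_words = self.doc_words.get(doc_id, set())
        for word in old_words - new_words:  # words that disappeared from the page
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:  # words that are new on the page
            self.postings.setdefault(word, set()).add(doc_id)
        self.doc_words[doc_id] = new_words

    def lookup(self, word):
        return sorted(self.postings.get(word.lower(), ()))


index = IncrementalIndex()
index.update(1, "breaking news today")
index.update(2, "old news archive")
index.update(1, "breaking story today")  # page 1 was re-crawled with new content
print(index.lookup("news"))
print(index.lookup("story"))
```

After the re-crawl, page 1 is immediately findable under "story" and no longer under "news", without any other document being reprocessed.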

Some details about Google's inverted index compression schemes have been made public.

Server types

Google's server infrastructure is divided into several types, each assigned to a different purpose:

· Google web servers coordinate the execution of queries sent by users, then format the result into an HTML page. The execution consists of sending queries to index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document server), asking for suggestions from the spelling servers, and finally getting a list of advertisements from the ad server.

· Data-gathering servers are permanently dedicated to spidering the Web. Google's web crawler is known as Googlebot. These servers update the index and document databases and apply Google's algorithms to assign ranks to pages.

· Each index server contains a set of index shards. Index servers return a list of document IDs ("docid"), such that the document corresponding to each docid contains the query word. These servers need less disk space, but bear the greatest CPU workload.

· Document servers store documents. Each document is stored on dozens of document servers. When performing a search, a document server returns a summary for the document based on query words. They can also fetch the complete document when asked. These servers need more disk space.

· Ad servers manage advertisements offered by services like AdWords and AdSense.

· Spelling servers make suggestions about the spelling of queries.
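
The division of labour in this list can be summarized as a short sketch of what a web server does for one query. Every name here is hypothetical, and the other server types are represented by plain functions passed in as arguments, so this shows only the order of the steps, not how any real component works.

```python
from html import escape


def handle_query(query, index_servers, rank, doc_server, spelling_server, ad_server):
    """Coordinate one query across the server types described in the list above."""
    doc_ids = set()
    for index_server in index_servers:  # 1. send the query to the index servers
        doc_ids.update(index_server(query))  # 2. merge their results
    ranked = sorted(doc_ids, key=lambda d: rank(d, query), reverse=True)  # 3. rank
    results = [(d, doc_server(d, query)) for d in ranked]  # 4. summary for each hit
    return {
        "results": results,
        "suggestion": spelling_server(query),  # 5. spelling suggestion
        "ads": ad_server(query),               # 6. advertisements
    }


def render_html(page):
    """Format the merged result into a (very small) HTML fragment."""
    items = "".join(f"<li>{escape(summary)}</li>" for _, summary in page["results"])
    return f"<ul>{items}</ul>"


# Stand-in functions playing the roles of the other server types.
page = handle_query(
    "example",
    index_servers=[lambda q: [1, 2], lambda q: [3]],
    rank=lambda doc_id, q: doc_id,
    doc_server=lambda doc_id, q: f"summary of document {doc_id}",
    spelling_server=lambda q: None,
    ad_server=lambda q: ["sample ad"],
)
print(render_html(page))
```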

Wednesday, August 3, 2011

About Google