Introduction of Google File System

David Lee's picture

Why can Google dominate the search engine market? One important reason is the excellent performance relies on the file system. Google has designed a unique distributed file system to meet its huge storage demand, known as Google File System (GFS). Google did not release GFS as open source software, but still released some technical details, including an official paper.

There are two mainly differences between GFS and traditional distributed file system. First, component failures are the norm rather than the exception. The failures can be caused by application bugs, operating system bugs, human errors, and even hardware or network problems. Since even the expensive hard disk device can not completely exclude all failures, Google just simply builds its storage machine by mutiple inexpensive comodity, and against failures through integrate constant monitoring, error detection, fault tolerance, and automatic recovery to GFS.

Second, most files are mutated by appending new data rather than overwriting or removing existing data. Once written, data are usually need only to be readable but not writable. And most reading operations are “large streaming reads”, where individual operations typically read hundreds of KBs, more commonly 1 MB or more. Notice that the system stores a modest number of large files, each typically 100 MB or larger in size. GFS support small files, but does not optimize for them.

The architecture of GFS similar to the supernode (Master) and distributed nodes (chunkservers) approach. Real data will be stored in chunkservers, which report their state to Master periodically. When a client wants to read a file, it query to Master about the state of target chunkserver, and Master response the location of chunkserver if it is in idle stage. Hence the client can request chunk data to the chunkserver.

GFS supports the large volume and flows for Google search engine. On the other hand, BigTable, a database system used by a number of Google applications such as Gmail, Google Maps, Youtube and other cloud services, also built on GFS. We can say that GFS is the killer technology in the cloud generation.

More details can refer to the GFS: http://labs.google.com/papers/gfs.html