Wednesday, November 12, 2014

Hadoop Distributed File System(HDFS)

**** BLOCKS ****
All filesystem has a block size, a block size refers to the minimum amount of data that it can read or write. Filesystem blocks are typically a few kilobytes of normally 512 bytes.
Whereas in hadoop filesystem are having a much larger size of blocks i.e 64MB by default.

Now a question arises why the block system of hadoop are of big size ?

To start this - HDFS is meant to handle large files. If you have small files, smaller block sizes are better. If you have large files, larger block sizes are better.
For large files, lets say you have a 1000MB file. With a 4k block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the Name Node to figure out where that block can be found. That's a lot of traffic! isn't it . If you use 64Mb blocks, the number of requests goes down to 16, greatly reducing the cost of overhead and load on the Name Node.

**** Namenodes and Datanodes ****
Hadoop cluster has two types of nodes which are operating in a master-slave pattern.
1) A Namenode (Master)
2) 'N' Datanodes (Slave)
* N is the number of machines.

The Namenode contains the filesystem namespace. It has the filesystem tree information and the metadata for all files and directories within the tree.This information is stored on the regular interval in the local disk in two formats :
1) Namespace image
2) Edit logs

NOTE : A Namenode also knows the datanodes on which the blocks for a given file is been stored. Namenode does not store the block location on the regular basis as the location of block is always rewriten when ever the system started.

A client always access the files with communicating with Namenode and Datanodes.

Datanode stores and retrieve the blocks when ever they asked to do by client or by Namenode. Dataode also report back to the Namenode about the information of list of block they are storing.
Without the namenode the filesystem or we can say the datanodes or in other words the blocks which contains our data cannot be accessed. This is the single point of failure till hadoop 1.x, but from hadoop 2.x we have different architecture of namenode configuration which contains the secondary namenode concepts.

The secondary Namenode also has serious bottleneck, which could not solve the Single Point of Failure of hadoop cluster.

In the upcoming post, I will explain more about the recent work about this issue.

No comments: