Thursday, February 19, 2015

HDFS Comic : The easiest way to learn Hadoop's File System.

Here is a comic that describes how HDFS works.

Disclaimer : This has been taken from another resource.









Joins in Map-Reduce

There are two kinds of joins in Map-Reduce: the map-side join and the reduce-side join.


  1. Map-side join - This join is performed in the map phase, before any data reaches the reducers. It is suited to two scenarios:

One of the inputs is small enough to fit in memory - consider some metadata that needs to be attached to a much larger set of records. Here the smaller input can be replicated to every tasktracker node (typically via the distributed cache), held in memory, and joined against the bigger input as the mapper reads it (see the replicated-join sketch after this list).

Both inputs are sorted by the join key and partitioned in the same way, with the guarantee that records for a given key fall in the same partition of each input - consider the outputs of two jobs that ran with the same number of reducers and emitted the same keys. In this case an index (key, filename, offset) can be built from one of the inputs and looked up as the other input is read.

  2. Reduce-side join - This join happens in the reduce phase. It places no restrictions on the size of the inputs; the only disadvantage is that all the records from both inputs have to go through the shuffle and sort phase. It works as follows: the map phase tags each record with an identifier of its source, records with the same key reach the same reducer, and the reducer performs the join, parsing and handling records differently according to their source tag (see the reducer sketch below).
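Below is a minimal sketch of the first map-side scenario: a replicated hash join that loads the small input into memory inside the mapper. The file name departments.txt, the tab-separated layouts (deptId, deptName) and (empId, name, deptId), and the class names are illustrative assumptions, not anything fixed by Hadoop.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Replicated (map-side) join: the small departments file is loaded into a
// hash map once per task, then probed for every employee record.
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> deptById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the small side was shipped with job.addCacheFile(...) and is
        // available as "departments.txt" in the task's working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("departments.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");       // deptId <TAB> deptName
                if (parts.length == 2) {
                    deptById.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");  // empId <TAB> name <TAB> deptId
        if (fields.length != 3) {
            return;                                      // skip malformed lines
        }
        String deptName = deptById.get(fields[2]);
        if (deptName != null) {                          // inner join: drop unmatched employees
            context.write(new Text(fields[0]), new Text(fields[1] + "\t" + deptName));
        }
    }
}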
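And a minimal sketch of the reduce-side join. It assumes the map phase (not shown) has already tagged each value with its source, e.g. "D" for a department record and "E" for an employee record, and emitted it under the join key (the department id); the reducer separates the two sources and emits the joined rows. The table layout is again an assumption.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: values arrive tagged by source ("D" = department,
// "E" = employee) and grouped by the join key (the department id).
public class TaggedJoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text deptId, Iterable<Text> taggedValues, Context context)
            throws IOException, InterruptedException {
        String deptName = null;
        List<String> employees = new ArrayList<>();

        for (Text value : taggedValues) {
            String[] parts = value.toString().split("\t", 2);  // tag <TAB> payload
            if (parts.length != 2) {
                continue;                                      // skip malformed values
            }
            if ("D".equals(parts[0])) {
                deptName = parts[1];                           // at most one department per key
            } else {
                employees.add(parts[1]);
            }
        }

        if (deptName != null) {                                // emit one joined row per employee
            for (String employee : employees) {
                context.write(deptId, new Text(employee + "\t" + deptName));
            }
        }
    }
}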

Friday, February 6, 2015

Optimum Parameters and Best Practices for Large-Scale File Systems


 Big Data tools like Hadoop MapReduce, Hive, and Pig can do wonders if used correctly and wisely. Most of us know how to use these tools, but there are a few guidelines that, if followed, let you get the best efficiency out of them.

 Choosing the number of map and reduce tasks for a job is important.



a. If each task takes less than 30-40 seconds, reduce the number of tasks. Task setup and scheduling overhead is a few seconds per task, so if tasks finish very quickly a large share of the runtime goes into overhead rather than real work. In simple words, each task is under-loaded; better to increase the work per task and utilize it to the fullest. Another option is JVM reuse: the JVM spawned for one mapper can be reused by the next, so there is no overhead of spawning an extra JVM for every short-lived task (see the sketch below).
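A minimal sketch of the JVM-reuse option, assuming Hadoop 1.x (MRv1), where the mapred.job.reuse.jvm.num.tasks property controls how many tasks may share one JVM (-1 means unlimited). On YARN/MRv2 this property no longer applies, so treat it as version-dependent.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // -1 lets one JVM run an unlimited number of tasks of the same job on a node,
        // avoiding the cost of spawning a fresh JVM per short-lived task (MRv1 only).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = Job.getInstance(conf, "jvm-reuse-demo");
        job.setJarByClass(JvmReuseExample.class);
        // ... configure mapper, reducer, input and output as usual,
        // then submit with job.waitForCompletion(true).
    }
}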


b. If you are dealing with a huge input, say 1 TB, consider increasing the block size of the input dataset to 256 MB or 512 MB so that fewer mappers are spawned. Increasing the number of mappers by decreasing the block size is not a good practice: Hadoop is designed to work on large chunks of data to reduce disk seek time and increase computation speed, so always make the HDFS block size large enough to let Hadoop compute effectively (see the sketch below).
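A minimal sketch of writing data with a 256 MB block size through the HDFS Java API, so that a large dataset ends up in fewer, bigger blocks and therefore fewer map tasks. The output path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeBlockWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 256L * 1024 * 1024;   // 256 MB instead of the 64/128 MB default
        short replication = 3;
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // Hypothetical output path; every block of this file will be 256 MB,
        // so a 1 TB dataset produces ~4096 blocks (and map tasks) instead of ~16384.
        try (FSDataOutputStream out = fs.create(new Path("/data/big-dataset/part-0000"),
                true, bufferSize, replication, blockSize)) {
            out.writeBytes("record-1\n");      // write the real records here
        }
    }
}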



c. If you have 50 map slots in your cluster, avoid jobs that need 51 or 52 mappers, because the first 50 mappers finish at roughly the same time and then the 51st and 52nd run alone before the reduce tasks can start. Likewise, simply raising the number of mappers to 500, 1000 or even 2000 does not speed up your job: mappers run in parallel only up to the number of map slots available in the cluster, so with 50 slots only 50 run at a time and the rest wait in the queue for a slot to free up. The number of mappers is controlled indirectly through the input split size (see the sketch below).
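A minimal sketch of steering the number of mappers through the split size (new mapreduce API); the 256 MB and 512 MB values and the input path are illustrative assumptions, to be derived from your input size and the map slots available in your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-tuning");
        job.setJarByClass(SplitSizeTuning.class);

        FileInputFormat.addInputPath(job, new Path("/data/big-dataset")); // hypothetical input
        // One mapper is created per input split; larger splits mean fewer mappers.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);    // at least 256 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);    // at most 512 MB per split

        // ... set mapper, reducer, output path, then job.waitForCompletion(true)
    }
}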


d. The number of reduce tasks should be equal to, or slightly less than, the number of reduce slots available in your cluster, so that all reducers can run in a single wave (see the sketch below).
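A minimal sketch of sizing the reduce phase to the cluster, assuming a hypothetical cluster with 20 reduce slots; the only real knob here is job.setNumReduceTasks().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-tuning");

        int reduceSlotsInCluster = 20;               // assumed capacity; check your cluster UI
        job.setNumReduceTasks(reduceSlotsInCluster); // equal to (or slightly below) the slot count

        // ... set mapper, reducer, input/output, then job.waitForCompletion(true)
    }
}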


e. Sometimes we don't really need reducers at all, for example when a job only filters records or removes noise from the data. In such cases make sure you set the number of reducers to zero, since shuffling and sorting are expensive operations (see the map-only sketch below).
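A minimal sketch of such a map-only filtering job; the FilterMapper here is a hypothetical stand-in (it just drops empty lines) for whatever noise-removal logic the job actually needs.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyFilterJob {

    // Hypothetical filter: keeps non-empty lines as a stand-in for real noise removal.
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (!value.toString().trim().isEmpty()) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-filter");
        job.setJarByClass(MapOnlyFilterJob.class);
        job.setMapperClass(FilterMapper.class);

        job.setNumReduceTasks(0);              // map-only: output goes straight to HDFS,
                                               // skipping the expensive shuffle and sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}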