Big Data tools like Hadoop, MapReduce, Hive and Pig etc. can do wonders if used correctly and wisely. We all know the usage of these tools. But there are some points, if followed can take the core efficiency out of these tools.
Choosing the number of map and reduce tasks for a job is important.
a. If each tasks takes less than 30-40 seconds, reduce the number of tasks. The task setup and scheduling overhead is a few seconds, so if tasks finish very quickly, you are wasting time while not doing work. In simple words, your task is under loaded. Better increase the task load and utilize it to the fullest. Another option can be the reuse of JVM. The JVM spawned by one mapper can be reused by the other one, so that there is no overhead of spawning of an extra JVM.
b. If you are dealing with a huge input data size, for example, suppose 1TB, then consider increasing the block size of the input dataset to 256M or 512M, so that less number of mappers will be spawned. Increasing the number of mappers by decreasing the block size is not a good practice. Hadoop is designed to work on larger amount of data to reduce the disk seek time and increase the computation speed. So always define the HDFS block size larger enough to allow Hadoop to compute effectively.
c. If you have 50 map slots in your cluster, avoid jobs using 51 or 52 mappers, because the first 50 mappers finish at the same time and then the 51st and the 52nd will run before the reducer task can be started. So just increasing the number of mappers to 500, or 1000 or even to 2000 does not speed your job. The mappers will run in parallel according to the map slots available in your cluster. If map slot available is 50 only 50 will run in parallel, others will be in queue, waiting for the map slots to be available.
d. The number of reduce tasks should always be equal less than the reduce slot available in your cluster.
e. Sometime we don’t really use reducers. For example filtering and reduce noise in data. In these cases make sure you set the number of reducers to zero since the sorting and shuffling is an expensive operation.