Big Data tools such as Hadoop, MapReduce, Hive and Pig can do wonders
if used correctly and wisely. Most of us know how to use these tools, but
following a few guidelines helps you get the most efficiency out of them.
Choosing the number of map and reduce tasks for a job is important.
a. If each task takes less than 30-40 seconds, reduce the number of
tasks. Task setup and scheduling overhead is a few seconds, so if tasks
finish very quickly, much of the job's time goes to overhead rather than to
useful work. In simple words, your tasks are underloaded. It is better to
give each task more work and use it to the fullest. Another option is JVM
reuse: the JVM spawned for one mapper can be reused by the next one, which
avoids the overhead of spawning an extra JVM.
b. If you are dealing with a huge input, for example 1 TB, consider
increasing the block size of the input dataset to 256 MB or 512 MB, so that
fewer mappers are spawned. Increasing the number of mappers by decreasing
the block size is not a good practice. Hadoop is designed to work on large
amounts of data, which reduces disk seek time and increases computation
speed. So always define an HDFS block size large enough to allow Hadoop to
compute effectively.
c. If you have 50 map slots in your cluster, avoid jobs that use 51 or 52
mappers. The first 50 mappers run together and finish at about the same
time, and then the 51st and 52nd must run in a second, nearly empty wave
before the reducers can proceed. For the same reason, simply increasing the
number of mappers to 500, 1000 or even 2000 does not speed up your job. The
mappers run in parallel only up to the number of map slots available in
your cluster. If 50 map slots are available, only 50 mappers run in
parallel; the others wait in the queue for a map slot to become free.
d. The number of reduce tasks should always be equal to or less than the
number of reduce slots available in your cluster, so that all reducers can
run in a single wave.
e. Sometimes we don't really need reducers, for example when filtering
data or removing noise from it. In these cases make sure you set the number
of reducers to zero, since sorting and shuffling are expensive operations.
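
To make the arithmetic behind points b and c concrete, here is a minimal Python sketch. It is only a back-of-the-envelope estimate, not how Hadoop itself plans a job: it assumes one map task per HDFS block (real split counts also depend on file boundaries and the input format), and the function names are made up for illustration. The 64 MB block size is included only as a comparison point and is an assumption, not a figure from the tips above.

```python
# Back-of-the-envelope estimates only; the helper names are hypothetical
# and nothing here calls into Hadoop.

MB = 1024 ** 2
TB = 1024 ** 4


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def estimate_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """Assume roughly one map task per HDFS block of input."""
    return ceil_div(input_bytes, block_bytes)


def map_waves(map_tasks: int, map_slots: int) -> int:
    """Number of rounds needed when only map_slots mappers run at once."""
    return ceil_div(map_tasks, map_slots)


if __name__ == "__main__":
    slots = 50  # map slots in the example cluster from point c

    # Point b: a larger block size means fewer mappers for the same 1 TB.
    for block_mb in (64, 256, 512):
        tasks = estimate_map_tasks(TB, block_mb * MB)
        print(f"{block_mb} MB blocks: {tasks} mappers, "
              f"{map_waves(tasks, slots)} waves on {slots} slots")

    # Point c: 52 mappers on 50 slots need a second, nearly empty wave.
    for tasks in (50, 52):
        print(f"{tasks} mappers on {slots} slots: "
              f"{map_waves(tasks, slots)} wave(s)")
```

Under these assumptions, going from 50 to 52 mappers doubles the number of waves without adding any real parallelism, which is exactly the situation point c warns about.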