Thursday, September 4, 2014

Benchmark Testing of Hadoop Cluster with TestDFSIO

This blog post helps Hadoop newcomers learn how to measure the performance of a Hadoop cluster.
In this article I describe an important benchmarking tool that is included in the Apache Hadoop distribution, namely TestDFSIO. TestDFSIO is one of the standard benchmarks commonly used to test HDFS.


Before we start, a few things need to be installed and configured on your system. You need a working Hadoop installation; for that, download Hadoop from here.
Configure it in either of these two modes:
i) Single-node mode.
ii) Multi-node cluster mode.


The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS, discovering performance bottlenecks in your network, shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes), and getting a first impression of how fast your cluster is in terms of I/O.

The source code of TestDFSIO can be found here.

Run write tests before read tests :

Always run the write test on HDFS before the read test.
The read test of TestDFSIO does not generate its own input files. For this reason, it is convenient to run a write test first and then follow up with a read test on the same data.

Run a write test :

The command to run a write test that generates 10000 output files of 1 GB each (the -fileSize value is given in MB) is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10000 -fileSize 1000
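
Before launching a run of this size, it is worth checking how much data it will actually write. The snippet below is only a rough sketch and not part of TestDFSIO: the helper name estimate_footprint is made up for illustration, the replication factor of 3 is assumed to be the HDFS default (dfs.replication), and 1 TB is treated as 10^6 MB for simplicity.

```python
# Sketch: estimate how much data a TestDFSIO write test produces.
# estimate_footprint is a hypothetical helper, not a Hadoop API.

def estimate_footprint(nr_files, file_size_mb, replication=3):
    """Return (logical_mb, raw_mb) for a write test.

    replication=3 is assumed to match the HDFS default (dfs.replication);
    change it to whatever your cluster uses.
    """
    logical_mb = nr_files * file_size_mb   # what TestDFSIO reports as processed
    raw_mb = logical_mb * replication      # what ends up on the DataNode disks
    return logical_mb, raw_mb


logical_mb, raw_mb = estimate_footprint(nr_files=10000, file_size_mb=1000)
# Rough conversion, treating 1 TB as 10**6 MB.
print(f"logical data: {logical_mb} MB (~{logical_mb / 1e6:.1f} TB)")
print(f"raw disk use: {raw_mb} MB (~{raw_mb / 1e6:.1f} TB)")
```

For the command above this comes to about 10 TB of logical data, so on a small test cluster you may prefer to start with much smaller values of -nrFiles and -fileSize.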

Run a read test :

The command to read the 10000 generated files of 1 GB each is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10000 -fileSize 1000

Clean up and remove test data :

$ hadoop jar hadoop-*test*.jar TestDFSIO -clean

This removes the benchmark data in the directory /benchmarks/TestDFSIO on HDFS.
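
If you run the benchmark regularly, the write, read and clean steps can be scripted so that they always run in the right order. The following is a minimal sketch under a few assumptions: the function names build_commands and run_benchmark are made up for this post, the jar name hadoop-test.jar is a placeholder for the actual test jar of your distribution, and the hadoop command is assumed to be on your PATH.

```python
import subprocess


def build_commands(jar, nr_files, file_size_mb):
    """Return the write, read and clean commands in the order they must run."""
    base = ["hadoop", "jar", jar, "TestDFSIO"]
    sizing = ["-nrFiles", str(nr_files), "-fileSize", str(file_size_mb)]
    return [
        base + ["-write"] + sizing,  # creates the files the read test needs
        base + ["-read"] + sizing,   # reads back what the write test produced
        base + ["-clean"],           # removes the benchmark data from HDFS
    ]


def run_benchmark(jar, nr_files, file_size_mb):
    """Run the three steps in order and stop at the first failure."""
    for cmd in build_commands(jar, nr_files, file_size_mb):
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder jar name: subprocess does not expand the shell glob
    # hadoop-*test*.jar, so pass the concrete file name here.
    for cmd in build_commands("hadoop-test.jar", 10, 100):
        print(" ".join(cmd))
```

Running the script as is only prints the three commands; call run_benchmark with the same arguments once the printed commands look right for your cluster.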

TestDFSIO results :

----- TestDFSIO ----- : write
           Date & time: Fri Apr 08 2011
       Number of files: 10000
Total MBytes processed: 10000000
     Throughput mb/sec: 4.989
Average IO rate mb/sec: 5.185
 IO rate std deviation: 0.960
    Test exec time sec: 1813.53

----- TestDFSIO ----- : read
           Date & time: Fri Apr 08 2011
       Number of files: 10000
Total MBytes processed: 10000000
     Throughput mb/sec: 11.349
Average IO rate mb/sec: 22.341
 IO rate std deviation: 119.231
    Test exec time sec: 1144.842
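
To compare several runs, it can help to turn such a result block into a small data structure. The sketch below parses the "label: value" layout shown above; parse_results is a hypothetical helper written for this post, and it assumes the output has exactly this layout.

```python
def parse_results(text):
    """Parse one TestDFSIO result block into a dict of label -> value string."""
    results = {}
    for line in text.splitlines():
        if line.startswith("-----"):
            # Header line, e.g. "----- TestDFSIO ----- : write"
            results["test"] = line.rsplit(":", 1)[1].strip()
            continue
        # Split on the first colon only, so values may contain colons too.
        label, sep, value = line.partition(":")
        if sep:
            results[label.strip()] = value.strip()
    return results


SAMPLE = """\
----- TestDFSIO ----- : write
           Date & time: Fri Apr 08 2011
       Number of files: 10000
Total MBytes processed: 10000000
     Throughput mb/sec: 4.989
Average IO rate mb/sec: 5.185
 IO rate std deviation: 0.960
    Test exec time sec: 1813.53
"""

write = parse_results(SAMPLE)
# Sanity check: total MB divided by file count gives the MB per file.
mb_per_file = float(write["Total MBytes processed"]) / float(write["Number of files"])
print(write["test"], mb_per_file, float(write["Throughput mb/sec"]))
```

The sanity check confirms that each file is 1000 MB, which matches the -fileSize 1000 argument used in the commands.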

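In these results, the most notable metrics are Throughput mb/sec and Average IO rate mb/sec. Both are based on the file size written (or read) by the individual map tasks and the elapsed time to do so. Throughput divides the total amount of data by the sum of the task times, whereas the average IO rate is the mean of the per-task rates, so the two can differ noticeably when task speeds vary, as they do in the read test.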
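
To see why the two numbers can drift apart, here is a small sketch that computes both from per-task figures. It reflects my understanding of how TestDFSIO derives these metrics rather than a copy of its code; the function name summarize and the task numbers are invented for illustration.

```python
import math


def summarize(tasks):
    """tasks: list of (megabytes, seconds) pairs, one pair per map task."""
    total_mb = sum(mb for mb, _ in tasks)
    total_sec = sum(sec for _, sec in tasks)
    rates = [mb / sec for mb, sec in tasks]

    throughput = total_mb / total_sec       # total data / summed task time
    avg_io_rate = sum(rates) / len(rates)   # mean of the per-task rates
    mean_sq = sum(r * r for r in rates) / len(rates)
    std_dev = math.sqrt(abs(mean_sq - avg_io_rate ** 2))
    return throughput, avg_io_rate, std_dev


# Hypothetical run: three tasks of 1000 MB each, one of them much faster.
tasks = [(1000, 200.0), (1000, 200.0), (1000, 20.0)]
throughput, avg_io_rate, std_dev = summarize(tasks)
print(f"throughput:   {throughput:.2f} MB/s")
print(f"avg IO rate:  {avg_io_rate:.2f} MB/s")
print(f"IO rate std:  {std_dev:.2f}")
```

A single fast task pulls the average IO rate up to 20 MB/s while the throughput stays near 7 MB/s, and the standard deviation becomes large. The read results above show the same pattern: the average IO rate is about twice the throughput, with a very high standard deviation.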

I will follow up with more articles on Hadoop. Thanks!