This blog post helps Hadoop newcomers learn how to measure the performance of a Hadoop cluster.
In this article I describe an important benchmarking tool that is included in the Apache Hadoop distribution, namely TestDFSIO. TestDFSIO is one of the standard benchmarks commonly used to test HDFS.
Prerequisites
Before we start, a few things should be installed and configured on your system. You need a working Hadoop cluster. For that, download Hadoop from here.
Configure either of these two:
i) Single-node mode.
ii) Multi-node cluster mode.
TestDFSIO :
The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS, discovering performance bottlenecks in your network, shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes), and getting a first impression of how fast your cluster is in terms of I/O.
Run write tests before read tests :
Always run the write test on HDFS before the read test.
The read test of TestDFSIO does not generate its own input files. For this reason, it is convenient to run a write test first and then follow up by reading the same data.
Run a write test :
The command to run a write test that generates 10000 output files, each of size 1 GB (-fileSize is given in MB), is:
$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10000 -fileSize 1000
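Before launching a write test of this size, it is worth checking how much HDFS space it will need. Below is a minimal sketch of the arithmetic, assuming -fileSize is in MB and a replication factor of 3 (the usual HDFS default; the dfs.replication setting of your cluster may differ). The function name dfsio_footprint_mb is hypothetical and not part of Hadoop.

```shell
# Hypothetical helper: estimate the raw HDFS space (in MB) that a
# TestDFSIO write test will occupy.
#   $1 = number of files (-nrFiles)
#   $2 = size of each file in MB (-fileSize)
#   $3 = replication factor (assumed to be 3 if omitted)
dfsio_footprint_mb() {
  nr_files=$1
  file_size_mb=$2
  replication=${3:-3}
  echo $(( nr_files * file_size_mb * replication ))
}

# The write test above: 10000 files of 1000 MB each
dfsio_footprint_mb 10000 1000 1   # logical data only: 10000000 MB
dfsio_footprint_mb 10000 1000     # with 3 replicas:   30000000 MB
```

If the cluster does not have that much free space, reduce -nrFiles or -fileSize for a first trial run.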
Run a read test :
The command to read the 10000 generated files, each of size 1 GB, is:
$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10000 -fileSize 1000
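Since the read test depends on the files left behind by the write test, the two commands can be chained so that the read test only starts if the write test succeeded. This is just a sketch: the functions run and dfsio_write_then_read, the DRY_RUN variable and the jar name hadoop-test.jar are my own hypothetical names, so adjust them to your installation.

```shell
# Hypothetical wrapper: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Run the write test, then the read test with the same parameters.
# The read test is skipped if the write test fails.
#   $1 = path to the Hadoop test jar
#   $2 = number of files (-nrFiles)
#   $3 = size of each file in MB (-fileSize)
dfsio_write_then_read() {
  jar=$1
  nr_files=$2
  file_size_mb=$3
  run hadoop jar "$jar" TestDFSIO -write -nrFiles "$nr_files" -fileSize "$file_size_mb" &&
  run hadoop jar "$jar" TestDFSIO -read -nrFiles "$nr_files" -fileSize "$file_size_mb"
}

# Preview the commands for a small trial run (10 files of 100 MB each)
DRY_RUN=1 dfsio_write_then_read hadoop-test.jar 10 100
```

Using the same -nrFiles and -fileSize for both tests makes sure the read test finds exactly the data that was written.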
Clean up and remove test data :
$ hadoop jar hadoop-*test*.jar TestDFSIO -clean
This removes the test data in the directory /benchmarks/TestDFSIO on HDFS.
TestDFSIO results :
----- TestDFSIO ----- : write
Date & time: Fri Apr 08 2011
Number of files: 10000
Total MBytes processed: 10000000
Throughput mb/sec: 4.989
Average IO rate mb/sec: 5.185
IO rate std deviation: 0.960
Test exec time sec: 1813.53
----- TestDFSIO ----- : read
Date & time: Fri Apr 08 2011
Number of files: 10000
Total MBytes processed: 10000000
Throughput mb/sec: 11.349
Average IO rate mb/sec: 22.341
IO rate std deviation: 119.231
Test exec time sec: 1144.842
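To my knowledge, TestDFSIO also appends each summary like the ones above to a local file, by default TestDFSIO_results.log in the directory where the command was started. Assuming that file name and the line format shown above, a small sketch for pulling out the latest throughput value could look like this. The function name dfsio_last_throughput is hypothetical.

```shell
# Hypothetical helper: print the most recent "Throughput mb/sec" value
# found in a TestDFSIO results file.
#   $1 = path to the results file (assumed default: TestDFSIO_results.log)
dfsio_last_throughput() {
  awk -F': *' '/Throughput mb\/sec/ { value = $2 } END { print value }' \
    "${1:-TestDFSIO_results.log}"
}

# Example with a shortened copy of the results shown above
sample=$(mktemp)
cat > "$sample" <<'EOF'
----- TestDFSIO ----- : write
Throughput mb/sec: 4.989
----- TestDFSIO ----- : read
Throughput mb/sec: 11.349
EOF
dfsio_last_throughput "$sample"   # prints 11.349 (the last run in the file)
rm -f "$sample"
```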
The most notable metrics in these results are Throughput mb/sec and Average IO rate mb/sec. Both are based on the file size written (or read) by the individual map tasks and the elapsed time to do so.
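The two metrics differ in how they combine the per-task numbers. As far as I understand the benchmark, throughput divides the total data by the total time summed over all map tasks, while the average IO rate is the mean of the individual per-task rates. The sketch below illustrates that difference under this assumption. The function dfsio_metrics and the per-task numbers are made up for illustration and are not TestDFSIO output.

```shell
# Hypothetical helper: compute both metrics from per-task measurements.
# Input on stdin: one line per map task, "<MB processed> <seconds taken>".
dfsio_metrics() {
  awk '
    { total_mb += $1; total_sec += $2; rate_sum += $1 / $2; tasks++ }
    END {
      # Throughput: all data divided by the summed task times
      printf "Throughput mb/sec: %.3f\n", total_mb / total_sec
      # Average IO rate: mean of the individual task rates
      printf "Average IO rate mb/sec: %.3f\n", rate_sum / tasks
    }'
}

# Made-up numbers for three map tasks, each handling a 1000 MB file
printf '%s\n' '1000 100' '1000 200' '1000 500' | dfsio_metrics
# Throughput mb/sec: 3.750
# Average IO rate mb/sec: 5.667
```

A few fast tasks pull the average IO rate up while hardly affecting throughput. That would be consistent with the read results above, where the average IO rate is about twice the throughput and the standard deviation is large.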
I will follow up with more articles on Hadoop. Thanks!