Friday, December 26, 2014

Top few NameNode-related Problems

In this blog post I want to share a few NameNode-related issues that came up frequently in my research:

  • We want High Availability (HA), but the NameNode is a single point of failure (SPOF). This results in downtime due to hardware failures and user errors. In addition, it is often non-trivial to recover from a NameNode failure, so our Hadoop administrators always need to be on call.
  • We want to run Hadoop with 100% commodity hardware. To run HDFS in production and not lose all our data in the event of a power outage, HDFS requires us to deploy a commercial NAS to which the NameNode can write a copy of its edit log. In addition to the prohibitive cost of a commercial NAS, the entire cluster goes down any time the NAS is down, because the NameNode needs to hard-mount the NAS (for consistency reasons).
  • We need both a NameNode and a Secondary NameNode. We read some documentation that suggested purchasing higher-end servers for these roles (e.g., dual power supplies). We only have 20 nodes in the cluster, so this represents a 15-20% hardware cost overhead with no real value (i.e., it doesn’t contribute to the overall capacity or throughput of the cluster).
  • We have a significant number of files. Even though we have hundreds of nodes in the cluster, the NameNode keeps all its metadata in memory, so we are limited to a maximum of only 50-100M files in the entire cluster. While we can work around that by concatenating files into larger files, that adds tremendous complexity. (Imagine what it would be like if you had to start combining the documents on your laptop into zip files because there was a severe limit on how many files you could have.)
  • We have a relatively small cluster, with only 10 nodes. Due to the DataNode-NameNode block report mechanism, we cannot exceed 100-200K blocks (or files) per node, thereby limiting our 10-node cluster to less than 2M files. While we can work around that by concatenating files into larger files, that adds tremendous complexity.
  • We need much higher performance when creating and processing a large number of files (especially small files). Hadoop is extremely slow.
  • We have had outages and latency spikes due to garbage collection on the NameNode. Although CMS (concurrent mark-and-sweep) can be used as the garbage collector, the NameNode still freezes occasionally, causing the DataNodes to lose connectivity (i.e., become blacklisted).
  • When we change permissions on a file (chmod 400 arch), the change does not affect existing clients that have already opened the file. We have no way of knowing who those clients are, so it is impossible to know when the permission change really becomes effective, if at all.
  • We have lost data due to various errors on the NameNode. In one case, the root partition ran out of space, and the NameNode crashed with a corrupted edit log.
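Several of these limits come down to simple arithmetic on NameNode heap. Here is a rough back-of-the-envelope sketch, assuming the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file or block); the file count and blocks-per-file average below are illustrative assumptions, not measured values:

```shell
# Rough estimate of NameNode heap for a given file count.
# ~150 bytes of heap per namespace object is a common rule of thumb.
files=50000000            # 50M files (illustrative)
blocks_per_file=2         # assumed average
bytes_per_object=150
objects=$(( files + files * blocks_per_file ))            # files + blocks
heap_mb=$(( objects * bytes_per_object / 1024 / 1024 ))
echo "$heap_mb MB of NameNode heap"
```

With these numbers the namespace alone needs on the order of 20 GB of heap, which is why the file count, not the disk capacity, becomes the ceiling.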

Monday, December 15, 2014

The Top 10 Big Data Quotes of All Time

  1. “Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.” – Chris Lynch, Vertica Systems
  2. “Big data is not about the data” – Gary King, Harvard University, making the point that while data is plentiful and easy to collect, the real value is in the analytics.
  3. “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” – Eric Schmidt, of Google, said in 2010.
  4. “Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Gartner Research
  5. “I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding” – Hal Varian, Google
  6. “You can have data without information, but you cannot have information without data.” – Daniel Keys Moran, computer programmer and science fiction author
  7. “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.” – Atul Butte, Stanford School of Medicine
  8. “Errors using inadequate data are much less than those using no data at all.” – Charles Babbage, inventor and mathematician
  9. “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem – he may be able to say what the experiment died of.” – Ronald Fisher, biologist, geneticist and statistician
  10. “Without big data, you are blind and deaf in the middle of a freeway” – Geoffrey Moore, management consultant and theorist

Thursday, December 11, 2014

Mongo-Hadoop Connector: The Last One I Used.

Last week during my project work, I used the mongo-hadoop connector to create a linked table in Hive.

To use the mongo-hadoop connector, you need the relevant jars in /usr/lib/hadoop and /usr/lib/hive, depending on your installation.

You can link a collection in MongoDB to a Hive table as below:

CREATE TABLE persons (
  id INT,
  name STRING,
  age INT,
  work STRUCT<title:STRING, hours:INT>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.persons');
This creates a persons table in Hive that shows the documents in the collection; you can think of it as a kind of VIEW onto the documents in the collection.
When you execute:
select * from persons;
It will show all the documents present in collection test.persons.
But one MAJOR PITFALL here is that when you execute:
drop table persons;
It will drop the collection in MongoDB, which is a BLUNDER when you have linked a collection with 500 million records (and I made that mistake!!)
So, when you link a collection in Hive, always create an EXTERNAL table like:
CREATE EXTERNAL TABLE persons (
  id INT,
  name STRING,
  age INT,
  work STRUCT<title:STRING, hours:INT>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.persons');
With this approach, even when you drop the table in Hive, your collection in MongoDB remains intact, and you avoid the pain of importing 500 million records all over again.
Sometimes small pointers can make HUGE difference!!

How to Use HDFS Programmatically

While HDFS can be manipulated explicitly through user commands, or implicitly as the input to or output from a Hadoop MapReduce job, you can also work with HDFS inside your own Java applications. (A JNI-based wrapper, libhdfs, also provides this functionality in C/C++ programs.)

This section provides a short tutorial on using the Java-based HDFS API. It will be based on the following code listing:

 1: import;
 2: import;
 4: import org.apache.hadoop.conf.Configuration;
 5: import org.apache.hadoop.fs.FileSystem;
 6: import org.apache.hadoop.fs.FSDataInputStream;
 7: import org.apache.hadoop.fs.FSDataOutputStream;
 8: import org.apache.hadoop.fs.Path;
10: public class HDFSHelloWorld {
12:   public static final String theFilename = "hello.txt";
13:   public static final String message = "Hello, world!\n";
15:   public static void main(String[] args) throws IOException {
17:     Configuration conf = new Configuration();
18:     FileSystem fs = FileSystem.get(conf);
20:     Path filenamePath = new Path(theFilename);
22:     try {
23:       if (fs.exists(filenamePath)) {
24:         // remove the file first
25:         fs.delete(filenamePath);
26:       }
28:       FSDataOutputStream out = fs.create(filenamePath);
29:       out.writeUTF(message);
30:       out.close();
32:       FSDataInputStream in =;
33:       String messageIn = in.readUTF();
34:       System.out.print(messageIn);
35:       in.close();
36:     } catch (IOException ioe) {
37:       System.err.println("IOException during operation: " + ioe.toString());
38:       System.exit(1);
39:     }
40:   }
41: }

This program creates a file named hello.txt, writes a short message into it, then reads it back and prints it to the screen. If the file already exists, it is deleted first.

First we get a handle to an abstract FileSystem object, as specified by the application configuration. The Configuration object created uses the default parameters.

17: Configuration conf = new Configuration();
18: FileSystem fs = FileSystem.get(conf);

The FileSystem interface actually provides a generic abstraction suitable for use in several file systems. Depending on the Hadoop configuration, this may use HDFS or the local file system or a different one altogether. If this test program is launched via the ordinary ‘java classname’ command line, it may not find conf/hadoop-site.xml and will use the local file system. To ensure that it uses the proper Hadoop configuration, launch this program through Hadoop by putting it in a jar and running:

$HADOOP_HOME/bin/hadoop jar yourjar HDFSHelloWorld

Regardless of how you launch the program and which file system it connects to, writing to a file is done in the same way:

28: FSDataOutputStream out = fs.create(filenamePath);
29: out.writeUTF(message);
30: out.close();

First we create the file with the fs.create() call, which returns an FSDataOutputStream used to write data into the file. We then write the information using ordinary stream writing functions; FSDataOutputStream extends the class. When we are done with the file, we close the stream with out.close().

This call to fs.create() will overwrite the file if it already exists, but for the sake of example, this program explicitly removes the file first anyway (note that depending on this explicit prior removal is technically a race condition). Testing for whether a file exists and removing an existing file are performed by lines 23-26:

23: if (fs.exists(filenamePath)) {
24: // remove the file first
25: fs.delete(filenamePath);
26: }

Other operations such as copying, moving, and renaming are equally straightforward operations on Path objects performed by the FileSystem.

Finally, we re-open the file for reading, pull the bytes from the file, converting them to a UTF-8 encoded string in the process, and print them to the screen:

32: FSDataInputStream in =;
33: String messageIn = in.readUTF();
34: System.out.print(messageIn);
35: in.close();

The method returns an FSDataInputStream, which subclasses Data can be read from the stream using the readUTF() operation, as on line 33. When we are done with the stream, we call close() to free the handle associated with the file.

More information:

Complete JavaDoc for the HDFS API is provided at

Another example HDFS application is available on the Hadoop wiki. This implements a file copy operation.

Sunday, December 7, 2014

Few Common Problems You Face While Installing Hadoop

Here are a few common problems that people face while installing Hadoop, listed below:

1. Problem with ssh configuration.
 error: connection refused to port 22

2. NameNode node not reachable
    error: Retrying to connect

1. Problem with ssh configuration: In this case you may face many kinds of errors, but the most common one while installing Hadoop is "connection refused to port 22". First, check whether the machine you are trying to log in to has an SSH server installed.
If you are using Ubuntu, you can install the ssh server using the following command:
   $sudo apt-get install openssh-server
   On CentOS or RedHat you can install the ssh server using the yum package manager:
   $sudo yum install openssh-server
   After you have installed the ssh server, make sure you have configured the keys properly and shared the public key with the machine that you want to log in to. If the problem persists, check the ssh configuration on your machine in the /etc/ssh/sshd_config file. Use the following command to read this file:
   $sudo gedit /etc/ssh/sshd_config
   In this file, RSAAuthentication should be set to yes, and public-key (passwordless) authentication should also be set to yes.
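For reference, the relevant lines of /etc/ssh/sshd_config look like the following (option names as in stock OpenSSH; note that RSAAuthentication only applies to legacy protocol-1 keys, and PubkeyAuthentication is the one that matters for normal key-based login):

```
# /etc/ssh/sshd_config -- settings relevant to passwordless login
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile %h/.ssh/authorized_keys
```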
   After this, close the file and restart ssh with the following command:
   $sudo /etc/init.d/ssh restart
   Now your problem should be resolved. Apart from this error you can face one more issue: even though you have configured the keys correctly, ssh still prompts for a password. In that case, check that your keys are where ssh expects them; they should be in the

   $HOME/.ssh folder

 2. If your NameNode is not reachable, the first thing you should check is the daemons running on the NameNode machine. You can check that with the following command:

   $jps

   This command lists all the Java processes running on your machine. If you don't see NameNode in the output list, do the following. Stop Hadoop with the following command:
   $HADOOP_HOME/bin/
   Format the NameNode using the following command (note that formatting erases existing HDFS metadata, so only do this on a fresh or expendable installation):
   $HADOOP_HOME/bin/hadoop namenode -format
   Start Hadoop with the following command:
   $HADOOP_HOME/bin/
   This time the NameNode should run. If you are still not able to start the NameNode, then check the core-site.xml file in the conf directory of Hadoop with the following command:
   $gedit $HADOOP_HOME/conf/core-site.xml
   Check the value of the property hadoop.tmp.dir. It should be set to a path where the user running Hadoop has write permission. If you don't want to scratch your head over this, set it to a $HOME/hadoop_tmp directory. Now save and close this file, format the NameNode again, and try starting Hadoop. Things should work this time.
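As a sketch, the hadoop.tmp.dir property in core-site.xml would look like this (the path shown is just an example; point it anywhere the Hadoop user can write):

```
  <value>/home/user/hadoop_tmp</value>
  <description>A base for other temporary directories.</description>
```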

   That's all for this post. Please share the problems that you are facing and we will try to solve them together. Stay tuned for more stuff :)

Friday, December 5, 2014

Hive : How to install it on top of Hadoop in Ubuntu

What is Apache Hive?

Apache Hive is a data warehouse infrastructure that facilitates querying and managing large data sets that reside in distributed storage. It is built on top of Hadoop and was developed by Facebook. Hive provides a way to query the data using a SQL-like query language called HiveQL (Hive Query Language).

Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.

Difference between Hive and SQL?

Hive looks very similar to a traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences:

As Hadoop is intended for long sequential scans and Hive is based on Hadoop, queries tend to have very high latency. This means that Hive is not appropriate for applications that need very fast response times, as you would expect from a traditional RDBMS.

Finally, Hive is read-oriented and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

Hive Installation on Ubuntu:

Follow the below steps to install Apache Hive on Ubuntu:

Step 1:  Download Hive tar.

Download the latest Hive version from here

Step 2: Untar the file.

Step 3: Edit the “.bashrc” file to update the environment variables for user.

   $sudo gedit .bashrc

Add the following at the end of the file:

export HADOOP_HOME=/home/user/hadoop-2.4.0
export HIVE_HOME=/home/user/hive-0.14.0-bin
export PATH=$PATH:$HIVE_HOME/bin

Step 4:  Create Hive directories within HDFS.

NOTE: Run the commands from the bin folder of your Hadoop installation.

$hadoop fs -mkdir /user/hive/warehouse

The directory ‘warehouse’ is the location to store the tables and data related to Hive.

$hadoop fs -mkdir /tmp

The temporary directory ‘tmp’ is the location to store the intermediate results of processing.

Step 5: Set read/write permissions for table.

In this command we are giving write permission to the group:

$hadoop fs -chmod 774  /user/hive/warehouse

$hadoop fs -chmod 774  /tmp
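If the octal notation is unfamiliar: 774 means rwx for the owner (7), rwx for the group (7), and read-only for others (4). You can see the effect on any local directory (a throwaway local demo, nothing Hive-specific):

```shell
# Demonstrate what mode 774 grants, using a scratch local directory.
d=$(mktemp -d)
chmod 774 "$d"
stat -c '%a %A' "$d"     # octal and symbolic form of the permissions
rmdir "$d"
```

The same bit meanings apply to `hadoop fs -chmod`, which mirrors the POSIX convention.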

Step 6:  Set Hadoop path in Hive

cd hadoop // my current directory where hadoop is stored.
cd hive*-bin
cd bin
sudo gedit

In the configuration file, add the following:

export HADOOP_HOME=/home/user/hadoop-2.4.0

Step 7: Launch Hive.

***[run from bin of Hive] 

Command: $hive

Step 8: Test your setup

hive> show tables;

Don't forget to put a semicolon after this command :P
Press Ctrl+C to exit Hive 

Happy Hadooping! ;) 

Wednesday, December 3, 2014

Books and Related Stuffs You Need to Get Started with Hadoop

In my last few blogs, I provided basic knowledge of Hadoop and HDFS, and a few posts on MapReduce too. The emails I got from various readers of the blog have been very appreciative, and many readers have been drawn to Big Data and Hadoop technology.

To help further, I would like to tell you about a few of the books and web resources where you can start reading. This blog post is dedicated to that.

So if you are reading my blog articles and are interested in learning Hadoop, you must be familiar with the power of Big Data and why people are going gaga over it.

You can refer to these small articles about Big Data ,HDFS and Mapreduce.

You may like to read about Pre-requisites for getting started with Big Data Technologies to get yourself started with Big Data technologies.

Now the main topic of the blog.

The first book I would recommend to you guys out there is: Hadoop: The Definitive Guide, 3rd Edition by Tom White. I started my Big Data journey with this book and believe me, it is the best resource for you if you are new to the Big Data world. The book is elegantly written to explain the concepts topic-wise. It also gives you an example Weather Dataset that is carried almost throughout the book to help you understand how things work in Hadoop.

The second book I like reading and which is also very helpful is: Hadoop in Practice by Alex Holmes. Hadoop in Practice collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs.

The third one, which is written really simply, is: Hadoop in Action by Chuck Lam. Hadoop in Action teaches readers how to use Hadoop and write MapReduce programs. The intended readers are programmers, architects, and project managers who have to process large amounts of data offline. Hadoop in Action will lead the reader from obtaining a copy of Hadoop to setting it up in a cluster and writing data analytic programs.
Note: this book uses the old Hadoop API.

And lastly, if you are more into the administrative side, you can go for Hadoop Operations by Eric Sammer. Along with development, this book talks mainly about administration and maintenance of huge clusters with large data sets in production environments. Eric Sammer, Principal Solution Architect at Cloudera, shows you the particulars of running Hadoop in production, from planning, installing, and configuring the system to providing ongoing maintenance.

Well, these are the books that you can refer to for your understanding, better conceptual visualization, and practical hands-on work with the Hadoop framework.

Apart from these books, if you want to go for the API, you can see the Hadoop API docs here; also very useful is Data-Intensive Text Processing with MapReduce.

I hope you find these books and resources helpful for understanding Hadoop and its power in depth.

If you have any question, or you want a specific tutorial on Hadoop, you can request it by email. I will try to get back to you as soon as possible :)


Tuesday, December 2, 2014

How to Install Hadoop on Ubuntu or any other Linux Platform

Following are the steps for installing Hadoop. I have just listed the steps with very brief explanation at some places. This is more or less like some reference notes for installation. I made a note of this when I was installing Hadoop on my system for the very first time.

Please let me know if you need any specific details.

Installing HDFS (Hadoop Distributed File System)
OS : Ubuntu

Installing Sun Java on Ubuntu
$sudo apt-get update
$sudo apt-get install oracle-java7-installer
$sudo update-java-alternatives -s java-7-oracle

Create hadoop user

$sudo addgroup hadoop
$sudo adduser --ingroup hadoop hduser

Install SSH Server if not already present. This is needed as hadoop does an ssh into localhost for execution.

$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Installing Hadoop

Download hadoop from Apache Downloads.

The download link for the latest Hadoop 2.6.0 can be found here.
Download hadoop-2.6.0.tar.gz from the link, untar it, and rename the directory:

$tar xzf hadoop-2.6.0.tar.gz
$mv hadoop-2.6.0 hadoop

Edit .bashrc

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

We only need to update the JAVA_HOME variable in this file. Simply open the file in a text editor using the following command:

$gedit /home/hduser/hadoop/conf/

Add the following:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Temp directory for hadoop

$mkdir /home/hduser/tmp

Configurations for hadoop

$cd /home/hduser/hadoop/conf/

Open core-site.xml using a text editor and add the following configurations between the <configuration> .. </configuration> xml elements:

<description>A base for other temporary directories.</description>

<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>

Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:

<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.

Formatting NameNode

You should format the NameNode in your HDFS. Do not do this step when the system is running. It is usually done once, at installation time.

Run the following command

$/home/hduser/hadoop/bin/hadoop namenode -format

Starting Hadoop Cluster

From hadoop/bin:

$./

To check for running processes use:

$jps
If jps is not installed, do the following

sudo update-alternatives --install /usr/bin/jps jps /usr/lib/jvm/java-7-oracle/bin/jps 1

Tasks running should be as follows:

NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
Jps

NOTE: This is for a single node setup. If you configure it for a cluster setup, the daemons will be shown on their specific servers.

Stopping Hadoop Cluster

From hadoop/bin:

$./
Example Application to test success of hadoop:

Follow this post of mine to test whether it is successfully configured or not :)

For any other query, feel free to comment in the below thread. :)

Happy Hadooping.

Pre-requisites for getting started with Big Data Technologies

Lots of my friends who have heard about the Big Data world, or may be interested in getting into it, have this query: what are the prerequisites for learning or digging into Big Data technologies, and what technologies come under Big Data?

Well, this is quite a difficult question to answer, because there is no distinct line around what comes under the hood. But one thing is for sure: Big Data is not only about Hadoop, as lots of us out there have this misconception.

Hadoop is just a framework that is used in Big Data. And yes, it is used quite a lot; I would say it is one of the integral parts of Big Data. But besides Hadoop there are tons of tools and technologies that come under the same umbrella. To name a few, we have:

  • Cassandra
  • HBase
  • MongoDB
  • CouchDB
  • Accumulo
  • HCatalog
  • Hive
  • Sqoop
  • Flume          and many more!

OK, now if we look at the NoSQL databases (that's what we call the databases handling unstructured data in Big Data) and the different tools mentioned above, most of them (with a few exceptions) are written in Java, including Hadoop. So as a programmer, if you want to go in depth into the architecture and APIs, core Java is the recommended programming language that will help you grasp the technology in a better and more efficient way.

Now, if I say that core Java is recommended, that doesn't imply that people who don't know Java have no scope in the field. Big Data is all about managing data more efficiently and more intelligently.

So people who have knowledge of data warehousing get a plus point here. Managing large amounts of data and playing around with its volume, velocity, variety and complexity is the work of a Big Data scientist.

Apart from a data warehousing background, people with experience in machine learning, theory of computation, and sentiment analysis are contributing a lot to this world.

So it would be unfair to say who can and who cannot work in Big Data technology. It's an emerging field where most of us can lay our hands and contribute to its growth and development. And yes, the thing I like most about being in Big Data is that most of the tools are open source, so I can play around with the source code :)

For any query feel free to mail me

Friday, November 28, 2014

Hello Hadoop: Welcoming Parallel Processing

Even the largest computers struggle with complex problems that have a lot of variables and large data sets. Imagine if one person had to sort through 26,000 boxes of large balls, each containing sets of 1,000 balls marked with one letter of the alphabet: the task would take days. But if you separated the contents of the 1,000-unit boxes into 10 smaller equal boxes and asked 10 separate people to work on these smaller tasks, the job would be completed 10 times faster. This notion of parallel processing is one of the cornerstones of many Big Data projects.
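The toy version of that idea fits in a few lines of shell: split the input, let independent workers count their shard in parallel, then merge the partial counts. This is a local stand-in for the map/combine/reduce pattern, not Hadoop itself:

```shell
# Split work across parallel workers, then merge partial results.
work=$(mktemp -d)
printf 'a\nb\na\nc\na\nb\n' > "$work/balls.txt"    # the "balls" to sort
split -n l/2 "$work/balls.txt" "$work/shard_"      # divide into 2 shards
for s in "$work"/shard_*; do
  sort "$s" | uniq -c | awk '{print $2, $1}' > "$s.count" &   # each worker counts its shard
done
wait
# merge: sum the per-shard counts for each letter
awk '{n[$1] += $2} END {for (k in n) print k, n[k]}' "$work"/shard_*.count | sort
```

Each worker touches only its own shard, so adding workers (or machines) scales the counting step almost linearly; only the final merge sees all the partial results.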

Apache Hadoop (named after the creator Doug Cutting’s child’s toy elephant) is a free programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop is part of the Apache project sponsored by the Apache Software Foundation and although it originally used Java, any programming language can be used to implement many parts of the system.

Hadoop was inspired by Google’s Map-Reduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any computer connected in an organised group called a cluster. Hadoop makes it possible to run applications on thousands of individual computers involving thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and enables the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of computers become inoperative.

Saturday, November 15, 2014

Why use Hadoop? Any other solution to Large-Scale Data ?

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems can work with. Hundreds of gigabytes of data constitute the low end of Hadoop-scale; Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, the input data set will likely not even fit on a single computer's hard drive, much less in memory. So Hadoop includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. As a result, the problem is processed in parallel using all of the machines in the cluster, and output results are computed as efficiently as possible.

Challenges at Large Scale

Performing large-scale computation is difficult. To work with this volume of data requires distributing parts of the problem to multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a single-machine environment, failure is not something that program designers explicitly worry about very often: if the machine has crashed, then there is no way for the program to recover anyway.

In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Multiple implementations or versions of client software may speak slightly different protocols from one another. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress. Of course, actually providing such resilience is a major software engineering challenge.

Different distributed systems specifically address certain modes of failure, while worrying less about others. Hadoop provides no security model, nor safeguards against maliciously inserted data. For example, it cannot detect a man-in-the-middle attack between nodes. On the other hand, it is designed to handle hardware failure and data congestion issues very robustly. Other distributed systems make different trade-offs, as they intend to be used for problems with other requirements (e.g., high security).

In addition to worrying about these sorts of bugs and challenges, there is also the fact that the compute hardware has finite resources available to it. The major resources include:

* Processor time
* Memory
* Hard drive space
* Network bandwidth

Individual machines typically only have a few gigabytes of memory. If the input data set is several terabytes, then this would require a thousand or more machines to hold it in RAM -- and even then, no single machine would be able to process or address all of the data.
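The arithmetic behind that claim, with illustrative numbers (the dataset size and per-machine RAM below are assumptions for the sketch, not fixed figures):

```shell
# How many machines would it take to hold a multi-terabyte input in RAM?
dataset_tb=4                 # assumed input size
ram_per_machine_gb=4         # "a few gigabytes of memory" per machine
machines=$(( dataset_tb * 1024 / ram_per_machine_gb ))
echo "$machines machines just to hold the input in RAM"
```

Even before any computation happens, the cluster size is dictated by where the data can physically live, which is why Hadoop streams from disk rather than assuming the input fits in memory.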

Hard drives are much larger; a single machine can now hold multiple terabytes of information on its hard drives. But intermediate data sets generated while performing a large-scale computation can easily fill up several times more space than what the original input data set had occupied. During this process, some of the hard drives employed by the system may become full, and the distributed system may need to route this data to other nodes which can store the overflow.

Finally, bandwidth is a scarce resource even on an internal network. While a set of nodes directly connected by a gigabit Ethernet may generally experience high throughput between them, if all of the machines were transmitting multi-gigabyte data sets, they can easily saturate the switch's bandwidth capacity. Additionally if the machines are spread across multiple racks, the bandwidth available for the data transfer would be much less. Furthermore RPC requests and other data transfer requests using this channel may be delayed or dropped.

To be successful, a large-scale distributed system must be able to manage the above mentioned resources efficiently. Furthermore, it must allocate some of these resources toward maintaining the system as a whole, while devoting as much time as possible to the actual core computation.

Synchronization between multiple machines remains the biggest challenge in distributed system design. If nodes in a distributed system can explicitly communicate with one another, then application designers must be cognizant of risks associated with such communication patterns. It becomes very easy to generate more remote procedure calls (RPCs) than the system can satisfy! Performing multi-party data exchanges is also prone to deadlock or race conditions. Finally, the ability to continue computation in the face of failures becomes more challenging. For example, if 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Of course, this will require re-computing any work lost on the unavailable node. Furthermore, if a complex communication network is overlaid on the distributed infrastructure, then determining how best to restart the lost computation and propagating this information about the change in network topology may be non trivial to implement.

Friday, November 14, 2014

Pig: the Silent Hero

SQL on Hadoop has been extensively covered in the media in the last year. Pig, being a well-established technology, has been largely overlooked, though Pig as a Service was a noteworthy development. Considering Hadoop as a data platform, however, requires Pig and an understanding of why and how it is important.

Data users are generally trained to query data with SQL, a declarative language, for reporting, analytics and ad-hoc exploration. SQL does not describe how the data is processed, which appeals to many data users. ETL (Extract, Transform and Load) processes, which are developed by data programmers, benefit from and sometimes even require the ability to spell out the data transformation steps, so ETL programmers often prefer a procedural language over a declarative one. Pig's programming language, Pig Latin, is procedural and gives programmers control over every step of the processing.

Business users and programmers work on the same data set yet usually focus on different stages. Programmers commonly work on the whole ETL pipeline: they are responsible for cleaning and extracting the raw data, transforming it, and loading it into third-party systems. Business users either access data on those third-party systems or analyze and aggregate the extracted and transformed data. Diverse tooling is therefore important, because the interaction patterns with the same data set are diverse.

Importantly, complex ETL workflows need management, extensibility and testability to ensure stable and reliable data processing, and Pig provides strong support on all three aspects. Pig jobs can be scheduled and managed with workflow tools like Oozie to build and orchestrate large-scale, graph-like data pipelines. Pig achieves extensibility with UDFs (User Defined Functions), which let programmers add functions written in one of many programming languages.

The benefit of this model is that any kind of special functionality can be injected, while Pig and Hadoop manage the distribution and parallel execution of the function on potentially huge data sets in an efficient manner. This lets programmers focus on solving specific domain problems, such as rectifying data set anomalies or converting data formats, without worrying about the complexity of distributed computing.

Reliable data pipelines require testing before deployment in production to ensure the correctness of the numerous data transformation and combination steps, and Pig has features supporting easy, testable development. Pig supports unit tests, an interactive shell, and a local mode that executes programs without requiring a Hadoop cluster. Programmers can use these to test their Pig programs in detail with test data sets before they ever enter production, and to try out ideas quickly and inexpensively, which is essential for fast development cycles.

None of these features is particularly glamorous, yet they are important when evaluating Hadoop and data processing with it. The choice of leveraging Pig for a big data project can easily make the difference between success and failure.

Wednesday, November 12, 2014

Hadoop Distributed File System (HDFS)

**** BLOCKS ****
Every filesystem has a block size: the minimum amount of data that it can read or write. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes.
The Hadoop filesystem, by contrast, uses a much larger block size: 64 MB by default.

Now a question arises: why are the blocks in Hadoop so big?

To start with, HDFS is meant to handle large files. If you have small files, smaller block sizes are better; if you have large files, larger block sizes are better.
For large files, let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to get that file (one request per block). In HDFS, those requests go across a network and come with a lot of overhead, and each request has to be processed by the Namenode to figure out where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, greatly reducing the overhead and the load on the Namenode.
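The arithmetic above can be checked with a small sketch (the class and method names are my own, for illustration only, and are not part of Hadoop):

```java
// Sketch: compare the number of block requests for a 1000 MB file
// under a 4 KB block size versus HDFS's default 64 MB block size.
public class BlockCount {
    // Number of blocks needed, rounding the last partial block up.
    static long blocksNeeded(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 1000L * 1024 * 1024;                            // 1000 MB
        System.out.println(blocksNeeded(fileSize, 4L * 1024));          // 256000
        System.out.println(blocksNeeded(fileSize, 64L * 1024 * 1024));  // 16
    }
}
```

Each of those block lookups is a round trip to the Namenode, which is why the larger block size wins for big files.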

**** Namenodes and Datanodes ****
A Hadoop cluster has two types of nodes, operating in a master-slave pattern:
1) A Namenode (Master)
2) 'N' Datanodes (Slave)
* N is the number of machines.

The Namenode manages the filesystem namespace. It holds the filesystem tree and the metadata for all files and directories within the tree. This information is stored persistently on the local disk in two forms:
1) The namespace image
2) The edit log

NOTE : The Namenode also knows the datanodes on which all the blocks for a given file are stored. However, it does not persist the block locations; this information is reconstructed from the datanodes' block reports every time the system starts.
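The split between persisted and rebuilt metadata can be sketched in plain Java (an illustrative model only, not Hadoop's actual implementation; all names here are made up):

```java
import java.util.*;

// Sketch: the Namenode persists the file -> block mapping, but the
// block -> datanode mapping lives only in memory and is rebuilt from
// datanode block reports at startup.
public class NamespaceSketch {
    // Persisted with the namespace image: which blocks make up each file.
    static final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // Held only in memory: where each block currently lives.
    static final Map<String, Set<String>> blockToDatanodes = new HashMap<>();

    // A datanode "block report": the blocks one datanode is storing.
    static void receiveBlockReport(String datanode, List<String> blocks) {
        for (String b : blocks) {
            blockToDatanodes.computeIfAbsent(b, k -> new HashSet<>()).add(datanode);
        }
    }

    public static void main(String[] args) {
        fileToBlocks.put("/logs/a.txt", Arrays.asList("blk_1", "blk_2"));
        receiveBlockReport("dn1", Arrays.asList("blk_1"));
        receiveBlockReport("dn2", Arrays.asList("blk_1", "blk_2"));
        // The set of datanodes currently holding blk_1.
        System.out.println(blockToDatanodes.get("blk_1"));
    }
}
```

This is also why losing the Namenode's disk loses the filesystem, while losing a datanode only loses replicas that can be re-reported by others.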

A client accesses the filesystem by communicating with both the Namenode and the Datanodes.

Datanodes store and retrieve blocks whenever they are asked to by a client or by the Namenode, and they report back to the Namenode periodically with the list of blocks they are storing.
Without the Namenode, the filesystem (in other words, the blocks on the datanodes that contain our data) cannot be accessed. The Namenode was a single point of failure up to Hadoop 1.x; starting with Hadoop 2.x there is a different Namenode architecture that addresses this.

The Secondary Namenode, however, only checkpoints the namespace and cannot take over when the Namenode crashes, so it does not solve the single point of failure of the Hadoop cluster.

In the upcoming post, I will explain more about the recent work about this issue.

Sunday, November 9, 2014

Let's start with Map-Reduce

As this is my first post regarding Map-Reduce programming, I will try to focus on the important points of the Map-Reduce framework.

Map-Reduce works by breaking the processing into two phases. 
1) The Map phase
2) The Reduce phase
Each phase (i.e., the map phase and the reduce phase) has key-value pairs as input and output.
The types of the key-value pairs are decided by the programmer as per the requirements.
The programmer also specifies a function for each phase: the map function and the reduce function, respectively.
The map function is just the data-preparation phase, setting up the data in such a way that the reduce function can do its work on it.
Note : The map function is a good place to drop bad records.
The output from the map function is processed by the Map-Reduce framework before being sent to the reduce function.
Note : This processing sorts and groups the key-value pairs by key.
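This sort-and-group step can be sketched in plain Java (an illustrative stand-in for the framework's shuffle, not its actual implementation):

```java
import java.util.*;

// Sketch of what the framework's sort/group step does: collect the
// mappers' (key, value) pairs, then group all values by key, in sorted
// key order, before handing them to the reduce function.
public class ShuffleSketch {
    static SortedMap<String, List<Integer>> groupByKey(
            List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
            new AbstractMap.SimpleEntry<>("cat", 1),
            new AbstractMap.SimpleEntry<>("dog", 1),
            new AbstractMap.SimpleEntry<>("cat", 1));
        // The reducer sees each key once, with all its values together.
        System.out.println(groupByKey(mapOutput)); // {cat=[1, 1], dog=[1]}
    }
}
```

The reduce function is then called once per key, receiving the key and the full list of its values.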
The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
The map() method also provides an instance of Context to write the output.
The reduce function likewise has four formal type parameters that specify its input and output types.
Note : The input types of the reduce function must match the output types of the map function.
Job - A Job object forms the specification of the job and gives you control over how the job is run. When we run the job on a Hadoop cluster, we package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specifying the name of the JAR file, we can pass a class to the job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.
In the Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory, or a file pattern.
Note : addInputPath() can be called more than once to use input from multiple paths.
Note : The MultipleInputs class supports Map-Reduce jobs that have multiple input paths with a different InputFormat and Mapper for each path.
The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn’t exist before running the job, because Hadoop will complain and not run the job.

Tuesday, November 4, 2014

Hashtable : A most useful Java class for the metadata management of Hadoop

Hashtables are an extremely useful mechanism for storing data. Hashtables work by mapping a key to a value, which is stored in an in-memory data structure. Rather than searching through all elements of the hashtable for a matching key, a hashing function analyses a key, and returns an index number. This index matches a stored value, and the data is then accessed. This is an extremely efficient data structure, and one all programmers should remember.
Hashtables are supported by Java, in the form of the java.util.Hashtable class. Hashtables accept any Java object as keys and values. You can use a String, for example, as a key, or perhaps a number such as an Integer. However, you can't use a primitive data type, so you'll need to use the wrapper classes Character, Integer, Long, etc. instead.
   // Use an Integer as a wrapper for an int
   Integer integer = new Integer ( i );
   hash.put( integer, data);
Data is placed into a hashtable through the put method, and can be accessed using the get method. It's important to know the key that maps to a value, otherwise it's difficult to get the data back. If you want to process all the elements in a hashtable, you can always ask for an Enumeration of the hashtable's keys. The get method returns an Object, which can then be cast back to the original object type.
   // Get all values with an enumeration of the keys
   for (Enumeration e = hash.keys(); e.hasMoreElements();) {
       String str = (String) hash.get( e.nextElement() );
       System.out.println (str);
   }
To demonstrate hashtables, I've written a little demo that adds one hundred strings to a hashtable. Each string is indexed by an Integer, which wraps the int primitive data type.  Individual elements can be returned, or the entire list can be displayed. Note that hashtables don't store keys sequentially, so there is no ordering to the list.

import java.util.*;

public class hash {
  public static void main (String args[]) throws Exception {
    // Start with ten, expand by ten when limit reached
    Hashtable hash = new Hashtable(10, 10);

    // Add one hundred strings, keyed by Integer
    for (int i = 0; i < 100; i++) {
      Integer integer = new Integer(i);
      hash.put(integer, "Number : " + i);
    }

    // Get individual values out again
    System.out.println(hash.get(new Integer(5)));
    System.out.println(hash.get(new Integer(21)));

    // Get all values
    for (Enumeration e = hash.keys(); e.hasMoreElements();) {
      System.out.println(hash.get(e.nextElement()));
    }
  }
}

Thursday, September 4, 2014

Benchmark Testing of Hadoop Cluster with TestDFSIO

This blog post will help newcomers to Hadoop learn about measuring the performance of a Hadoop cluster.
In this article I will give the details of an important benchmarking tool that is included in the Apache Hadoop distribution: TestDFSIO, one of the standard industry benchmarks in use today.


Before we start, a few things should be installed and configured on your system. You need to configure a Hadoop cluster; for that, download Hadoop from here.
Configure either of these two:
i) Single-node mode.
ii) Multi-node cluster mode.


The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress-testing HDFS, discovering performance bottlenecks in your network, shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes), and getting a first impression of how fast your cluster is in terms of I/O.

The source code and documentation of the benchmark can be found here.

Run write tests before read tests :

Always perform the write test in HDFS before the read test.
The read test of TestDFSIO does not generate its own input files. For this reason, it is convenient to run a write test first and then follow up by reading the same data.

Run a write test :

The command to run a write test that generates 10,000 output files of 1 GB each is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10000 -fileSize 1000

Run a read test :

The command to read the 10,000 generated output files of 1 GB each is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10000 -fileSize 1000

Cleaning up and removing test data :

$ hadoop jar hadoop-*test*.jar TestDFSIO -clean

This will clear the data in the directory /benchmarks/TestDFSIO on HDFS.

TestDFSIO results :

----- TestDFSIO ----- : write
           Date & time: Fri Apr 08 2011
       Number of files: 10000
Total MBytes processed: 10000000
     Throughput mb/sec: 4.989
Average IO rate mb/sec: 5.185
 IO rate std deviation: 0.960
    Test exec time sec: 1813.53

----- TestDFSIO ----- : read
           Date & time: Fri Apr 08 2011
       Number of files: 10000
Total MBytes processed: 10000000
     Throughput mb/sec: 11.349
Average IO rate mb/sec: 22.341
 IO rate std deviation: 119.231
    Test exec time sec: 1144.842

Here, the most notable metrics are Throughput mb/sec and Average IO rate mb/sec. Both are based on the file size written (or read) by the individual map tasks and the elapsed time to do so.
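A sketch of how these two metrics relate to the per-task measurements (the formulas below follow the benchmark's reported definitions as I understand them; the sample numbers are invented):

```java
// Sketch of how TestDFSIO's two headline metrics are derived from
// per-map-task file sizes (MB) and elapsed times (seconds).
// The sample numbers below are made up for illustration.
public class DfsioMetrics {
    // Throughput mb/sec: total data moved divided by total task time.
    static double throughput(double[] sizesMb, double[] timesSec) {
        double totalSize = 0, totalTime = 0;
        for (double s : sizesMb) totalSize += s;
        for (double t : timesSec) totalTime += t;
        return totalSize / totalTime;
    }

    // Average IO rate mb/sec: mean of each task's individual rate.
    static double averageIoRate(double[] sizesMb, double[] timesSec) {
        double sum = 0;
        for (int i = 0; i < sizesMb.length; i++) sum += sizesMb[i] / timesSec[i];
        return sum / sizesMb.length;
    }

    public static void main(String[] args) {
        double[] sizes = {1000, 1000};  // two tasks, 1 GB each
        double[] times = {200, 100};    // seconds
        System.out.println(throughput(sizes, times));    // 2000/300
        System.out.println(averageIoRate(sizes, times)); // (5 + 10)/2 = 7.5
    }
}
```

The two numbers differ when task rates vary, which is also why a large IO rate standard deviation (as in the read results above) signals uneven performance across tasks.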

I will come up with more articles on Hadoop. Thanks!

Sunday, August 31, 2014

Going through the Cloud

Various higher-education institutions face increasing challenges due to shrinking revenues, budget restrictions and limited funds for R&D. This leads to serious issues, for example regarding the stipends of research scholars.
Some colleges are looking toward business-transformation strategies by organizing workshops and conferences, which brings an ideological shift toward an enterprise culture, addressing the people and process sides of that paradigm. On the technology side, one of the most significant potential avenues for transformation is the adoption of cloud computing to reduce costs, boost performance and productivity, and increase revenue. But unlike other countries, India still lags far behind in setting up cloud infrastructure in its institutions.

Basically the complete transformation has to address all three elements; but I will focus here on the technology aspect. Cloud computing is an effective and efficient technology that can provide education and prepare students as per the latest requirements of the job market, everything at a low cost and with no reduction of quality in the scope of education.

Most of the educational institutions of India host their IT services on premises. Shifting away from this model to the cloud, be it IaaS or PaaS, would allow them to focus on their core business, education, instead of having to allocate resources for things such as technical support, storage, help desks, e-labs and e-assessments. All the institutions could settle for a “pay as you go” model and share computational loads across colleges, which would remove the need to host and maintain dedicated infrastructures.

Though cloud computing can bring many benefits to higher education institutions, some issues still need to be considered and addressed, among which are:
·     Security and the degree to which an institution is willing to relinquish control over that security
·     The legal issues surrounding data sharing
·     The service provider’s ability to ensure privacy controls and protect data ownership
·     The service provider’s ability to provide adequate and satisfactory support
·     The continuity and availability of data

Cloud computing is an attractive option for cost-conscious institutions seeking to reduce financial and environmental outlays while transforming education. This technological advancement can make knowledge available to entire communities, bridge the social and economic divide and prepare future generations of students and teachers to face the challenges of the global economy.

Sunday, February 16, 2014

Sound Processing in MATLAB

In this semester, I am pursuing the subject "Speech Processing", a really fascinating subject. I like it in every way, as my B.Tech final-year project was actually related to speech processing (see my previous blog posts).

Here is a little experiment with sound processing in MATLAB. Hope you will like it.

What is digital sound data?

Getting a pre-recorded sound files (digital sound data)

Click here to access a .wav file

Download any .wav file from here name it anything. In my case , let it be 'road.wav'

Loading Sound files into MATLAB

·        I want to read the digital sound data from the .wav file into an array in our MATLAB workspace.  I can then listen to it, plot it, manipulate it, etc.  Use the following command at the MATLAB prompt (audioread is assumed here; older MATLAB releases used wavread):

[road, fs] = audioread('road.wav');

·       The array road now contains the stereo sound data and fs is the sampling frequency.  This data is sampled at the same rate as that on a music CD (fs=44,100 samples/second).

·       See the size of road:

size(road)    % rows are samples, columns are the two stereo channels

·       The left and right channel signals are the two columns of the road array:

left = road(:,1);
right = road(:,2);


·       Let’s plot the left data versus time.  Note that the plot will look solid because there are so many data points and the screen resolution can’t show them all.  This picture shows you where the signal is strong and weak over time.

time = (1/fs) * (0:length(left)-1);   % time axis in seconds
plot(time, left)
xlabel('time (sec)');
ylabel('relative signal strength');

·       Let’s plot a small portion so you can see some details (the first 10,000 samples, about 0.23 seconds, is an arbitrary choice):

plot(time(1:10000), left(1:10000))
xlabel('time (sec)');
ylabel('relative signal strength');

·       Let’s listen to the data (plug in your headphones).  Click on the speaker icon in the lower right-hand corner of your screen to adjust the volume.  Enter the commands below one at a time.  Wait until the sound from one command stops before you enter another sound command!

soundsc(left,fs)       % plays left channel as mono

soundsc(right,fs)    % plays right channel mono (sounds nearly the same)

soundsc(road,fs)     % plays stereo