Friday, December 26, 2014

Top few NameNode-related Problems

In this blog post I want to share a few NameNode-related issues that came up frequently in my research:

  • We want High Availability (HA), but the NameNode is a single point of failure (SPOF). This results in downtime due to hardware failures and user errors. In addition, it is often non-trivial to recover from a NameNode failure, so our Hadoop administrators always need to be on call.
  • We want to run Hadoop with 100% commodity hardware. To run HDFS in production and not lose all our data in the event of a power outage, HDFS requires us to deploy a commercial NAS to which the NameNode can write a copy of its edit log. In addition to the prohibitive cost of a commercial NAS, the entire cluster goes down any time the NAS is down, because the NameNode needs to hard-mount the NAS (for consistency reasons).
  • We need both a NameNode and a Secondary NameNode. We read some documentation that suggested purchasing higher-end servers for these roles (e.g., dual power supplies). We only have 20 nodes in the cluster, so this represents a 15-20% hardware cost overhead with no real value (i.e., it doesn’t contribute to the overall capacity or throughput of the cluster).
  • We have a significant number of files. Even though we have hundreds of nodes in the cluster, the NameNode keeps all its metadata in memory, so we are limited to a maximum of only 50-100M files in the entire cluster. While we can work around that by concatenating files into larger files, that adds tremendous complexity. (Imagine what it would be like if you had to start combining the documents on your laptop into zip files because there was a severe limit on how many files you could have.)
  • We have a relatively small cluster, with only 10 nodes. Due to the DataNode-NameNode block report mechanism, we cannot exceed 100-200K blocks (or files) per node, thereby limiting our 10-node cluster to less than 2M files. While we can work around that by concatenating files into larger files, that adds tremendous complexity.
  • We need much higher performance when creating and processing a large number of files (especially small files). Hadoop is extremely slow.
  • We have had outages and latency spikes due to garbage collection on the NameNode. Although the CMS (concurrent mark-sweep) garbage collector can be used, the NameNode still freezes occasionally, causing the DataNodes to lose connectivity (i.e., become blacklisted).
  • When we change permissions on a file (chmod 400 arch), the changes do not affect existing clients who have already opened the file. We have no way of knowing who the clients are. It’s impossible to know when the permission changes would really become effective, if at all.
  • We have lost data due to various errors on the NameNode. In one case, the root partition ran out of space, and the NameNode crashed with a corrupted edit log.
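
To get a feel for why the in-memory metadata caps the file count, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block); that figure and the class name NameNodeHeapEstimate are assumptions for illustration, not measurements from any particular cluster.

```java
class NameNodeHeapEstimate {
    // Assumption: ~150 bytes of heap per namespace object (rule of thumb only).
    static final long BYTES_PER_OBJECT = 150L;

    /** Estimates NameNode heap in bytes for `files` files with `blocksPerFile` blocks each. */
    public static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // one entry per file plus one per block
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long[] fileCounts = {2_000_000L, 50_000_000L, 100_000_000L};
        for (long files : fileCounts) {
            double gib = estimateHeapBytes(files, 1) / (1024.0 * 1024.0 * 1024.0);
            System.out.printf("%,d single-block files -> about %.1f GiB of heap%n", files, gib);
        }
    }
}
```

Under these assumptions, 100M small files already need on the order of 30 GB of heap for metadata alone, which is why the file count, rather than raw disk capacity, becomes the limit.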

Monday, December 15, 2014

The Top 10 Big Data Quotes of All Time

  1. “Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.” – Chris Lynch, Vertica Systems
  2. “Big data is not about the data” – Gary King, Harvard University, making the point that while data is plentiful and easy to collect, the real value is in the analytics.
  3. “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” – Eric Schmidt, Google, 2010
  4. “Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Gartner Research
  5. “I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding” – Hal Varian, Google
  6. “You can have data without information, but you cannot have information without data.” – Daniel Keys Moran, computer programmer and science fiction author
  7. “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.” – Atul Butte, Stanford School of Medicine
  8. “Errors using inadequate data are much less than those using no data at all.” – Charles Babbage, inventor and mathematician
  9. “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem – he may be able to say what the experiment died of.” – Ronald Fisher, biologist, geneticist and statistician
  10. “Without big data, you are blind and deaf in the middle of a freeway” – Geoffrey Moore, management consultant and theorist

Thursday, December 11, 2014

Mongo-Hadoop Connector: The Last One I Used.

Last week, during my project work, I used the mongo-hadoop connector to create a linked table in Hive.

To use the mongo-hadoop connector, you need the relevant jars in /usr/lib/hadoop and /usr/lib/hive (or the equivalent directories for your installation).

You can link a collection in MongoDB to a Hive table as below:

CREATE TABLE persons (
  id INT,
  name STRING,
  age INT,
  work STRUCT<title:STRING, hours:INT>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
TBLPROPERTIES('mongo.uri'='mongodb://HOST:PORT/test.persons');
This creates a persons table in Hive that shows the documents in the collection; it is effectively a view over those documents.
When you execute:
select * from persons;
it shows all the documents present in the collection test.persons.
But one MAJOR PITFALL here is that when you execute:
drop table persons;
it also drops the collection in MongoDB, which is a BLUNDER when the linked collection holds 500 million records (and I made that mistake!!)
So, when you link a collection in Hive, always create an EXTERNAL table, like:
CREATE EXTERNAL TABLE persons (
  id INT,
  name STRING,
  age INT,
  work STRUCT<title:STRING, hours:INT>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
TBLPROPERTIES('mongo.uri'='mongodb://HOST:PORT/test.persons');
Using this approach, even when you drop the table in Hive, your collection in MongoDB remains intact, and you are spared the pain of importing 500 million records again.
Sometimes small pointers can make a HUGE difference!!

How to Use HDFS Programmatically

While HDFS can be manipulated explicitly through user commands, or implicitly as the input to or output from a Hadoop MapReduce job, you can also work with HDFS inside your own Java applications. (A JNI-based wrapper, libhdfs, also provides this functionality in C/C++ programs.)

This section provides a short tutorial on using the Java-based HDFS API. It will be based on the following code listing:

1: import java.io.IOException;
4: import org.apache.hadoop.conf.Configuration;
5: import org.apache.hadoop.fs.FileSystem;
6: import org.apache.hadoop.fs.FSDataInputStream;
7: import org.apache.hadoop.fs.FSDataOutputStream;
8: import org.apache.hadoop.fs.Path;
10: public class HDFSHelloWorld {
12:   public static final String theFilename = "hello.txt";
13:   public static final String message = "Hello, world!\n";
15:   public static void main (String [] args) throws IOException {
17:     Configuration conf = new Configuration();
18:     FileSystem fs = FileSystem.get(conf);
20:     Path filenamePath = new Path(theFilename);
22:     try {
23:       if (fs.exists(filenamePath)) {
24:         // remove the file first
25:         fs.delete(filenamePath);
26:       }
28:       FSDataOutputStream out = fs.create(filenamePath);
29:       out.writeUTF(message);
30:       out.close();
32:       FSDataInputStream in = fs.open(filenamePath);
33:       String messageIn = in.readUTF();
34:       System.out.print(messageIn);
35:       in.close();
36:     } catch (IOException ioe) {
37:       System.err.println("IOException during operation: " + ioe.toString());
38:       System.exit(1);
39:     }
40:   }
41: }

This program creates a file named hello.txt, writes a short message into it, then reads it back and prints it to the screen. If the file already exists, it is deleted first.

First we get a handle to an abstract FileSystem object, as specified by the application configuration. The Configuration object created uses the default parameters.

17: Configuration conf = new Configuration();
18: FileSystem fs = FileSystem.get(conf);

The FileSystem interface actually provides a generic abstraction suitable for use in several file systems. Depending on the Hadoop configuration, this may use HDFS or the local file system or a different one altogether. If this test program is launched via the ordinary ‘java classname’ command line, it may not find conf/hadoop-site.xml and will use the local file system. To ensure that it uses the proper Hadoop configuration, launch this program through Hadoop by putting it in a jar and running:

$HADOOP_HOME/bin/hadoop jar yourjar HDFSHelloWorld

Regardless of how you launch the program and which file system it connects to, writing to a file is done in the same way:

28: FSDataOutputStream out = fs.create(filenamePath);
29: out.writeUTF(message);
30: out.close();

First we create the file with the fs.create() call, which returns an FSDataOutputStream used to write data into the file. We then write the information using ordinary stream writing functions; FSDataOutputStream extends the java.io.DataOutputStream class. When we are done with the file, we close the stream with out.close().

This call to fs.create() will overwrite the file if it already exists, but for the sake of example, this program explicitly removes the file first anyway (note that depending on this explicit prior removal is technically a race condition). Testing for whether a file exists and removing an existing file are performed by lines 23-26:

23: if (fs.exists(filenamePath)) {
24: // remove the file first
25: fs.delete(filenamePath);
26: }
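
The race condition mentioned above goes away if the overwrite happens in a single open call instead of a separate check and delete. As a minimal sketch of that idea on the local file system (using java.nio rather than the Hadoop API; the class name OverwriteWithoutCheck and its helpers are made up for this illustration):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OverwriteWithoutCheck {
    /** Writes text to path, replacing any existing content in one open call. */
    public static void overwrite(Path path, String text) {
        try {
            Files.write(path, text.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file, overwrites it twice, and returns the final content. */
    public static String demo() {
        try {
            Path path = Files.createTempFile("hello", ".txt"); // the file exists from here on
            overwrite(path, "first version, with longer text\n");
            overwrite(path, "Hello, world!\n"); // no exists()/delete() step needed
            String content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            Files.delete(path);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In the HDFS program the same effect comes from fs.create() itself, since it overwrites an existing file, so the explicit exists/delete step there is purely illustrative.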

Other operations such as copying, moving, and renaming are equally straightforward operations on Path objects performed by the FileSystem.

Finally, we re-open the file for read, and pull the bytes from the file, converting them to a UTF-8 encoded string in the process, and print to the screen:

32: FSDataInputStream in = fs.open(filenamePath);
33: String messageIn = in.readUTF();
34: System.out.print(messageIn);
35: in.close();

The fs.open() method returns an FSDataInputStream, which subclasses java.io.DataInputStream. Data can be read from the stream using the readUTF() operation, as on line 33. When we are done with the stream, we call close() to free the handle associated with the file.
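
Because FSDataOutputStream and FSDataInputStream build on the standard java.io data streams, the writeUTF()/readUTF() round trip can be tried without a cluster. The sketch below uses plain in-memory streams; the class name WriteUtfRoundTrip is an invention for this example, and nothing here touches HDFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteUtfRoundTrip {
    /** Encodes the message with writeUTF and returns the raw bytes. */
    public static byte[] encode(String message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(message); // 2-byte length prefix, then modified UTF-8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decodes bytes previously produced by encode(). */
    public static String decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hello, world!\n");
        System.out.println("encoded length: " + bytes.length);
        System.out.print(decode(bytes));
    }
}
```

Note that writeUTF() prefixes the string with its length, so a file written this way is not a plain text file; reading it back with readUTF() is what makes the pair symmetric.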

More information:

Complete JavaDoc for the HDFS API is provided at

Another example HDFS application is available on the Hadoop wiki. This implements a file copy operation.

Sunday, December 7, 2014

Few Common Problems You Face While Installing Hadoop

Here are a few common problems that people face while installing Hadoop:

1. Problem with ssh configuration.
 error: connection refused to port 22

2. NameNode not reachable
    error: Retrying to connect

1. Problem with ssh configuration: In this case you may face many kinds of errors, but the most common one while installing Hadoop is connection refused to port 22. First check that the machine you are trying to log in to has an ssh server installed.
If you are using Ubuntu, you can install the ssh server using the following command:
   $sudo apt-get install openssh-server
   On CentOS or Red Hat you can install the ssh server using the yum package manager:
   $sudo yum install openssh-server
   After you have installed the ssh server, make sure you have configured the keys properly and shared your public key with the machine that you want to log in to. If the problem persists, check the ssh configuration on your machine in the /etc/ssh/sshd_config file. Use the following command to open this file:
   $sudo gedit /etc/ssh/sshd_config
   In this file, RSAAuthentication should be set to yes, and passwordless (public key) authentication, PubkeyAuthentication, should also be set to yes.
   After this, close the file and restart ssh with the following command:
   $sudo /etc/init.d/ssh restart
   Now your problem should be resolved. Apart from this error, you can face one more issue: even though you have configured the keys correctly, ssh still prompts for a password. In that case, check that the keys are being managed by ssh; they should be in the $HOME/.ssh folder.

 2. If your NameNode is not reachable, the first thing you should check is which daemons are running on the NameNode machine. You can check that with the following command:
   $jps
   This command lists all the Java processes running on your machine. If you don't see NameNode in the output, do the following. Stop Hadoop.
   Format the NameNode using the following command (formatting erases the HDFS metadata, so do this only on a fresh installation):
   $HADOOP_HOME/bin/hadoop namenode -format
   Start Hadoop again.
   This time the NameNode should run. If you are still not able to start the NameNode, check the core-site.xml file in the conf directory of Hadoop with the following command:
   $gedit $HADOOP_HOME/conf/core-site.xml
   Check the value of the property hadoop.tmp.dir. It should be set to a path where the user who is trying to run Hadoop has write permissions. If you don't want to scratch your head over this, set it to the $HOME/hadoop_tmp directory. Now save and close this file. Format the NameNode again and try starting Hadoop again. Things should work this time.
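
If you would rather verify the write-permission requirement than guess, a small check like the following can help. It is only a sketch: the class name TmpDirCheck is made up, and the default path simply mirrors the $HOME/hadoop_tmp suggestion above. Run it as the same user that starts Hadoop.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TmpDirCheck {
    /** Returns true if dir exists, is a directory, and is writable by the current user. */
    public static boolean isUsableTmpDir(Path dir) {
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }

    public static void main(String[] args) {
        // Assumed default: $HOME/hadoop_tmp; pass another path as the first argument.
        Path dir = args.length > 0
                ? Paths.get(args[0])
                : Paths.get(System.getProperty("user.home"), "hadoop_tmp");
        System.out.println(dir + (isUsableTmpDir(dir) ? " is writable" : " is missing or not writable"));
    }
}
```

If the check reports that the directory is missing or not writable, create it or fix its ownership before formatting the NameNode again.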

   That's all for this post. Please share the problems that you are facing and we will try to solve them together. Stay tuned for more stuff :)

Friday, December 5, 2014

Hive : How to install it on top of Hadoop in Ubuntu

What is Apache Hive?

Apache Hive is a data warehouse infrastructure that facilitates querying and managing large data sets residing in a distributed storage system. It is built on top of Hadoop and was originally developed at Facebook. Hive provides a way to query the data using a SQL-like query language called HiveQL (Hive Query Language).

Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.

Difference between Hive and SQL?

Hive looks very similar to a traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences:

As Hadoop is intended for long sequential scans and Hive is based on Hadoop, you should expect queries to have very high latency. This means that Hive is not appropriate for applications that need the very fast response times you can expect from a traditional RDBMS.

Finally, Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.

Hive Installation on Ubuntu:

Follow the below steps to install Apache Hive on Ubuntu:

Step 1:  Download Hive tar.

Download the latest Hive version from here

Step 2: untar the file.

Step 3: Edit the “.bashrc” file to update the environment variables for user.

   $gedit ~/.bashrc

Add the following at the end of the file:

export HADOOP_HOME=/home/user/hadoop-2.4.0
export HIVE_HOME=/home/user/hive-0.14.0-bin
export PATH=$PATH:$HIVE_HOME/bin

Step 4:  Create Hive directories within HDFS.

NOTE: Run the commands from bin folder of hadoop[installed]

$hadoop fs -mkdir -p /user/hive/warehouse

The directory ‘warehouse’ is the location where Hive stores its tables and related data.

$hadoop fs -mkdir /tmp

The temporary directory ‘tmp’ is the location where intermediate processing results are stored.

Step 5: Set read/write permissions for the directories.

These commands give write permission to the group:

$hadoop fs -chmod 774  /user/hive/warehouse

$hadoop fs -chmod 774  /tmp

Step 6:  Set Hadoop path in Hive

cd hadoop   # my current directory, where hadoop is stored
cd hive*-bin
cd bin
sudo gedit

In the configuration file, add the following:

export HADOOP_HOME=/home/user/hadoop-2.4.0

Step 7: Launch Hive.

***[run from bin of Hive] 

Command: $hive

Step 8: Test your setup

hive> show tables;

Don't forget to put a semicolon after this command :P
Press Ctrl+C to exit Hive 

Happy Hadooping! ;) 

Wednesday, December 3, 2014

Books and Related Stuffs You Need to Get Started with Hadoop

In my last few blog posts, I covered the basics of Hadoop and HDFS, and a little MapReduce too. The emails I got from readers were very appreciative, and many of them have become interested in Big Data and Hadoop technology.

To help further, I would like to point you to a few books and web resources where you can start reading. This post is dedicated to that.

So if you are reading my blog articles and are interested in learning Hadoop, you must already be familiar with the power of Big Data and why people are going gaga over it.

You can refer to these short articles about Big Data, HDFS and MapReduce.

You may also like to read Pre-requisites for getting started with Big Data Technologies.

Now the main topic of the blog.

The first book I would recommend is Hadoop: The Definitive Guide, 3rd Edition, by Tom White. I started my Big Data journey with this book, and believe me, it is the best resource for you if you are new to the Big Data world. The book is elegantly written and explains the concepts topic by topic. It also uses an example weather dataset that is carried through almost the whole book to help you understand how things work in Hadoop.

The second book I like reading and which is also very helpful is: Hadoop in Practice by Alex Holmes. Hadoop in Practice collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs.

The third one, which is written really simply, is Hadoop in Action by Chuck Lam. Hadoop in Action teaches readers how to use Hadoop and write MapReduce programs. The intended readers are programmers, architects, and project managers who have to process large amounts of data offline. Hadoop in Action will lead the reader from obtaining a copy of Hadoop to setting it up in a cluster and writing data analytic programs.
Note: this book uses the old Hadoop API.

And lastly, if you are more into the administrative side, you can go for Hadoop Operations by Eric Sammer. Along with development, this book mainly covers the administration and maintenance of huge clusters for large data sets in a production environment. Eric Sammer, Principal Solution Architect at Cloudera, shows you the particulars of running Hadoop in production, from planning, installing, and configuring the system to providing ongoing maintenance.

Well, these are the books you can refer to for a better conceptual understanding and for practical, hands-on work with the Hadoop framework.

Apart from these books, if you want to dig into the API, you can see the Hadoop API Docs here. Also very useful is Data-Intensive Text Processing with MapReduce.

I hope you will find these books and resources helpful for understanding Hadoop and its power in depth.

If you have any questions or want a specific tutorial on Hadoop, you can request it at the email address. I will try to get back to you as soon as possible :)


Tuesday, December 2, 2014

How to Install Hadoop on Ubuntu or any other Linux Platform

Following are the steps for installing Hadoop. I have just listed the steps with very brief explanation at some places. This is more or less like some reference notes for installation. I made a note of this when I was installing Hadoop on my system for the very first time.

Please let me know if you need any specific details.

Installing HDFS (Hadoop Distributed File System)
OS : Ubuntu

Installing Sun Java on Ubuntu
$sudo apt-get update
$sudo apt-get install oracle-java7-installer
$sudo update-java-alternatives -s java-7-oracle

Create hadoop user

$sudo addgroup hadoop
$sudo adduser --ingroup hadoop hduser

Install SSH Server if not already present. This is needed as hadoop does an ssh into localhost for execution.

$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Installing Hadoop

Download hadoop from Apache Downloads.

The download link for the latest Hadoop release, 2.6.0, can be found here.
Download hadoop-2.6.0.tar.gz from the link, extract it, and rename the extracted directory:
$tar -xzf hadoop-2.6.0.tar.gz
$mv hadoop-2.6.0 hadoop

Edit .bashrc

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin


We only need to update the JAVA_HOME variable in this file. Open it in a text editor using the following command:

$gedit /home/hduser/hadoop/conf/

Add the following

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Temp directory for hadoop

$mkdir /home/hduser/tmp

Configurations for hadoop

cd /home/hduser/hadoop/conf/

Then, in core-site.xml, add the following configurations between the <configuration> .. </configuration> XML elements:


<description>A base for other temporary directories.</description>

<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri’s scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri’s authority is used to
determine the host, port, etc. for a filesystem.</description>


Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:

<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.</description>

Formatting NameNode

You should format the NameNode of your HDFS. Do not do this step while the system is running. It is usually done only once, when you first install Hadoop.

Run the following command

$/home/hduser/hadoop/bin/hadoop namenode -format

Starting Hadoop Cluster

From hadoop/bin


To check the running processes, use:

$jps

If jps is not installed, do the following

sudo update-alternatives --install /usr/bin/jps jps /usr/lib/jvm/jdk1.6/bin/jps 1

Tasks running should be as follows:


NOTE: This is for a single-node setup. If you configure a multi-node cluster, the daemons will be shown on their respective servers.

Stopping Hadoop Cluster

From hadoop/bin


Example Application to test success of hadoop:

Follow this post of mine to test whether Hadoop is successfully configured or not :)

For any other query, feel free to comment in the thread below. :)

Happy Hadooping.

Pre-requisites for getting started with Big Data Technologies

Lots of my friends who have heard about the Big Data world, or who may be interested in getting into it, ask me the same thing: what are the prerequisites for learning, or starting to dig into, Big Data technologies? And which technologies come under Big Data?

Well, this is quite a difficult question to answer, because there is no clear line around what falls under that umbrella. But one thing is for sure: Big Data is not only about Hadoop, a misconception many of us have.

Hadoop is just one framework used in Big Data. Yes, it is used quite a lot, and I would say it is one of the integral parts of Big Data. But besides Hadoop there are tons of tools and technologies that come under the same heading. To name a few, we have:

  • Cassandra
  • HBase
  • MongoDB
  • CouchDB
  • Accumulo
  • HCatalog
  • Hive
  • Sqoop
  • Flume

and many more!

 OK, now if we look at the NoSQL databases (that's what we call databases handling unstructured data in Big Data) and the different tools mentioned above, most of them (with a few exceptions) are written in Java, including Hadoop. So as a programmer, if you want to understand the architecture and the APIs in depth, Core Java is the recommended programming language; it will help you grasp the technology better and more efficiently.

Now, when I say that Core Java is recommended, that doesn't imply that people who don't know Java have no scope here, because Big Data is all about managing data more efficiently and more intelligently.

So people who have knowledge of data warehousing get a plus point here. Managing large amounts of data and working with its volume, velocity, variety and complexity is the work of a Big Data scientist.

Apart from those with a data warehousing background, people with experience in machine learning, theory of computation, and sentiment analysis are contributing a lot in this world.

So it would be unfair to say who can and who cannot work in Big Data technology. It's an emerging field where most of us can try our hand and contribute to its growth and development. And yes, the most important thing, and what I like most about being in Big Data, is that most of the tools are open source. So I can play around with the source code :)

For any query, feel free to mail me.