Tuesday, December 2, 2014

How to Install Hadoop on Ubuntu or any other Linux Platform

The steps for installing Hadoop follow. I have listed them with only brief explanations in places, so treat this as a set of reference notes. I wrote them down while installing Hadoop on my system for the very first time.

Please let me know if you need any specific details.

Installing HDFS (Hadoop Distributed File System)
OS : Ubuntu

Installing Sun Java on Ubuntu
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
$ sudo update-java-alternatives -s java-7-oracle

Create hadoop user

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Install an SSH server if one is not already present. Hadoop needs it because its start scripts ssh into localhost to launch the daemons.

$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
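
The `cat ... >> authorized_keys` step above appends the key again every time it is re-run. Here is a minimal sketch of an idempotent version, assuming a POSIX shell. `append_key_once` is a hypothetical helper name, not part of OpenSSH or Hadoop. It also tightens permissions, since sshd in its default strict mode ignores an authorized_keys file that others can write to.

```shell
# append_key_once PUBKEY_FILE AUTH_FILE   (hypothetical helper)
# Adds the public key to AUTH_FILE only if that exact line is not
# already present, then restricts permissions on the file and its directory.
append_key_once() {
    pub=$1
    auth=$2
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # -x: match whole lines, -F: fixed strings, -f: take patterns from the key file
    if ! grep -qxFf "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    chmod 700 "$(dirname "$auth")"
    chmod 600 "$auth"
}
```

As hduser you would call it as `append_key_once "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"`. After that, `ssh localhost` should log in without asking for a password.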

Installing Hadoop

Download hadoop from Apache Downloads.

The latest release at the time of writing is Hadoop 2.6.0.
Download hadoop-2.6.0.tar.gz, extract it in /home/hduser, and rename the extracted directory:

$ tar -xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0 hadoop

Edit .bashrc

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Update hadoop-env.sh

Only the JAVA_HOME variable needs updating in this file. Open it in a text editor with the following command:

$ gedit /home/hduser/hadoop/etc/hadoop/hadoop-env.sh

Add the following

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
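
If you would rather not edit the file by hand, the same change can be scripted. This is only a sketch, assuming a POSIX shell. `set_java_home` is a hypothetical helper name, and it assumes the JDK path contains no `|` or `&` characters, because the path is substituted into a sed expression.

```shell
# set_java_home ENV_FILE JDK_DIR   (hypothetical helper)
# Rewrites an existing "export JAVA_HOME=..." line in ENV_FILE,
# or appends one if the file has no such line.
set_java_home() {
    file=$1
    java_home=$2
    if grep -q '^export JAVA_HOME=' "$file"; then
        # Write to a scratch file, then copy back so the original
        # file keeps its owner and permissions.
        sed "s|^export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$file" > "$file.new" \
            && cat "$file.new" > "$file" \
            && rm -f "$file.new"
    else
        echo "export JAVA_HOME=$java_home" >> "$file"
    fi
}
```

For example: `set_java_home /home/hduser/hadoop/etc/hadoop/hadoop-env.sh /usr/lib/jvm/java-7-oracle`. Use whichever JDK directory actually exists under /usr/lib/jvm on your machine.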

Temp directory for hadoop

$ mkdir /home/hduser/tmp

Configurations for hadoop

$ cd /home/hduser/hadoop/etc/hadoop/

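Then open core-site.xml and add the following configurations between the <configuration> .. </configuration> XML elements. The first property is hadoop.tmp.dir, set to the temp directory created above; the second is the default file system URI, fs.defaultFS (named fs.default.name in Hadoop 1.x):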


<description>A base for other temporary directories.</description>

<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri’s scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri’s authority is used to
determine the host, port, etc. for a filesystem.</description>
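
For reference, here is one way to generate a complete core-site.xml carrying the two properties described above. It is a sketch under stated assumptions, not the exact file from my machine. `write_core_site` is a hypothetical helper name, and the URI in the usage line (hdfs://localhost:9000, the value used in Apache's single-node guide) is an assumption, so pick whatever free port suits you. The function overwrites any existing core-site.xml in the target directory, so back that up first.

```shell
# write_core_site CONF_DIR TMP_DIR FS_URI   (hypothetical helper)
# Writes CONF_DIR/core-site.xml with hadoop.tmp.dir and fs.defaultFS set.
# Overwrites the file if it already exists.
write_core_site() {
    conf_dir=$1
    tmp_dir=$2
    fs_uri=$3
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>$fs_uri</value>
    <description>The name of the default file system.</description>
  </property>
</configuration>
EOF
}
```

Usage, with the assumed port: `write_core_site /home/hduser/hadoop/etc/hadoop /home/hduser/tmp hdfs://localhost:9000`.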


Open hadoop/etc/hadoop/hdfs-site.xml using a text editor and add the following configuration. The property is dfs.replication; a value of 1 is enough on a single node:

<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.</description>
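
The replication setting can be written out the same way. The sketch below assumes a POSIX shell. `write_hdfs_site` is a hypothetical helper name; it refuses anything that is not a positive integer, and it overwrites an existing hdfs-site.xml.

```shell
# write_hdfs_site CONF_DIR REPLICATION   (hypothetical helper)
# Writes CONF_DIR/hdfs-site.xml with dfs.replication set to REPLICATION.
write_hdfs_site() {
    conf_dir=$1
    replication=$2
    # Reject empty values and anything containing a non-digit.
    case $replication in
        ''|*[!0-9]*)
            echo "replication must be a positive integer" >&2
            return 1
            ;;
    esac
    [ "$replication" -ge 1 ] || return 1
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>$replication</value>
    <description>Default block replication.</description>
  </property>
</configuration>
EOF
}
```

On a single node there is only one DataNode to hold each block, so `write_hdfs_site /home/hduser/hadoop/etc/hadoop 1` is the usual choice.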

Formatting NameNode

Format the NameNode before you start HDFS for the first time. Do not run this step while the system is running. It is normally done only once, at installation, because reformatting erases the existing HDFS metadata.

Run the following command

$ /home/hduser/hadoop/bin/hdfs namenode -format
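
Since formatting should happen only once, a small guard can help. This sketch assumes the default layout, in which the NameNode directory resolves to dfs/name under hadoop.tmp.dir and a successful format leaves a current/VERSION file there. Both function names are hypothetical. `bin/hdfs namenode -format` is the Hadoop 2.x spelling of the format command.

```shell
# namenode_is_formatted TMP_DIR   (hypothetical helper)
# Succeeds if a formatted NameNode directory already exists under TMP_DIR.
# Assumes the default dfs.namenode.name.dir of ${hadoop.tmp.dir}/dfs/name.
namenode_is_formatted() {
    [ -f "$1/dfs/name/current/VERSION" ]
}

# format_once HADOOP_HOME TMP_DIR   (hypothetical wrapper)
# Runs the format command only when no formatted NameNode is found.
format_once() {
    if namenode_is_formatted "$2"; then
        echo "NameNode already formatted under $2; skipping" >&2
        return 0
    fi
    "$1/bin/hdfs" namenode -format
}
```

With the paths used in this post the call would be `format_once /home/hduser/hadoop /home/hduser/tmp`. If you point dfs.namenode.name.dir somewhere else in hdfs-site.xml, the check has to look there instead.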

Starting Hadoop Cluster

From hadoop/sbin (Hadoop 2.x keeps the start and stop scripts in sbin):

$ ./start-dfs.sh
$ ./start-yarn.sh


To check which processes are running, use:

$ jps


If jps is not installed, do the following

$ sudo update-alternatives --install /usr/bin/jps jps /usr/lib/jvm/java-7-oracle/bin/jps 1

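With both HDFS and YARN started, jps should list the following daemons (the process IDs will differ):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager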


NOTE: This is for a single-node setup. If you configure a multi-node cluster, the daemons will show up on their respective servers.
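
The jps check can also be scripted. The sketch below assumes a single-node Hadoop 2.x setup with both HDFS and YARN started, where the expected daemons are NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. `missing_daemons` is a hypothetical helper name. It reads jps output on standard input, so it can be tried on saved output as well.

```shell
# missing_daemons   (hypothetical helper)
# Reads `jps` output on stdin and prints each expected daemon that is
# absent, one per line. Prints nothing when all of them are running.
missing_daemons() {
    out=$(cat)
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <name>"; compare the whole second field so that
        # SecondaryNameNode is never mistaken for NameNode.
        printf '%s\n' "$out" \
            | awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$d"
    done
}
```

Run it as `jps | missing_daemons`. Empty output means every expected daemon is up.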

Stopping Hadoop Cluster

From hadoop/sbin:

$ ./stop-yarn.sh
$ ./stop-dfs.sh


Example Application to test success of hadoop:

Follow this post of mine to check whether Hadoop is configured correctly :)


For any other queries, feel free to comment in the thread below. :)

Happy Hadooping.


Anonymous said...

Thanks Dipayan . This is really helpful.

Dipayan Dev said...

Welcome :) Do comment with your own name.

About Me said...

tip - use syntax highlighter ;) :) (y)

Dipayan Dev said...

Sure ! Thanks for the suggestion. :)

Unknown said...

the directory : /home/hduser/hadoop/conf/
is missing :)

Unknown said...

/home/hduser/hadoop/conf/ is not true :)

Dipayan Dev said...

You can use $HADOOP_HOME/etc/hadoop/
