Saturday, April 23, 2016

Running asynchronous job in Resque: How it works :)

Resque is a lightweight, fast, and a very powerful message queuing system used to run Ruby jobs in the background, asynchronously.  It’s generally used when the system asks for better scalability and quick response.
A Resque job is any Ruby class or module with a perform class method.

class MyDesign
  @queue = :myqueue
#Jobs will be placed and picked up from this queue.
# call to queue processing in Resque until later
  def self.perform(seconds)

How the queuing of Redis works:

The real ability of resque comes from its queue,  Redis. Redis is a NoSQL key-value store. Unlike other NoSQL key-value stores, which takes only strings as keys/values, Redis can process lists, sets, hashes, and sorted sets as values, and operates on them atomically.  Resque works on the top of Redis list datatype, where the queue name behaves as a key and list as values.
Jobs of resque are en-queued on the Redis queue, and the designated workers de-queues the same for processing. All these operations are atomic, hence the en-queuers and workers do not bother about the locking and synchronous access. Every element of a list (set, hashes etc.) must be string, and so the data structures are not nested in Redis.

Redis is a very fast, in-memory dataset, and can persist to disk (configurable by time or number of operations), or save operations to a log file for recovery after a re-start, and supports master-slave replication.

Redis owns the command set to read and process the keys and it does not need SQL to inspect its data.

How Queuing with Resque works

Resque stores the job queue in Redis list named “resque:queue:name”, where each element of the list is a hash. Redis also maintains its own management structures,  with an additional “failed” job list.
It keeps all the the operation light.  We don’t need to  pass a lot of data to the job. Instead passing the references to other records, files, etc. is sufficient for the workers to pop and perform the jobs.

When a job is popped from the queue, Resque instantiates an object of its parent class and calls it’s perform method, passing the additional parameters.

Calling external systems with Resque

There are ports of Resque to other languages such as python, C, Java, .NET, node, PHP and Clojure. If your external system is written in one of these languages, then you can start workers listening to their queues. Since you are not talking to a ruby class with arguments, you can set up a placeholder class with the proper queue name. This will allow Resque plugins to fire on enqueue. 

Monday, December 28, 2015

Digging more into NoSQL- Explaining Redis

In this era of Big Data, the first thing we need is an in-memory data structure. In this post, I will elaborate a useful in-memory store, called Redis.
As they say If you can map a use case to Redis and discover you aren't at risk of running out of RAM by using Redis there is a good chance you should probably use Redis.

What is Redis?
A simplest answer can be, its only a data structure server. It can be differentiated from MongoDB, which is a disk-based document store. Although MongoDB can be used as a key-value store too.

Those two additions may seem pretty minor, but they are what make Redis pretty incredible.Persistence to disk means you can use Redis as a real database instead of just a volatile cache. The data won't disappear when you restart, like with memcached.

The additional data types are probably even more important. Key values can be simple strings, like one will find in memcached, but they can also be more complex types like Hashes, Lists (ordered collection, makes a great queue), Sets (unordered collection of non-repeating values), or Sorted Sets (ordered/ranked collection of non-repeating values).

The entire data set in Redis, is stored in-memory so it is extremely fast, often even faster than memcached. Redis had virtual memory, where rarely used values would be swapped out to disk, so only the keys had to fit into memory, but this has been deprecated. Going forward the use cases for Redis are those where its possible (and desirable) for the entire data set to fit into memory.

Redis becomes fantastic choice if you want a highly scalable data store shared by multiple processes, multiple applications, or multiple servers. As just an inter-process communication mechanism it is tough to beat. The fact that you can communicate cross-platform, cross-server, or cross-application just as easily makes it a pretty great choice for many many use cases. Its speed also makes it great as a caching layer.

For more info :

Wednesday, December 23, 2015

How a process is Deamon-ized in Linux

A Linux process works either in foreground or background.

A process running in foreground can interact with the user in front of the terminal. To run a.out in foreground we execute as shown below.
When a process runs in the foreground, it can interact with the users in front of the terminal. To run, we execute the command a.out.
 However, for a background process, it runs without any interaction of any user’s interaction. But obviously a user can check its current status, though he doesn’t know or might be doesn’t need to know what it is doing. 
The command is similar to the other, with some changes:
$ ./a.out  &  
[1] 8756
As shown above when we run a process with '&' at the end, then the process runs in background and returns the process id (8756 in above example).

Now back to the current topic.

What is actually a DAEMON Process?
A 'daemon' process is a process that runs in the background, begins execution at startup 
(not necessarily), runs forever, usually do not die or get restarted, waits for requests to arrive and respond to them and frequently spawn other processes to handle these requests.

So running a process in BACKGROUND with a while loop logic in code to loop forever makes a Daemon ? Yes and also No. But there are certain things to be considered when we create a daemon process. 

Let's follow a step-by-step procedure to create a daemon process.

1. Create a separate child process - fork() it.

fork() system call create a copy of our process(child), then let the parent process exit. Once the parent process exits the Orphaned child process will become the child of init process (this is the initial system process, in other words the parent of all processes). As a result our process will be completely detached from its parent and start operating in background.

pid=fork();    if (pid<0) exit(1); /* fork error */    if (pid>0) exit(0); /* parent exits */    /* child (daemon) continues */  

2. Make child process In-dependent - setsid()

Before we move to check how we are going to make a child process independent, let  talk a bit about Process group and Session ID.

A process group denotes a collection of one or more processes. Process groups are used to control the distribution of signals. A signal directed to a process group is delivered individually to all of the processes that are members of the group. 

Process groups are themselves grouped into sessions. Process groups are not permitted to migrate from one session to another, and a process may only create new process groups belonging to the same session as it itself belongs to. Processes are not permitted to join process groups that are not in the same session as they themselves are.

New process images created by a call to a function of the exec family and fork() inherit the process group membership and the session membership of the parent process image.

A process receives signals from the terminal that it is connected to, and each process inherits its parent's controlling tty. A daemon process should not receive signals from the process that started it, so it must detach itself from its controlling tty.

In Unix , processes operates within a process group, so that all processes within a group is treated as a single entity. Process group or session is also inherited. A daemon process should operate independently from other processes.

setsid() system call is used to create a new session containing a single (new) process group, with the current process as both the session leader and the process group leader of that single process group. 

3. Change current Running Directory - chdir()

A daemon process should run in a known directory. There are many advantages, in fact the opposite has many disadvantages: suppose that our daemon process is started in a user's home directory, it will not be able to find some input and output files. If the home directory is a mounted filesystem then it will even create many issues if the filesystem is accidentally un-mounted.

The root "/" directory may not be appropriate for every server, it should be chosen carefully depending on the type of the server.

4. Close Inherited Descriptors and Standard I/O Descriptors

A child process inherits default standard I/O descriptors and opened file descriptors from a parent process, this may cause the use of resources un-neccessarily. Unnecessary file descriptors should be closed before fork() system call (so that they are not inherited) or close all open descriptors as soon as the child process starts running as shown below.

for ( i=getdtablesize(); i>=0; --i)    
close(i); /* close all descriptors */  
There are three standard I/O descriptors: 
  1. standard input 'stdin' (0),
  2. standard output 'stdout' (1),
  3. standard error 'stderr' (2).

For safety, these descriptors should be opened and connected to a harmless I/O device (such as /dev/null).

int fd; 
fd = open("/dev/null",O_RDWR, 0); 
   if (fd != -1)     
 dup2 (fd, STDIN_FILENO); 
dup2 (fd, STDOUT_FILENO);    
dup2 (fd, STDERR_FILENO);       
 if (fd > 2)   
 close (fd); 

5. Reset File Creation Mask - umask()

Most Daemon processes runs as super-user, for security reasons they should protect files that they create. Setting user mask will prevent unsecure file priviliges that may occur on file creation.
view plainprint?
This will restrict file creation mode to 750 (complement of 027).

Friday, June 5, 2015

How to parse XML using XPath with Nokogiri Ruby : A Begining in Web Crawl

Xpath is a language used to find information in an XML or HTML files. XPath is used to navigate through several attributes or elements in an XML document. XPath can also be used to traverse through an XML file in Ruby. We use Nokogiri, a gem of Ruby for that purpose.

XPath is found to be a very important tool for fetching the relevant information, reading attributes and items in XML file.

Before you start reading this post, I should suggest you to learn a bit about XPath from here.

We will consider the following XML file for the demo, that holds the information of employees. 

 <?xml version="1.0"?>
    <Employee id="1111" type="admin">
    <Employee id="2222" type="admin">
    <Employee id="3333" type="user">
    <Employee id="4444" type="user">

If we go though the code, we can see there are four employees. Attribute-id type, Child nodes - firstname, lastname, age and email.
Lets now start with the code. We will use Nokogiri , a gem of Ruby which provides wonderfulAPI to parse, search the documents via XPath.


Ex 1. Read firstname of all employees
 require 'nokogiri'
f ="employee.xml")
doc = Nokogiri::XML(f)

puts "== First name of all employees"
expression = "Employees/Employee/firstname"
nodes = doc.xpath(expression)

nodes.each do |node|
  p node.text

Output : 


Ex 2: Read firstname of all employees who are older than 40 year
expression = "/Employees/Employee[age>40]/firstname"
nodes = doc.xpath(expression)
nodes.each do |node|
 p "#{ node.text }"



That is for today. I will write more about different process of Web Crawling. It is just a small beginning. Thanks for reading.

Friday, May 8, 2015

Wednesday, May 6, 2015

Parsing a CSV file having "hyphen" separated column value.

One thing I have learnt from past a year or so. If you say you want to know Big Data Analytic, the first thing you should learn is how to parse a CSV ( Comma Separated Value) file.

In this blog, I will explain the how to parse a CSV file having "Hyphen" separated column value. These things are actually asked in the interviews and sometime becomes too tricky.

I will use Java to parse the same.

Here is one file containing various attributes, column and different statistics of Cricket players of various countries. csi-batting.csv

The Question is:

Find the total score of the Afghanistan players in the year of 2010.

For our convenience I will show first few lines of the CSV file here.

** Each paragraph here contains each line in the CSV File"

Afghanistan,Mohammad Shahzad,118,97.52,16-02-2010,Tue,Sharjah CA Stadium,Canada,../Matches/MatchScorecard_ODI.asp?MatchCode=3087

Afghanistan,Mohammad Shahzad,110,99.09,01-09-2009,Tue,VRA Ground,Netherlands,../Matches/MatchScorecard_ODI.asp?MatchCode=3008

Afghanistan,Mohammad Shahzad,100,138.88,16-08-2010,Mon,Cambusdoon New Ground,Scotland,../Matches/MatchScorecard_ODI.asp?MatchCode=3164

Afghanistan,Mohammad Shahzad,82,75.92,10-07-2010,Sat,Hazelaarweg,Netherlands,../Matches/MatchScorecard_ODI.asp?MatchCode=3153

Afghanistan,Mohammad Shahzad,57,100,01-07-2010,Thu,Sportpark Westvliet,Canada,../Matches/MatchScorecard_ODI.asp?MatchCode=3135

Here you will find that, the date column (MatchDate) of the CSV file is hyphen separated. So, the usual to split the columns with (",") will NOT work here.

What can we do then?

Split the columns with (",") and get those in seprated variable and then split that particular MatchDate column with ("-").

Lets see the code here.
© Dipayan Dev


public class A {

public static void main(String args[]) throws FileNotFoundException

/*Put that file according to your wish and change the string */

String csv="C:\\Users\\Dipayan\\Desktop\\odi-batting.csv";  
BufferedReader br=new BufferedReader(new FileReader(csv));

String line=" ";
int sum=0;
int count=0;
int []a=new int[10000];

try {
} catch (IOException e) {
// TODO Auto-generated catch block
try {

String [] f= line.split(","); /* Splitting each column and storing each of them in array f*/ 
String con=f[0];
String date = f[4];
String year = date.split("-")[2]; /* Split the second column using hyphen*/
if (year.equals("2010") && con.equals("Afghanistan")) {
   a[count] = Integer.parseInt(f[2]);
   sum += a[count];
} catch (NumberFormatException | IOException e) {
// TODO Auto-generated catch block


Thursday, February 19, 2015

HDFS Comic : The easiest way to learn Hadoop's File System.

Here is a comic that desribes the various functioning of HDFS.

Disclaimer : This has been taken from another resource.

Joins in Map-Reduce

There are 2 kinds of joins in Map-Reduce - map side join and the reduce side join.

  1. Map-side join - This join happens before the input reaches the map phase. It is suited for 2 scenarios:

One of the inputs is small enough to be fit in memory - Consider the example of some kind of a metadata which needs to be associated with a much larger number of records. In this particular case, the smaller input could be replicated across all the tasktracker nodes in memory and a join could be performed as the bigger input is being read by the mapper.

Both the inputs are sorted and partitioned into equal sizes with the guarantee that records belonging to a key fall in the same partition - Consider the example of outputs coming out of multiple reducer jobs which had equal number of reducers and the same keys emitted. In this case, an index could be built from one of the inputs (key, filename, offset) and it could be looked up as the other input is read.

         2.   Reduce-side join - This join happens at the reducer phase. It places no restrictions on the size of the input, the only disadvantage being that all the data/records (from both the inputs) have to go through the shuffle and sort phase. It works as following : The map phase tags the records with an identifier to distinguish the sources and the parsing logic at the reducer. Records pertaining to the same key reach the same reducer and the reducer takes care of joining, taking care of the fact that records from different source tags need to be parsed and dealt with differently.

Friday, February 6, 2015

Optimum Parameters for Best Practices of Large-Scale File System

 Big Data tools like Hadoop, MapReduce, Hive and Pig etc. can do wonders if used correctly and wisely. We all know the usage of these tools. But there are some points, if followed can take the core efficiency out of these tools. 

 Choosing the number of map and reduce tasks for a job is important.

a. If each tasks takes less than 30-40 seconds, reduce the number of tasks. The task setup and scheduling overhead is a few seconds, so if tasks finish very quickly, you are wasting time while not doing work. In simple words, your task is under loaded. Better increase the task load and utilize it to the fullest. Another option can be the reuse of JVM. The JVM spawned by one mapper can be reused by the other one, so that there is no overhead of spawning of an extra JVM.

b. If you are dealing with a huge input data size, for example, suppose 1TB, then consider increasing the block size of the input dataset to 256M or 512M, so that less number of mappers will be spawned. Increasing the number of mappers by decreasing the block size is not a good practice. Hadoop is designed to work on larger amount of data to reduce the disk seek time and increase the computation speed. So always define the HDFS block size larger enough to allow Hadoop to compute effectively.

c. If you have 50 map slots in your cluster, avoid jobs using 51 or 52 mappers, because the first 50 mappers finish at the same time and then the 51st and the 52nd will run before the reducer task can be started. So just increasing the number of mappers to 500, or 1000 or even to 2000 does not speed your job. The mappers will run in parallel according to the map slots available in your cluster. If map slot available is 50 only 50 will run in parallel, others will be in queue, waiting for the map slots to be available.

d. The number of reduce tasks should always be equal less than the reduce slot available in your cluster.

e. Sometime we don’t really use reducers. For example filtering and reduce noise in data. In these cases make sure you set the number of reducers to zero since the sorting and shuffling is an expensive operation.

Friday, January 2, 2015

Real Life Application of HashTable

There are so many situations where you have one piece of information (key) and want a system to give you more information based on that key.

This is the need behind the evolution and implementation of several computing structures such as databases, binary trees, hash tables, and supporting algorithms.

Hash Tables have the following features:

  1. They facilitate a quick and sometimes inexpensive way to retrieve information. They consume little CPU and if small-enough can fit in RAM. The speed is gained from the way the data location is calculated from the key, this is done in almost a linear fashion which outperforms other methods like binary search and linear search
  2. They can store information based on a key.
  3. They are language/technology independent. They don't require special hardware or software to implement them as the majority of programming languages would be sufficient to create the necessary algorithms for them.
  4. They have friendly interface - to get data, you pass a key and to store data, you pass a key in addition to the data.
  5. The theory allows the storage and retrieval of data based on numeric as well as non-numeric key values.
    Hash tables are used in memory during the processing of a program (they can be persisted to disk but that is a different topic), some usages are  :
  6. Facilitation of Associative Arrays - 
  7. Lookup values (states, provinces, etc.). You could load small amounts from database into hash tables for quick lookups (decoding an encoding of data) - This is of paramount effect in large batch jobs for Extract Load and Transform scenarios. It is also very valuable for data validation.
  8.  Data Buffering. You could store frequently used data from a database in a hash table to facilitate quick access.
  9.  Uniqueness Checking - You can use hash tables to ensure that no value is duplicate in a list.
  10. Keyword Recognition - In cases you want to identify if a given text has certain keywords in it or not, instead of checking the database with each value, you could use a hash table
  11.  Decision tables - Large conditional flows may be stored in an array where given a condition id, you could retrieve and execute related code segments (this may be used in interpreted languages).
  12. Game programming could use Hash tables to keep track of player scores, weapons of a player, etc.

Friday, December 26, 2014

Top few NameNode-related Problems

In this blog post I wanted to share few NameNode-related issues that came up frequently in my research:

  • We want High Avalability(HA), but the NameNode is a single point of failure (SPOF). This results in downtime due to hardware failures and user errors. In addition, it is often non-trivial to recover from a NameNode failure, so our Hadoop administrators always need to be on call.
  • We want to run Hadoop with 100% commodity hardware. To run HDFS in production and not lose all our data in the event of a power outage, HDFS requires us to deploy a commercial NAS to which the NameNode can write a copy of its edit log. In addition to the prohibitive cost of a commercial NAS, the entire cluster goes down any time the NAS is down, because the NameNode needs to hard-mount the NAS (for consistency reasons).
  • We need both a NameNode and a Secondary NameNode. We read some documentation that suggested purchasing higher-end servers for these roles (e.g., dual power supplies). We only have 20 nodes in the cluster, so this represents a 15-20% hardware cost overhead with no real value (i.e., it doesn’t contribute to the overall capacity or throughput of the cluster).
  • We have a significant number of files. Even though we have hundreds of nodes in the cluster, the NameNode keeps all its metadata in memory, so we are limited to a maximum of only 50-100M files in the entire cluster. While we can work around that by concatenating files into larger files, that adds tremendous complexity. (Imagine what it would be like if you had to start combining the documents on your laptop into zip files because there was a severe limit on how many files you could have.)
  • We have a relatively small cluster, with only 10 nodes. Due to the DataNode-NameNode block report mechanism, we cannot exceed 100-200K blocks (or files) per node, thereby limiting our 10-node cluster to less than 2M files. While we can work around that by concatenating files into larger files, that adds tremendous complexity.
  • We need much higher performance when creating and processing a large number of files (especially small files). Hadoop is extremely slow.
  • We have had outages and latency spikes due to garbage collection on the NameNode. Although  CMS (concurrent mark and sweep) can be used for garbage collector, the NameNode still freezes occasionally, causing the DataNodes to lose connectivity (i.e., become blacklisted).
  • When we change permissions on a file (chmod 400 arch), the changes do not affect existing clients who have already opened the file. We have no way of knowing who the clients are. It’s impossible to know when the permission changes would really become effective, if at all.
  • We have lost data due to various errors on the NameNode. In one case, the root partition ran out of space, and the NameNode crashed with a corrupted edit log.

Monday, December 15, 2014

The Top 10 Big Data Quotes of All Time

  1. “Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.” – Chris Lynch, Vertica Systems
  2. “Big data is not about the data” – Gary King, Harvard University, making the point that while data is plentiful and easy to collect, the real value is in the analytics.
  3. “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” – Eric Schmidt, of Google, said in 2010.
  4. “Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Gartner Research
  5. “I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding” – Hal Varian, Google
  6.  “You can have data without information, but you cannot have information without data.” Daniel Keys Moran, computer programmer and science fiction author
  7.  “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.” – Atul Butte, Stanford School of Medicine
  8. “Errors using inadequate data are much less than those using no data at all.” Charles Babbage, inventor and mathematician
  9. “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem – he may be able to say what the experiment died of.” – Ronald Fisher, biologist, geneticist and statistician
  10. “Without big data, you are blind and deaf in the middle of a freeway” – Geoffrey Moore, management consultant and theorist

Thursday, December 11, 2014

Mongo-Hadoop Connector: The Last One I Used.

Last week during my project work, I used mongo-hadoop connector to create link table in 

To use mongo-hadoop connector, you need relevant jars in /usr/lib/hadoop, /usr/lib/hive based on your installation.

You can link a collection in MongoDB to a Hive table as below:

id INT,
name STRING,
age INT,
work STRUCT<title:STRING, hours:INT>
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
It will create a persons table in HIVE which will show documents in collection or we can say its a kind of VIEW to the documents in collection.
When you execute:
select * from persons;
It will show all the documents present in collection test.persons.
But one MAJOR PITFALL here is that when you execute:
drop table persons;
It will drop collection in MongoDB which will be a BLUNDER when you link a collection with 500 million records (and I had done that mistake!!)
So, when you link a collection in HIVE, always create a EXTERNAL table like:
id INT,
name STRING,
age INT,
work STRUCT<title:STRING, hours:INT>
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
Using this approach, even when you drop the table in HIVE, your collection in MongoDB will remain intact and you can escape the pain to again import 500 million records.
Sometimes small pointers can make HUGE difference!!

How to Use HDFS Programmatically

While HDFS can be manipulated explicitly through user commands, or implicitly as the input to or output from a Hadoop MapReduce job, you can also work with HDFS inside your own Java applications. (A JNI-based wrapper, libhdfs also provides this functionality in C/C++ programs.)

This section provides a short tutorial on using the Java-based HDFS API. It will be based on the following code listing:

1: import;
2: import;
4: import org.apache.hadoop.conf.Configuration;
5: import org.apache.hadoop.fs.FileSystem;
6: import org.apache.hadoop.fs.FSDataInputStream;
7: import org.apache.hadoop.fs.FSDataOutputStream;
8: import org.apache.hadoop.fs.Path;
10: public class HDFSHelloWorld {
12: public static final String theFilename = “hello.txt”;
13: public static final String message = “Hello, world!\n”;
15: public static void main (String [] args) throws IOException {
17: Configuration conf = new Configuration();
18: FileSystem fs = FileSystem.get(conf);
20: Path filenamePath = new Path(theFilename);
22: try {
23: if (fs.exists(filenamePath)) {
24: // remove the file first
25: fs.delete(filenamePath);
26: }
28: FSDataOutputStream out = fs.create(filenamePath);
29: out.writeUTF(message;
30: out.close();
32: FSDataInputStream in =;
33: String messageIn = in.readUTF();
34: System.out.print(messageIn);
35: in.close();
46: } catch (IOException ioe) {
47: System.err.println(“IOException during operation: ” + ioe.toString());
48: System.exit(1);
49: }
40: }
41: }

This program creates a file named hello.txt, writes a short message into it, then reads it back and prints it to the screen. If the file already existed, it is deleted first.

First we get a handle to an abstract FileSystem object, as specified by the application configuration. The Configuration object created uses the default parameters.

17: Configuration conf = new Configuration();
18: FileSystem fs = FileSystem.get(conf);

The FileSystem interface actually provides a generic abstraction suitable for use in several file systems. Depending on the Hadoop configuration, this may use HDFS or the local file system or a different one altogether. If this test program is launched via the ordinary ‘java classname’ command line, it may not find conf/hadoop-site.xml and will use the local file system. To ensure that it uses the proper Hadoop configuration, launch this program through Hadoop by putting it in a jar and running:

$HADOOP_HOME/bin/hadoop jar yourjar HDFSHelloWorld

Regardless of how you launch the program and which file system it connects to, writing to a file is done in the same way:

28: FSDataOutputStream out = fs.create(filenamePath);
29: out.writeUTF(message);
30: out.close();

First we create the file with the fs.create() call, which returns an FSDataOutputStream used to write data into the file. We then write the information using ordinary stream writing functions; FSDataOutputStream extends the class. When we are done with the file, we close the stream with out.close().

This call to fs.create() will overwrite the file if it already exists, but for sake of example, this program explicitly removes the file first anyway (note that depending on this explicit prior removal is technically a race condition). Testing for whether a file exists and removing an existing file are performed by lines 23-26:

23: if (fs.exists(filenamePath)) {
24: // remove the file first
25: fs.delete(filenamePath);
26: }

Other operations such as copying, moving, and renaming are equally straightforward operations on Path objects performed by the FileSystem.

Finally, we re-open the file for read, and pull the bytes from the file, converting them to a UTF-8 encoded string in the process, and print to the screen:

32: FSDataInputStream in =;
33: String messageIn = in.readUTF();
34: System.out.print(messageIn);
35: in.close();

The method returns an FSDataInputStream, which subclasses Data can be read from the stream using the readUTF() operation, as on line 33. When we are done with the stream, we call close() to free the handle associated with the file.

More information:

Complete JavaDoc for the HDFS API is provided at

Another example HDFS application is available on the Hadoop wiki. This implements a file copy operation.

Sunday, December 7, 2014

Few Common Problems You Face While Installing Hadoop

Some common problems, that a person faces while installing Hadoop. Here are few problems listed below.

1. Problem with ssh configuration.
 error: connection refused to port 22

2. NameNode node not reachable
    error: Retrying to connect

1. Problem with ssh configuration: In this case you may face many kind of errors, but most common one while installing Hadoop is connection refused to port 22. Here you should check if machine on which you are trying to login, should have ssh server installed.
If you are using Ubuntu, you can install ssh server using following command.
   $sudo apt-get install openssh-server
   On CentOs or Redhat you can install ssh server using yum package manager
   $sudo yum install openssh-server
   After you have installed ssh server, make sure you have configured the keys properly and share public key with the machine that you want to login into. If the problem persists then check for configurations of ssh in your machine. you can check configuration in /etc/ssh/sshd_config file. use following command to read this file
   $sudo gedit /etc/ssh/sshd_config
   In this file RSA Authentication should be set to yes, password less authentication also should be yes.
   after this close the file and restart ssh with following command
   $sudo /etc/init.d/ssh restart
   Now your problem should be resolved. Apart from this error you can face one more issue. Even though you have configured keys correctly, ssh is still prompting for password. In that case check if keys are being managed by ssh. For that run following command. your keys should be in 

   $HOME/.ssh folder

 2. If your Namenode is not reachable, first thing you should check is demons running on Namenode machine. you can check that with following command

   This command tells you all java processes running on your machine. If you don't see Namenode in the output list, do the following. Stop the hadoop with following command.
   Format the Namenode using following command
   $HADOOP_HOME/bin/hadoop namenode -format
   start hadoop with following command
   This time Namenode should run. if you are still not able to start namenode. then check for core-site.xml file in conf directory of hadoop with following command
   $gedit $HADOOP_HOME/conf/core-site.xml
   Check for value for property hadoop.tmp.dir. it should be set to a path where user who is trying to run hadoop has write permissions. if you dont want to scratch your head on this set it to $HOME/hadoop_tmp directory. Now save and close this file. Format the namenode again and try starting hadoop again. Things should work this time.

   Thats all for this posts, Please share problems that you are facing, we will try to solve them together. stay tuned for more stuff :)   

Friday, December 5, 2014

Hive : How to install it on top of Hadoop in Ubuntu

What is Apache Hive?

Apache Hive is a data warehouse infrastructure that facilitates querying and managing large data sets which resides in distributed storage system. It is built on top of Hadoop and developed by Facebook. Hive provides a way to query the data using a SQL-like query language called HiveQL(Hive query Language).

Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to Hadoop framework for execution.

Difference between Hive and SQL?

Hive looks very much similar like traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences:

As Hadoop is intended for long sequential scans and Hive is based on Hadoop, you would expect queries to have a very high latency. It means that Hive would not be appropriate for those applications that need very fast response times, as you can expect with a traditional RDBMS database.

Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

Hive Installation on Ubuntu:

Follow the below steps to install Apache Hive on Ubuntu:

Step 1:  Download Hive tar.

Download the latest Hive version from here

Step 2: untar the file.

Step 3: Edit the “.bashrc” file to update the environment variables for user.

   $sudo gedit .bashrc

Add the following at the end of the file:

export HADOOP_HOME=/home/user/hadoop-2.4.0
export HIVE_HOME=/home/user/hive-0.14.0-bin
export PATH=$PATH:$HIVE_HOME/bin

Step 4:  Create Hive directories within HDFS.

NOTE: Run the commands from bin folder of hadoop[installed]

$hadoop fs -mkdir /user/hive/storage

The directory ‘storage’ is the location to store the table or data related to hive.

$hadoop fs -mkdir /tmp

The temporary directory ‘tmp’is the temporary location to store the intermediate result of processing.

Step 5: Set read/write permissions for table.

In this command we are giving written permission to the group:

$hadoop fs -chmod 774  /user/hive/warehouse

$hadoop fs -chmod 774  /tmp

Step 6:  Set Hadoop path in Hive

cd hadoop // my current directory where hadoop is stored.
cd hive*-bin
cd bin
sudo gedit

In the configuration file , add the following

export HADOOP_HOME=/home/user/hadoop-2.4.0

Step 7: Launch Hive.

***[run from bin of Hive] 

Command: $hive

Step 8: Test your setup

$show tables; 

Don't forget to put semicolon after this command :P
Press Ctrl+C to exit Hive 

Happy Hadooping! ;) 

Wednesday, December 3, 2014

Books and Related Stuffs You Need to Get Started with Hadoop

In my last few blogs, I provided the basic knowledge of Hadoop and HDFS. Few on Mapreduce too. The emails I got from various readers of the blogs are appreciating. Many of the readers got attracted to Big Data and Hadoop Technology. 

For a further help, I  would like to let you know about a few of the books and web resources from where you can start reading the same. This blog is dedicated to the same.

So if you are reading my blog articles and interested in learning Hadoop, you must be familiar about the power of Big Data and why people are going gaga over this Big Data.

You can refer to these small articles about Big Data ,HDFS and Mapreduce.

You may like to read about Pre-requisites for getting started with Big Data Technologies to get yourself started with Big Data technologies.

Now the main topic of the blog.

The first book i would recommend you guys out there will be: Hadoop The Definitive Guide 3rd Edition by Tom White. I started my Big Data Journey with this book and believe me it is the best resource for you if you are naive in the Big Data World. The book is elegantly written to understand the concept topic-wise. It also gives you an Example of Wearther Dataset which is carried almost through out the book to help you understand how things go in hadoop.

The second book I like reading and which is also very helpful is: Hadoop in Practice by Alex Holmes. Hadoop in Practice collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs.

The third one which is written real simpl will be: Hadoop in Action by Chuck Lam. Hadoop in Action teaches readers how to use Hadoop and write MapReduce programs. The intended readers are programmers, architects, and project managers who have to process large amounts of data offline. Hadoop in Action will lead the reader from obtaining a copy of Hadoop to setting it up in a cluster and writing data analytic programs. 
Note: this book uses old Hadoop API

And lastly if you are more into administrative side you can go for Hadoop's Operations by Eric Sammer. Along with the development this book talks mainly about administrating and maintenance of huge clusters for large data-set in the production environment. Eric Sammer, Principal Solution Architect at Cloudera, shows you the particulars of running Hadoop in production, from planning, installing, and configuring the system to providing ongoing maintenance.

Well these are the books that you can refer for your understanding and better conceptual visualization and practical Hands-on of working with Hadoop Farmework

Apart from these books if want to go for the API, you can see Hadoop API Docs here and also very useful is: Data-Intensive Text Processing with MapReduce

Hope you will find these books and resources helpful to understand in-depth of Hadoop and its power.

If you have any question or you want any specific tutorial on Hadoop you can go request for the same in the email address. I will try to get back to you as soon as possible :)


Tuesday, December 2, 2014

How to Install Hadoop on Ubuntu or any other Linux Platform

Following are the steps for installing Hadoop. I have just listed the steps with very brief explanation at some places. This is more or less like some reference notes for installation. I made a note of this when I was installing Hadoop on my system for the very first time.

Please let me know if you need any specific details.

Installing HDFS (Hadoop Distributed File System)
OS : Ubuntu

Installing Sun Java on Ubuntu
$sudo apt-get update
$sudo apt-get install oracle-java7-installer
$sudo update-java-alternatives -s java-7-oracle

Create hadoop user

$sudo addgroup hadoop
$sudo adduser —ingroup hadoop hduser

Install SSH Server if not already present. This is needed as hadoop does an ssh into localhost for execution.

$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Installing Hadoop

Download hadoop from Apache Downloads.

Download link for latest hadoop 2.6.0 can be found  here
Download hadoop-2.6.0.tar.gz from the link.
mv hadoop-2.6.0  hadoop

Edit .bashrc

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Add Hadoop bin/ directory to PATH


We need only to update the JAVA_HOME variable in this file. Simply you will open this file using a text editor using the following command:

$gedit /home/hduser/hadoop/conf/

Add the following

export JAVA_HOME=/usr/lib/jvm/java-6-sun

Temp directory for hadoop

$mkdir /home/hduser/tmp

Configurations for hadoop

cd home/hduser/hadoop/conf/

Then add the following configurations between <configuration> .. </configuration> xml elements:


<description>A base for other temporary directories.</description>

<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri’s scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri’s authority is used to
determine the host, port, etc. for a filesystem.</description>


Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:

<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.

Formatting NameNode

You should format the NameNode in your HDFS. You should not do this step when the system is running. It is usually done once at first time of your installation.

Run the following command

$/home/hduser/hadoop/bin/hadoop namenode -format

Starting Hadoop Cluster

From hadoop/bin


To check for processes running use:


If jps is not installed, do the following

sudo update-alternatives --install /usr/bin/jps jps /usr/lib/jvm/jdk1.6/bin/jps 1

Tasks running should be as follows:


NOTE : This is for single node setup.If you configure it for cluster node setup, the demons will be shown in the specific serves.

Stopping Hadoop Cluster

From hadoop/bin


Example Application to test success of hadoop:

Follow my this post to test whether it is successfully configured or not :)

For any other query, feel free to comment in the below thread. :)

Happy Hadooping.