Monday, December 28, 2015

Digging more into NoSQL - Explaining Redis

In this era of Big Data, an in-memory data store is one of the first things we need. In this post, I will describe a useful in-memory store called Redis.
As they say: if you can map a use case to Redis and discover that you aren't at risk of running out of RAM by using it, there is a good chance you should probably use Redis.

What is Redis?
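The simplest answer is that it is a data structure server. It can be differentiated from MongoDB, which is a disk-based document store, although MongoDB can be used as a key-value store too. Compared with a plain in-memory key-value cache like memcached, Redis adds two things: persistence to disk and additional data types.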

Those two additions may seem pretty minor, but they are what make Redis pretty incredible. Persistence to disk means you can use Redis as a real database instead of just a volatile cache. The data won't disappear when you restart, as it does with memcached.

The additional data types are probably even more important. Key values can be simple strings, like one will find in memcached, but they can also be more complex types like Hashes, Lists (ordered collection, makes a great queue), Sets (unordered collection of non-repeating values), or Sorted Sets (ordered/ranked collection of non-repeating values).

The entire data set in Redis is stored in memory, so it is extremely fast, often even faster than memcached. Redis used to have virtual memory, where rarely used values would be swapped out to disk so that only the keys had to fit into memory, but this has been deprecated. Going forward, the use cases for Redis are those where it's possible (and desirable) for the entire data set to fit into memory.

Redis becomes a fantastic choice if you want a highly scalable data store shared by multiple processes, multiple applications, or multiple servers. As just an inter-process communication mechanism it is tough to beat. The fact that you can communicate cross-platform, cross-server, or cross-application just as easily makes it a pretty great choice for many use cases. Its speed also makes it great as a caching layer.


Wednesday, December 23, 2015

How a process is Daemon-ized in Linux

A Linux process runs either in the foreground or in the background.

A process running in the foreground can interact with the user at the terminal. To run a.out in the foreground, we simply execute ./a.out.
A background process, on the other hand, runs without any user interaction. The user can still check its current status, but does not know, and usually does not need to know, what it is doing.
The command is similar, with an '&' appended:
$ ./a.out  &  
[1] 8756
As shown above, when we run a command with '&' at the end, the process runs in the background and the shell prints the job number and the process ID (8756 in the example above).

Now back to the current topic.

What is actually a DAEMON Process?
A 'daemon' process is a process that runs in the background, usually (but not necessarily) begins execution at startup, runs forever, normally does not die or get restarted, waits for requests to arrive and responds to them, and frequently spawns other processes to handle these requests.

So does running a process in the background, with a while loop in the code that loops forever, make it a daemon? Yes and no. There are certain things to be considered when we create a daemon process.

Let's follow a step-by-step procedure to create a daemon process.

1. Create a separate child process - fork() it.

The fork() system call creates a copy of our process (the child); we then let the parent process exit. Once the parent exits, the orphaned child is adopted by the init process (the initial system process, in other words the ancestor of all processes). As a result our process is completely detached from its parent and starts operating in the background.

pid = fork();
if (pid < 0) exit(1); /* fork error */
if (pid > 0) exit(0); /* parent exits */
/* child (daemon) continues */

2. Make child process Independent - setsid()

Before we look at how to make a child process independent, let's talk a bit about process groups and session IDs.

A process group denotes a collection of one or more processes. Process groups are used to control the distribution of signals. A signal directed to a process group is delivered individually to all of the processes that are members of the group. 

Process groups are themselves grouped into sessions. Process groups are not permitted to migrate from one session to another, and a process may only create new process groups belonging to the same session as it itself belongs to. Processes are not permitted to join process groups that are not in the same session as they themselves are.

New process images created by a call to a function of the exec family and fork() inherit the process group membership and the session membership of the parent process image.

A process receives signals from the terminal that it is connected to, and each process inherits its parent's controlling tty. A daemon process should not receive signals from the process that started it, so it must detach itself from its controlling tty.

In Unix, processes operate within a process group, so that all processes within a group are treated as a single entity. The process group and session are also inherited. A daemon process should operate independently of other processes.

The setsid() system call is used to create a new session containing a single (new) process group, with the current process as both the session leader and the process group leader of that single process group. Note that setsid() fails if the caller is already a process group leader, which is one more reason for the fork() in step 1: the newly created child is never a group leader.

3. Change the Current Working Directory - chdir()

A daemon process should run in a known directory. The alternative has many disadvantages: if our daemon is started in a user's home directory and keeps running from there, it may not be able to find some of its input and output files. If the home directory is on a mounted filesystem, there will be further problems if that filesystem is accidentally unmounted.

The root "/" directory may not be appropriate for every server; the directory should be chosen carefully depending on the type of the server.

4. Close Inherited Descriptors and Standard I/O Descriptors

A child process inherits the standard I/O descriptors and any other open file descriptors from its parent process, which may tie up resources unnecessarily. Unnecessary file descriptors should either be closed before the fork() system call (so that they are not inherited), or all open descriptors should be closed as soon as the child process starts running, as shown below.

for (i = getdtablesize() - 1; i >= 0; --i)
    close(i); /* close all descriptors */
There are three standard I/O descriptors: 
  1. standard input 'stdin' (0),
  2. standard output 'stdout' (1),
  3. standard error 'stderr' (2).

For safety, these descriptors should be opened and connected to a harmless I/O device (such as /dev/null).

int fd;
fd = open("/dev/null", O_RDWR, 0);
if (fd != -1) {
    /* only redirect when /dev/null was opened successfully */
    dup2(fd, STDIN_FILENO);
    dup2(fd, STDOUT_FILENO);
    dup2(fd, STDERR_FILENO);
    if (fd > 2)
        close(fd);
}

5. Reset File Creation Mask - umask()

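Most daemon processes run as super-user, so for security reasons they should protect the files they create. Setting the file creation mask with umask() prevents insecure file privileges on newly created files.
For example, calling umask(027) restricts the creation mode to at most 750 (the complement of 027): directories get at most 750, and regular files, which are normally created without execute bits, at most 640.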

Friday, June 5, 2015

How to parse XML using XPath with Nokogiri Ruby : A Beginning in Web Crawl

XPath is a language used to find information in XML or HTML files. It is used to navigate through the elements and attributes of an XML document. XPath can also be used to traverse an XML file in Ruby. We use Nokogiri, a Ruby gem, for that purpose.

XPath is a very important tool for fetching the relevant information and for reading attributes and items in an XML file.

Before you start reading this post, I suggest you learn a bit about XPath first.

We will consider the following XML file for the demo, that holds the information of employees. 

 <?xml version="1.0"?>
    <Employee id="1111" type="admin">
    <Employee id="2222" type="admin">
    <Employee id="3333" type="user">
    <Employee id="4444" type="user">

If we go through the file, we can see there are four employees. Each Employee has the attributes id and type, and the child nodes firstname, lastname, age and email.
Let's now start with the code. We will use Nokogiri, a Ruby gem which provides a wonderful API to parse documents and search them via XPath.


Ex 1. Read firstname of all employees
require 'nokogiri'
f = File.open("employee.xml")
doc = Nokogiri::XML(f)
f.close

puts "== First name of all employees"
expression = "Employees/Employee/firstname"
nodes = doc.xpath(expression)

# print the text content of every matching firstname node
nodes.each do |node|
  p node.text
end



Ex 2: Read firstname of all employees who are older than 40 years
expression = "/Employees/Employee[age>40]/firstname"
nodes = doc.xpath(expression)
nodes.each do |node|
  p node.text
end
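
For readers more at home in Java (the language used in other posts here), the same two XPath expressions can be tried with the JDK's built-in javax.xml.xpath API. This is only a sketch: the class name XPathDemo, the helper select and the sample names and ages are all invented for illustration and are not the contents of the original employee.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    // Invented sample data: these names and ages are placeholders,
    // not the values from the original employee.xml.
    static final String SAMPLE =
          "<Employees>"
        + "<Employee id=\"1111\" type=\"admin\"><firstname>Asha</firstname><age>45</age></Employee>"
        + "<Employee id=\"2222\" type=\"admin\"><firstname>Ben</firstname><age>30</age></Employee>"
        + "<Employee id=\"3333\" type=\"user\"><firstname>Chen</firstname><age>52</age></Employee>"
        + "</Employees>";

    // Evaluate an XPath expression and return the text of every matching node.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // All first names
        System.out.println(select(SAMPLE, "/Employees/Employee/firstname"));
        // Only employees whose age child is greater than 40
        System.out.println(select(SAMPLE, "/Employees/Employee[age>40]/firstname"));
    }
}
```

The predicate [age>40] is evaluated per Employee element, so only the employees whose age child converts to a number above 40 contribute a firstname node.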



That is all for today. I will write more about the different steps of Web Crawling; this is just a small beginning. Thanks for reading.

Friday, May 8, 2015

Wednesday, May 6, 2015

Parsing a CSV file having "hyphen" separated column value.

One thing I have learnt over the past year or so: if you say you want to know Big Data Analytics, the first thing you should learn is how to parse a CSV (Comma Separated Values) file.

In this blog, I will explain how to parse a CSV file that has a "hyphen" separated column value. These things are actually asked in interviews and sometimes become quite tricky.

I will use Java to parse the same.

Here is one file containing various attributes, columns and statistics of cricket players of various countries: odi-batting.csv

The Question is:

Find the total score of the Afghanistan players in the year of 2010.

For our convenience I will show first few lines of the CSV file here.

** Each paragraph here contains one line of the CSV file **

Afghanistan,Mohammad Shahzad,118,97.52,16-02-2010,Tue,Sharjah CA Stadium,Canada,../Matches/MatchScorecard_ODI.asp?MatchCode=3087

Afghanistan,Mohammad Shahzad,110,99.09,01-09-2009,Tue,VRA Ground,Netherlands,../Matches/MatchScorecard_ODI.asp?MatchCode=3008

Afghanistan,Mohammad Shahzad,100,138.88,16-08-2010,Mon,Cambusdoon New Ground,Scotland,../Matches/MatchScorecard_ODI.asp?MatchCode=3164

Afghanistan,Mohammad Shahzad,82,75.92,10-07-2010,Sat,Hazelaarweg,Netherlands,../Matches/MatchScorecard_ODI.asp?MatchCode=3153

Afghanistan,Mohammad Shahzad,57,100,01-07-2010,Thu,Sportpark Westvliet,Canada,../Matches/MatchScorecard_ODI.asp?MatchCode=3135

Here you will find that the date column (MatchDate) of the CSV file is hyphen separated. So the usual split of the columns on "," is NOT enough here, because the year is buried inside the date value.

What can we do then?

Split the columns on "," into separate variables, and then split that particular MatchDate column on "-".

Let's see the code here.
© Dipayan Dev


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class A {

    public static void main(String[] args) throws IOException {

        /* Put the file wherever you wish and change this path accordingly */
        String csv = "C:\\Users\\Dipayan\\Desktop\\odi-batting.csv";

        String line;
        int sum = 0;

        /* try-with-resources closes the reader even if reading fails */
        try (BufferedReader br = new BufferedReader(new FileReader(csv))) {
            while ((line = br.readLine()) != null) {
                String[] f = line.split(","); /* Split each column and store them in array f */
                if (f.length < 5) {
                    continue; /* skip blank or incomplete lines */
                }
                String con = f[0];
                String[] date = f[4].split("-"); /* Split the MatchDate column using hyphen */
                if (date.length < 3) {
                    continue; /* skip a header row or a malformed date */
                }
                String year = date[2];
                if (year.equals("2010") && con.equals("Afghanistan")) {
                    try {
                        sum += Integer.parseInt(f[2]);
                    } catch (NumberFormatException e) {
                        /* score column is not a plain number, ignore this row */
                    }
                }
            }
        }

        System.out.println("Total score of Afghanistan players in 2010: " + sum);
    }
}
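
To check the split-then-split idea without the full file, here is a small sketch that runs the same logic over only the five sample rows quoted above. The class name CsvYearTotal and the method totalScore are made up for this illustration; the column positions (country at index 0, score at index 2, date at index 4) are taken from the sample rows.

```java
import java.util.Arrays;
import java.util.List;

class CsvYearTotal {
    // The five sample rows quoted in this post.
    static final List<String> SAMPLE = Arrays.asList(
        "Afghanistan,Mohammad Shahzad,118,97.52,16-02-2010,Tue,Sharjah CA Stadium,Canada,../Matches/MatchScorecard_ODI.asp?MatchCode=3087",
        "Afghanistan,Mohammad Shahzad,110,99.09,01-09-2009,Tue,VRA Ground,Netherlands,../Matches/MatchScorecard_ODI.asp?MatchCode=3008",
        "Afghanistan,Mohammad Shahzad,100,138.88,16-08-2010,Mon,Cambusdoon New Ground,Scotland,../Matches/MatchScorecard_ODI.asp?MatchCode=3164",
        "Afghanistan,Mohammad Shahzad,82,75.92,10-07-2010,Sat,Hazelaarweg,Netherlands,../Matches/MatchScorecard_ODI.asp?MatchCode=3153",
        "Afghanistan,Mohammad Shahzad,57,100,01-07-2010,Thu,Sportpark Westvliet,Canada,../Matches/MatchScorecard_ODI.asp?MatchCode=3135");

    // Sum the score column for rows matching the given country and year.
    static int totalScore(List<String> lines, String country, String year) {
        int sum = 0;
        for (String line : lines) {
            String[] f = line.split(",");        // first split: columns
            if (f.length < 5) {
                continue;                        // incomplete row
            }
            String[] date = f[4].split("-");     // second split: dd-mm-yyyy
            if (date.length == 3 && f[0].equals(country) && date[2].equals(year)) {
                sum += Integer.parseInt(f[2]);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // 118 + 100 + 82 + 57; the 110 row is from 2009 and is skipped
        System.out.println(totalScore(SAMPLE, "Afghanistan", "2010")); // prints 357
    }
}
```

On the real file the total will of course be larger, since these are only the first few lines.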


Thursday, February 19, 2015

HDFS Comic : The easiest way to learn Hadoop's File System.

Here is a comic that describes the various functions of HDFS.

Disclaimer : This has been taken from another resource.

Joins in Map-Reduce

There are 2 kinds of joins in Map-Reduce - map side join and the reduce side join.

  1. Map-side join - This join happens before the input reaches the map phase. It is suited for 2 scenarios:

One of the inputs is small enough to fit in memory - Consider the example of some kind of metadata which needs to be associated with a much larger number of records. In this particular case, the smaller input can be replicated in memory across all the tasktracker nodes, and the join can be performed as the bigger input is being read by the mapper.

Both the inputs are sorted and partitioned identically, with the guarantee that records belonging to a key fall in the same partition - Consider the example of outputs coming out of multiple reducer jobs which had an equal number of reducers and emitted the same keys. In this case, an index (key, filename, offset) can be built from one of the inputs and looked up as the other input is read.

  2. Reduce-side join - This join happens in the reduce phase. It places no restrictions on the size of the inputs; the only disadvantage is that all the records (from both inputs) have to go through the shuffle and sort phase. It works as follows: the map phase tags each record with an identifier for its source, so that the reducer can tell the sources apart and apply the right parsing logic. Records with the same key reach the same reducer, which takes care of the join, keeping in mind that records with different source tags need to be parsed and dealt with differently.
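
The two strategies can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the idea under simplifying assumptions: records are {key, value} string pairs, a TreeMap stands in for the shuffle and sort phase, and all names (JoinSketch, mapSideJoin, reduceSideJoin, the sample data) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class JoinSketch {
    // Hypothetical sample inputs: a big stream of events and a small role table.
    static List<String[]> sampleEvents() {
        return Arrays.asList(new String[] {"u1", "login"},
                             new String[] {"u2", "logout"},
                             new String[] {"u1", "view"});
    }

    static List<String[]> sampleRoles() {
        return Arrays.asList(new String[] {"u1", "admin"},
                             new String[] {"u3", "user"});
    }

    // Map-side join: load the small input into an in-memory hash map,
    // then join each record of the big input as it is read.
    static List<String> mapSideJoin(List<String[]> small, List<String[]> big) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] rec : small) {
            lookup.put(rec[0], rec[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] rec : big) {
            String match = lookup.get(rec[0]);
            if (match != null) {
                out.add(rec[0] + "," + rec[1] + "," + match);
            }
        }
        return out;
    }

    // Reduce-side join: tag every record with its source ("L" or "R"),
    // group by key (the TreeMap plays the role of shuffle and sort),
    // then pair up left and right records for each key.
    static List<String> reduceSideJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (String[] rec : left) {
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(new String[] {"L", rec[1]});
        }
        for (String[] rec : right) {
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(new String[] {"R", rec[1]});
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> entry : grouped.entrySet()) {
            List<String> lefts = new ArrayList<>();
            List<String> rights = new ArrayList<>();
            for (String[] tagged : entry.getValue()) {
                if (tagged[0].equals("L")) {
                    lefts.add(tagged[1]);
                } else {
                    rights.add(tagged[1]);
                }
            }
            for (String l : lefts) {
                for (String r : rights) {
                    out.add(entry.getKey() + "," + l + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mapSideJoin(sampleRoles(), sampleEvents()));
        System.out.println(reduceSideJoin(sampleEvents(), sampleRoles()));
    }
}
```

Both calls produce the same inner-join result here. The difference is in the cost: the first needs the small input to fit in memory, while the second has to move every record from both inputs through the grouping step.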

Friday, February 6, 2015

Optimum Parameters for Best Practices of Large-Scale File System

Big Data tools like Hadoop, MapReduce, Hive and Pig can do wonders if used correctly and wisely. We all know how these tools are used, but there are some points which, if followed, let us get the most efficiency out of them.

 Choosing the number of map and reduce tasks for a job is important.

a. If each task takes less than 30-40 seconds, reduce the number of tasks. The task setup and scheduling overhead is a few seconds, so if tasks finish very quickly, you are spending a large share of the time not doing work. In simple words, your tasks are under-loaded. It is better to increase the load per task and utilize it to the fullest. Another option is JVM reuse: the JVM spawned for one mapper can be reused by the next one, so that there is no overhead of spawning an extra JVM.

b. If you are dealing with a huge input data size, for example 1TB, then consider increasing the block size of the input dataset to 256M or 512M, so that fewer mappers will be spawned. Increasing the number of mappers by decreasing the block size is not a good practice. Hadoop is designed to work on large amounts of data, to reduce disk seek time and increase computation speed. So always define the HDFS block size large enough to allow Hadoop to compute effectively.

c. If you have 50 map slots in your cluster, avoid jobs using 51 or 52 mappers, because the first 50 mappers finish at about the same time and then the 51st and the 52nd have to run on their own before the reducer tasks can finish. Likewise, just increasing the number of mappers to 500, 1000 or even 2000 does not speed up your job. The mappers run in parallel according to the map slots available in your cluster. If only 50 map slots are available, only 50 mappers will run in parallel; the others will sit in the queue, waiting for map slots to become available.

d. The number of reduce tasks should always be equal to or less than the number of reduce slots available in your cluster.

e. Sometimes we don't really need reducers, for example when filtering data or reducing noise in it. In these cases make sure you set the number of reducers to zero, since sorting and shuffling are expensive operations.
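
The arithmetic behind points b and c can be sketched as follows. This is a rough model, assuming one mapper per block and tasks of equal duration; real split sizing and scheduling have more knobs. The names TaskMath, mapTasks and waves are invented for this illustration.

```java
class TaskMath {
    static final long MB = 1L << 20;
    static final long TB = 1L << 40;

    // Approximate number of map tasks, assuming one mapper per block.
    static long mapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Number of sequential rounds ("waves") needed to run the tasks
    // when at most `slots` of them can run in parallel.
    static long waves(long tasks, long slots) {
        return (tasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(mapTasks(TB, 256 * MB)); // 4096 mappers for 1TB at 256M blocks
        System.out.println(mapTasks(TB, 512 * MB)); // 2048 mappers for 1TB at 512M blocks
        System.out.println(waves(50, 50));          // 1 wave: every mapper runs at once
        System.out.println(waves(52, 50));          // 2 waves: the second runs only 2 mappers
    }
}
```

With 52 mappers on 50 slots the map phase takes roughly twice as long as with 50 mappers, even though only two extra tasks were added.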

Friday, January 2, 2015

Real Life Application of HashTable

There are so many situations where you have one piece of information (key) and want a system to give you more information based on that key.

This is the need behind the evolution and implementation of several computing structures such as databases, binary trees, hash tables, and supporting algorithms.

Hash Tables have the following features:

  1. They facilitate a quick and usually inexpensive way to retrieve information. They consume little CPU and, if small enough, can fit in RAM. The speed is gained from the way the data location is calculated directly from the key: on average this takes near-constant time, which outperforms other methods like binary search (logarithmic time) and linear search (linear time).
  2. They can store information based on a key.
  3. They are language/technology independent. They don't require special hardware or software to implement them as the majority of programming languages would be sufficient to create the necessary algorithms for them.
  4. They have friendly interface - to get data, you pass a key and to store data, you pass a key in addition to the data.
  5. The theory allows the storage and retrieval of data based on numeric as well as non-numeric key values.
Hash tables are used in memory during the processing of a program (they can be persisted to disk, but that is a different topic). Some usages are:
  1. Facilitation of associative arrays.
  2. Lookup values (states, provinces, etc.) - You could load small amounts of data from a database into hash tables for quick lookups (decoding an encoding of data). This has a great effect in large batch jobs for Extract, Transform and Load scenarios. It is also very valuable for data validation.
  3. Data buffering - You could store frequently used data from a database in a hash table to facilitate quick access.
  4. Uniqueness checking - You can use hash tables to ensure that no value is duplicated in a list.
  5. Keyword recognition - When you want to identify whether a given text has certain keywords in it or not, instead of checking the database with each value, you could use a hash table.
  6. Decision tables - Large conditional flows may be stored in an array where, given a condition id, you could retrieve and execute the related code segments (this may be used in interpreted languages).
  7. Game programming - Hash tables could be used to keep track of player scores, weapons of a player, etc.
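
A few of these usages can be sketched with Java's standard hash-based collections, HashMap and HashSet. The class name HashTableUses, its method names and the sample values are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HashTableUses {
    // Lookup values: a small code table loaded once and queried by key.
    static Map<String, String> stateTable() {
        Map<String, String> table = new HashMap<>();
        table.put("CA", "California");
        table.put("NY", "New York");
        return table;
    }

    // Decode a code into its full name, or "UNKNOWN" if the key is absent.
    static String decode(Map<String, String> table, String code) {
        return table.getOrDefault(code, "UNKNOWN");
    }

    // Uniqueness checking: collect every value that appears more than once.
    // HashSet.add returns false when the value was already present.
    static Set<String> duplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new LinkedHashSet<>(); // keeps first-repeat order
        for (String v : values) {
            if (!seen.add(v)) {
                dups.add(v);
            }
        }
        return dups;
    }

    // Keyword recognition: does the text contain any of the keywords?
    // Words are compared in lower case, so the keyword set should be lower case too.
    static boolean containsKeyword(String text, Set<String> keywords) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (keywords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(decode(stateTable(), "CA"));
        System.out.println(duplicates(java.util.Arrays.asList("a", "b", "a", "c", "b")));
        Set<String> keywords = new HashSet<>(java.util.Arrays.asList("redis", "hadoop"));
        System.out.println(containsKeyword("Redis is an in-memory store", keywords));
    }
}
```

Each check costs one hash lookup per key or word, so none of these helpers needs to scan a list or go back to the database for every value.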