Friday, December 5, 2014

Hive: How to install it on top of Hadoop in Ubuntu



What is Apache Hive?

Apache Hive is a data warehouse infrastructure for querying and managing large data sets that reside in a distributed storage system. It is built on top of Hadoop and was originally developed at Facebook. Hive lets you query the data using a SQL-like query language called HiveQL (Hive Query Language).

Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.
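Once Hive is installed (see the steps further down), you can look at this translation yourself. The sketch below is only an illustration: it wraps Hive's EXPLAIN statement, which prints the execution plan for a query instead of running it. The function name explain_query and the table page_views are made up for this example.

```shell
# explain_query QUERY: ask Hive for the execution plan of QUERY without running it.
# Assumes the hive command is on your PATH.
explain_query() {
  hive -e "EXPLAIN $1"
}

# Example call; "page_views" is a hypothetical table that must already exist:
# explain_query "SELECT country, COUNT(*) FROM page_views GROUP BY country"
```

For a query with a GROUP BY like the one in the comment, the plan should list the stages Hive intends to run on Hadoop.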

Difference between Hive and SQL?

Hive looks very similar to a traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are some key differences:

Hadoop is intended for long sequential scans, and because Hive is built on Hadoop, its queries have very high latency. This means Hive is not appropriate for applications that need the fast response times you can expect from a traditional RDBMS.

Hive is also read-based, and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.




Hive Installation on Ubuntu:

Follow the steps below to install Apache Hive on Ubuntu:

Step 1:  Download Hive tar.

Download the latest Hive release (this post uses 0.14.0) from the Apache Hive downloads page.

Step 2: Untar the file.
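A minimal sketch of the untar step, in case you prefer the terminal. The function name extract_hive and the archive name in the example call are assumptions; use the name of the file you actually downloaded.

```shell
# extract_hive ARCHIVE DEST: unpack a gzipped Hive tarball into the directory DEST.
extract_hive() {
  archive="$1"
  dest="$2"
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  # -x extract, -z gunzip, -f archive file, -C target directory
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example call (the file name is an assumption, adjust it to your download):
# extract_hive "$HOME/Downloads/hive-0.14.0-bin.tar.gz" "$HOME"
```

Whatever folder the archive unpacks to is the one HIVE_HOME should point at in the next step.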

Step 3: Edit the “.bashrc” file in your home directory to update the environment variables for your user. You do not need sudo for this, since the file belongs to you.

   $gedit ~/.bashrc

Add the following at the end of the file:

export HADOOP_HOME=/home/user/hadoop-2.4.0
export HIVE_HOME=/home/user/hive-0.14.0-bin
export PATH=$PATH:$HIVE_HOME/bin
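export PATH=$PATH:$HADOOP_HOME/bin

Save the file, then reload it so the changes take effect in the current terminal:

   $source ~/.bashrc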
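If you want to double-check that the two variables point at real folders, something like the following would do. It is just a sketch; check_dir is a helper name invented for this post.

```shell
# check_dir NAME PATH: report whether PATH is an existing directory.
check_dir() {
  if [ -d "$2" ]; then
    echo "$1 OK: $2"
  else
    echo "$1 MISSING: $2"
    return 1
  fi
}

# Run these in a new terminal to check the variables exported in .bashrc:
# check_dir HADOOP_HOME "$HADOOP_HOME"
# check_dir HIVE_HOME "$HIVE_HOME"
```

A MISSING line usually means a typo in the path or a folder that was extracted somewhere else.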


Step 4:  Create Hive directories within HDFS.

NOTE: If $HADOOP_HOME/bin is not on your PATH yet, run these commands from the bin folder of your Hadoop installation.

$hadoop fs -mkdir -p /user/hive/warehouse

The ‘warehouse’ directory is where Hive stores its tables and their data. The -p option also creates the missing parent directories /user and /user/hive.

$hadoop fs -mkdir -p /tmp

The ‘tmp’ directory is a temporary location for the intermediate results of processing.

Step 5: Set read/write permissions for the directories.

These commands give write permission to the group:

$hadoop fs -chmod 774 /user/hive/warehouse

$hadoop fs -chmod 774 /tmp
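Steps 4 and 5 can also be run in one go. Here is a rough sketch, assuming HDFS is already up and running. The function name prepare_hive_dirs and the HADOOP_CMD variable are my own additions, not part of Hadoop.

```shell
# Command used to talk to HDFS; override HADOOP_CMD if hadoop is not on your PATH.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# prepare_hive_dirs: create the Hive directories in HDFS and open them to the group.
prepare_hive_dirs() {
  for dir in /user/hive/warehouse /tmp; do
    # -p creates missing parent directories and does not fail if the directory exists
    "$HADOOP_CMD" fs -mkdir -p "$dir" || return 1
    # 774 = owner and group can read/write/execute, everyone else can only read
    "$HADOOP_CMD" fs -chmod 774 "$dir" || return 1
  done
}

# Run it once HDFS is started:
# prepare_hive_dirs
```

Because of -p, running it a second time should not complain about directories that are already there.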

Step 6: Set the Hadoop path in Hive's hive-config.sh.

cd $HIVE_HOME/bin
gedit hive-config.sh

In the configuration file, add the following:

export HIVE_CONF_DIR=$HIVE_CONF_DIR
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH
export HADOOP_HOME=/home/user/hadoop-2.4.0
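If you would rather not open an editor, a line can be appended from the terminal. The helper below is only a sketch (append_once is a name I made up); it adds the line only when the file does not already contain it, so running it twice is harmless.

```shell
# append_once FILE LINE: add LINE to the end of FILE unless FILE already has it.
append_once() {
  file="$1"
  line="$2"
  # -q quiet, -x match the whole line, -F treat LINE as plain text, not a pattern
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example with the path used in this post:
# append_once "$HIVE_HOME/bin/hive-config.sh" 'export HADOOP_HOME=/home/user/hadoop-2.4.0'
```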

Step 7: Launch Hive.

Run this from Hive's bin folder (or from anywhere, if $HIVE_HOME/bin is on your PATH):

$hive

Step 8: Test your setup

hive> show tables;

Type this at the hive> prompt, and don't forget the semicolon at the end of the command :P
Press Ctrl+C to exit Hive, or type quit; at the prompt.
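The same check can be run without entering the interactive prompt, using the -e option of the hive command, which executes the quoted statement and returns to the shell. A small sketch (smoke_test_hive is a name invented for this post):

```shell
# smoke_test_hive: run "show tables;" non-interactively if hive is available.
smoke_test_hive() {
  if ! command -v hive >/dev/null 2>&1; then
    echo "hive not found on PATH" >&2
    return 1
  fi
  hive -e 'show tables;'
}

# Run it after the installation steps are done:
# smoke_test_hive
```

If the function reports that hive is not found, go back to Step 3 and check the PATH lines in .bashrc.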


Happy Hadooping! ;) 




