Thursday, December 11, 2014

Mongo-Hadoop Connector: The Last One I Used.

Last week during my project work, I used the mongo-hadoop connector to create a linked table in HIVE.

To use the mongo-hadoop connector, you need the relevant JARs (typically mongo-hadoop-core, mongo-hadoop-hive and the mongo-java-driver) in /usr/lib/hadoop and /usr/lib/hive, depending on your installation.
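If you would rather not copy JARs into the lib directories, you can also register them per session from the Hive shell. A minimal sketch, assuming the JARs sit under /usr/lib/hive/lib; the paths and version numbers are examples, so adjust them to your setup:

-- register the connector JARs for this Hive session only
ADD JAR /usr/lib/hive/lib/mongo-hadoop-core-1.3.0.jar;
ADD JAR /usr/lib/hive/lib/mongo-hadoop-hive-1.3.0.jar;
ADD JAR /usr/lib/hive/lib/mongo-java-driver-2.12.4.jar;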

You can link a MongoDB collection to a Hive table as shown below:


CREATE TABLE persons
(
id INT,
name STRING,
age INT,
work STRUCT<title:STRING, hours:INT>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.persons');
This creates a persons table in HIVE that exposes the documents in the collection; in other words, it acts as a kind of VIEW over the documents in the collection.
When you execute:
select * from persons;
it will show all the documents present in the collection test.persons.
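The mapped struct fields can be queried like ordinary columns as well. A hypothetical query, assuming the documents actually carry these fields:

-- work.title resolves to the nested MongoDB field job.position via the mapping
select name, work.title from persons where age > 30;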
But one MAJOR PITFALL here is that when you execute:
drop table persons;
it will drop the collection in MongoDB too, which is a BLUNDER when the linked collection holds 500 million records (and I made that mistake!!)
So, when you link a collection in HIVE, always create an EXTERNAL table like:
CREATE EXTERNAL TABLE persons
(
id INT,
name STRING,
age INT,
work STRUCT<title:STRING, hours:INT>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","work.title":"job.position"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.persons');
With this approach, even when you drop the table in HIVE, your collection in MongoDB remains intact, and you escape the pain of importing 500 million records all over again.
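Before dropping a linked table, it is worth confirming how it was created. A quick check using plain Hive (nothing connector-specific here):

describe formatted persons;

In the output, Table Type should read EXTERNAL_TABLE; if it says MANAGED_TABLE, dropping the table will take the MongoDB collection down with it.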
Sometimes small pointers can make a HUGE difference!!
