Friday, November 14, 2014

Pig: the Silent Hero

SQL on Hadoop has been extensively covered in the media in the last year. Pig, being a well-established technology, has been largely overlooked though Pig as a Service was a noteworthy development. Considering Hadoop as a data platform though requires Pig and an understanding why and how it is important.  Data users are generally trained in using SQL, a declarative language, to query for data for reporting, analytic and ad-hoc explorations. SQL does not describe how the data is processed; it is more declarative and appeals to a lot of data users. ETL(Extract, Transform and Load)  processes, which are developed by data programmers, benefit and sometimes even require the ability to detail the data transformation steps. At times ETL programmers like a procedural language as opposed to a declarative language. Pig’s programming language, Pig Latin, is procedural and gives programmers control over every step of the processing.  Business users and programmers work on the same data set yet usually focus on different stages. The programmers commonly work on the whole ETL pipeline, i.e. they are responsible to clean and extract the raw data, transform it and load it into third party systems. Business users either access data on third party systems or access the extracted and transformed data for analysis and aggregation. The requirement of diverse tooling is therefore important as the interaction patterns with the same data set are divers.  Importantly, complex ETL workflows need management, extensibility, and test-ability to ensure stable and reliable data processing. Pig provides strong support on all aspects. Pig jobs can be scheduled and managed with workflow tools like Oozie to build and orchestrate large scale, graph-like data pipelines.  Pig achieves extensibility with UDFs (User Defined Function), which let programmers add functions written in one of many programming languages. The benefit of this model is that any kind of special functionality can be injected and that Pig and Hadoop manage the distribution and parallel execution of the function on potentially huge data sets in an efficient manner. This allows the programmers to focus on adding and solving specific domain problems, e.g. like rectifying specific data set anomalies or converting data formats, without worrying about the complexity of distributed computing.  Reliable data pipelines require testing before deployment in production to ensure correctness of the numerous data transformation and combination steps. Pig has features supporting easy and testable development of data pipelines. Pig supports unit tests, an interactive shell, and the option to run in a local mode, which allows it to execute programs in a fashion not requiring a Hadoop cluster. Programmers can use these to test their Pig programs in detail with test data sets before they ever enter production and also help them try out ideas quickly and inexpensively, which is essential for fast development cycles.  None of these features are particularly glamorous yet they are important to evaluate Hadoop and data processing with it. The choice of leveraging Pig for a big data project can easily make the difference between success and failure.  

No comments: