SQL on Hadoop has been extensively covered in the media over the last year. Pig, being a well-established technology, has been largely overlooked, though Pig as a Service was a noteworthy development. Evaluating Hadoop as a data platform, however, requires Pig and an understanding of why and how it is important. Data users are generally trained to query data with SQL, a declarative language, for reporting, analytics, and ad-hoc exploration. SQL describes what data to return rather than how the data is processed, which is what appeals to many data users.
ETL (Extract, Transform, and Load) processes, which are developed by data programmers, benefit from and sometimes even require the ability to spell out the individual data transformation steps. ETL programmers therefore often prefer a procedural language over a declarative one. Pig's programming language, Pig Latin, is procedural and gives programmers control over every step of the processing.
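A minimal sketch of what that step-by-step control looks like (the paths, field names, and events data set here are hypothetical, chosen only for illustration):

-- extract: load raw click events from HDFS
events  = LOAD '/data/raw/events' USING PigStorage('\t')
          AS (user_id:chararray, url:chararray, bytes:long);
-- clean: drop records that have no user id
cleaned = FILTER events BY user_id IS NOT NULL;
-- transform: aggregate traffic per user
grouped = GROUP cleaned BY user_id;
traffic = FOREACH grouped GENERATE group AS user_id, SUM(cleaned.bytes) AS total_bytes;
-- load: write the result for downstream consumers
STORE traffic INTO '/data/out/traffic_per_user' USING PigStorage('\t');

Every intermediate relation (events, cleaned, grouped, traffic) is a named, explicit step under the programmer's control, whereas a single declarative SQL statement would leave those steps to the engine.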
Business users and programmers work on the same data set yet usually focus on different stages. The programmers commonly work on the whole ETL pipeline, i.e. they are responsible for extracting and cleaning the raw data, transforming it, and loading it into third-party systems. Business users either access data on those third-party systems or access the extracted and transformed data for analysis and aggregation. Diverse tooling is therefore a real requirement, because the interaction patterns with the same data set are diverse. Importantly, complex ETL workflows need management, extensibility, and testability to ensure stable and reliable data processing. Pig provides strong support in all three areas. Pig jobs can be scheduled and managed with workflow tools like Oozie to build and orchestrate large-scale, graph-like data pipelines.
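As a rough sketch of what that orchestration looks like, a single Pig step could be wrapped in an Oozie workflow action along the following lines (the workflow name, script name, and parameters are placeholders, not taken from a real deployment):

<workflow-app name="etl-pipeline" xmlns="uri:oozie:workflow:0.4">
  <start to="clean-step"/>
  <action name="clean-step">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean.pig</script>
      <param>INPUT=${rawDir}</param>
      <param>OUTPUT=${cleanDir}</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>

Additional actions, forks, and joins can be chained from the ok transition, which is how the graph-like pipelines mentioned above are assembled.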
Pig achieves extensibility with user-defined functions (UDFs), which let programmers add functions written in one of many programming languages. The benefit of this model is that any kind of special functionality can be injected, while Pig and Hadoop manage the distribution and parallel execution of the function efficiently, even on potentially huge data sets. This lets programmers focus on solving specific domain problems, e.g. rectifying data set anomalies or converting data formats, without worrying about the complexity of distributed computing.
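To make this concrete, a hypothetical Java UDF that normalizes messy country codes might be registered and applied like this (the jar, class, paths, and fields are invented for the example):

REGISTER 'datafixes.jar';
DEFINE NormalizeCountry com.example.pig.NormalizeCountry();

users = LOAD '/data/raw/users' USING PigStorage('\t')
        AS (user_id:chararray, country:chararray);
-- Pig ships the registered jar to the cluster and runs the UDF in parallel on every record
fixed = FOREACH users GENERATE user_id, NormalizeCountry(country) AS country;
STORE fixed INTO '/data/clean/users';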
Reliable data pipelines require testing before deployment in production to ensure the correctness of the numerous data transformation and combination steps. Pig has features that support easy, testable development of data pipelines: unit tests with PigUnit, the interactive Grunt shell, and a local mode that executes programs without requiring a Hadoop cluster. Programmers can use these to test their Pig programs in detail against test data sets before they ever reach production, and to try out ideas quickly and inexpensively, which is essential for fast development cycles.
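For example, a pipeline can be exercised end to end against a small local test file, either interactively or in batch (the file name and schema below are made up):

$ pig -x local                       # start the Grunt shell in local mode, no cluster needed
grunt> events = LOAD 'test/events.tsv' AS (user_id:chararray, bytes:long);
grunt> DESCRIBE events;              -- inspect the schema
grunt> DUMP events;                  -- print the small test data set
$ pig -x local traffic.pig           # or run a complete script locally in batch

The same script then runs unchanged on the cluster once the -x local flag is dropped, and PigUnit can assert the contents of individual relations in automated unit tests.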
None of these features is particularly glamorous, yet they are important when evaluating Hadoop and data processing with it. The choice to leverage Pig for a big data project can easily make the difference between success and failure.