Pig and Python

Pig is composed of two major parts: a high-level data flow language called Pig Latin, and an engine that parses, optimizes, and executes the Pig Latin scripts as a series of MapReduce jobs that are run on a Hadoop cluster.

Compared to Java MapReduce, Pig is easier to write, understand, and maintain because it is a data transformation language that allows the processing of data to be described as a sequence of transformations.

Pig is also highly extensible through User Defined Functions (UDFs), which allow custom processing to be written in many languages, such as Python.

An example of a Pig application is the Extract, Transform, Load (ETL) process that describes how an application extracts data from a data source, transforms the data for querying and analysis purposes, and loads the result onto a target data store.
Once Pig loads the data, it can perform projections, iterations, and other transformations. UDFs enable more complex algorithms to be applied during the transformation phase. When Pig has finished processing the data, the result can be stored back in HDFS.
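
To give a sense of what a Python UDF for the transformation phase might look like, here is a minimal sketch. The function name normalize, its cleanup logic, and the file path in the comment are hypothetical. The sketch assumes the CPython (streaming_python) flavor of Pig UDFs, where the outputSchema decorator comes from the pig_util module shipped with Pig. The fallback decorator is only a stand-in so the file can be run outside of Pig.

```python
# Hypothetical UDF module, for example saved as udfs/string_udfs.py
# (the path and all function names here are illustrative assumptions).
try:
    # pig_util ships with Pig and provides the decorator for CPython UDFs.
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the module can be imported and tested without Pig installed.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema  # just record the schema string
            return func
        return decorator


@outputSchema('word:chararray')
def normalize(word):
    """Lowercase a term and strip surrounding punctuation; pass nulls through."""
    if word is None:
        return None
    return word.strip('.,;:!?"\'()').lower()
```

In a Pig script, a file like this would typically be made available with a REGISTER ... USING streaming_python AS ... statement, after which the function can be called inside a FOREACH ... GENERATE expression.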

This chapter begins with an example Pig script. Pig and Pig Latin are then introduced and described in detail with examples. The chapter concludes with an explanation of how Pig’s core features can be extended through the use of Python.

WordCount in Pig

Example 3-1 implements the WordCount algorithm in Pig. It assumes that a data file, input.txt, is loaded in HDFS under /user/hduser/input, and that the output will be placed in HDFS under /user/hduser/output.

Example 3-1. pig/wordcount.pig
%default INPUT '/user/hduser/input/input.txt';
%default OUTPUT '/user/hduser/output';
-- Load the data from the file system into the relation records
records = LOAD '$INPUT';
-- Split each line of text and eliminate nesting
terms = FOREACH records GENERATE FLATTEN(TOKENIZE((chararray) $0))
    AS word;
-- Group similar terms
grouped_terms = GROUP terms BY word;
-- Count the number of tuples in each group
word_counts = FOREACH grouped_terms GENERATE COUNT(terms), group;
-- Store the result
STORE word_counts INTO '$OUTPUT';
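
The data flow in Example 3-1 can be traced by hand with a small plain-Python sketch that mirrors each relation in turn. This is only an illustration of the logic, not how Pig executes the script. It assumes each input line contains no tab characters (so $0 is the whole line), approximates TOKENIZE with whitespace splitting, and sorts the result purely to make the output deterministic. The function name word_count is hypothetical.

```python
from collections import defaultdict


def word_count(lines):
    """Mimic the relations of the WordCount script on an in-memory list of lines."""
    # records: one single-field tuple per input line
    records = [(line,) for line in lines]

    # terms: TOKENIZE yields a bag of words per record (approximated here by
    # whitespace splitting); FLATTEN un-nests the bag into one tuple per word
    terms = [(word,) for record in records for word in record[0].split()]

    # grouped_terms: GROUP terms BY word collects the tuples sharing a word
    grouped_terms = defaultdict(list)
    for term in terms:
        grouped_terms[term[0]].append(term)

    # word_counts: COUNT(terms) first, then the group key, as in the script.
    # Sorting is an addition of this sketch; Pig does not guarantee an order.
    return sorted((len(bag), group) for group, bag in grouped_terms.items())


if __name__ == "__main__":
    for count, word in word_count(["jack be nimble", "jack be quick"]):
        print("%d\t%s" % (count, word))
```

Each (count, word) pair corresponds to one tuple that STORE would write out, with the count in the first field because COUNT(terms) precedes group in the GENERATE clause.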

To execute the Pig script, simply call Pig from the command line and pass it the name of the script to run:

$ pig wordcount.pig

While the job is running, a lot of text will be printed to the console. Once the job is complete, a success message, similar to the one below, will be displayed:

2015-09-26 14:15:10,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-09-26 14:15:10,049 [main] INFO org.apache.pig.Main - Pig script completed in 18 seconds and 514 milliseconds (18514 ms)

The results of the wordcount.pig script are displayed in Example 3-2 and can be found in HDFS under /user/hduser/output/pig_wordcount/part-r-00000.
