Introducing pig-herder - A production-ready Pig/MapReduce workflow

Intro

Getting started with Pig/Hadoop on EMR (or any other platform) can be a pretty
daunting task. I found that having a workflow really helps; having everything
laid out up front gets you going much more smoothly.

So today I am releasing pig-herder, a production-ready workflow for Pig/Hadoop on
Amazon EMR.

What Pig-Herder includes

  1. Sane and friendly directory structure (both locally and on S3).
  2. Pig script template.
  3. EMR launcher (with the AWS CLI).
  4. Java UDF template with a working schema and full unit test support.
  5. Compilation instructions to make sure it works on EMR.

The workflow

Directory Structure

Organization is super important in almost everything we do as engineers, but
organizing the directory structure to work with Pig is crucial (at least for me).

Every Pig/Hadoop workflow gets its own directory, and they all share the same structure, which looks like this:

├── README.md
├── bootstrap.sh
├── data
│   └── small-log.log
├── jars
│   ├── mysql-connector-java-5.1.35-bin.jar
│   ├── pig-herder.jar
│   └── piggybank-0.12.0.jar
├── launcher
│   ├── submit-steps.sh
│   └── submit-to-amazon.sh
├── lib
│   └── mysql-connector-java-5.1.35-bin.jar
├── pig-herder-udf
│   ├── internal_libs
│   │   └── date-extractor.jar
│   ├── libs
│   │   ├── hadoop-core-1.2.1.jar
│   │   ├── hamcrest-core-1.3.jar
│   │   ├── junit-4.12.jar
│   │   └── pig-0.12.0.jar
│   ├── main
│   │   ├── main.iml
│   │   └── src
│   │       └── io
│   │           └── avi
│   │               ├── LineParser.java
│   │               ├── LogLineParser.java
│   │               └── QueryStringParser.java
│   ├── out
│   │   ├── artifacts
│   │   │   └── pig_herder
│   │   │       └── pig-herder.jar
│   │   └── production
│   │       ├── main
│   │       │   └── io
│   │       │       └── avi
│   │       │           ├── LineParser.class
│   │       │           ├── LogLineParser.class
│   │       │           └── QueryStringParser.class
│   │       └── tests
│   │           └── io
│   │               └── avi
│   │                   └── tests
│   │                       ├── LogLineParserTest.class
│   │                       └── QueryStringParserTest.class
│   ├── pig-herder-udf.iml
│   └── tests
│       ├── src
│       │   └── io
│       │       └── avi
│       │           └── tests
│       │               ├── LogLineParserTest.java
│       │               └── QueryStringParserTest.java
│       └── tests.iml
├── prepare.sh
├── production_data
├── script.pig
├── start_pig.sh
└── upload.sh

data

All the data you need to test the Pig script locally. This usually includes a few log files or flat files and nothing more. It's basically a trimmed-down version of what's in production.

Let's say I want to run on Nginx logs in production; this would include 50K lines from a single log file. It's a good enough sample to make me confident the script will work well on production data too.
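
How you build that sample doesn't matter much; a one-liner like the following does the job (the Nginx log path here is just a placeholder for illustration, not part of pig-herder):

# Take the first 50K lines of a production log as the local test sample.
# /var/log/nginx/access.log is a placeholder path; use wherever your logs live.
head -n 50000 /var/log/nginx/access.log > data/small-log.log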

production_data

Same as data, but with a bigger sample; this will usually be 100K items exported from production, not from local/staging.

This gives a better idea of whether we need to sanitize the data after export, for example.

jars

All the jars the Pig script depends on. The prepare.sh script copies the compiled UDF jar into this folder as well.
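
The real prepare.sh ships with pig-herder; as a minimal sketch, assuming the artifact path shown in the tree above, it boils down to something like:

# Rough sketch of prepare.sh: copy the freshly built UDF artifact into the
# jars/ folder the Pig script loads from. The artifact path is taken from
# the directory tree above and may differ in your setup.
cp pig-herder-udf/out/artifacts/pig_herder/pig-herder.jar jars/
echo "Copied pig-herder.jar into jars/"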

launcher

The launcher dir usually includes a couple of files: one that boots the EMR cluster and another that submits the steps.

submit-to-amazon.sh


# Yesterday's date (note: -v-1d is BSD/macOS date syntax).
date_string=`date -v-1d +%F`

echo "Starting process on: $date_string"

# Create the cluster and pull the ClusterId out of the CLI's JSON response.
cluster_id=`aws emr create-cluster --name "$CLUSTER_NAME-$date_string" \
  --log-uri s3://$BUCKET_NAME/logs/ \
  --ami-version 3.8.0 \
  --applications Name=Hue Name=Hive Name=Pig \
  --use-default-roles --ec2-attributes KeyName=$KEY_NAME \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-action Path=s3://$BUCKET_NAME/bootstrap.sh | awk '$1=$1' ORS='' | grep ClusterId | awk '{ print $2 }' | sed s/\"//g | sed s/}//g`

echo "Cluster Created: $cluster_id"

# Submit the Pig step; CONTINUE keeps the cluster alive even if the step fails.
sh submit-steps.sh $cluster_id $date_string CONTINUE

submit-steps.sh


cluster_id=$1
date_string=$2
after_action=$3

# Add the Pig step to the running cluster. The -p parameters are the same
# ones script.pig expects when run locally via start_pig.sh.
aws emr add-steps --cluster-id $cluster_id --steps "Type=PIG,Name=\"Pig Program\",ActionOnFailure=$after_action,Args=[-f,s3://$BUCKET_NAME/script.pig,-p,lib_folder=/home/hadoop/pig/lib/,-p,input_location=s3://$BUCKET_NAME,-p,output_location=s3://$BUCKET_NAME,-p,jar_location=s3://$BUCKET_NAME,-p,output_file_name=output-$date_string]"


These scripts will start a cluster and submit the steps to Amazon. Keep in mind that you can also start a cluster that terminates itself when idle, and have it terminate if a step fails.

You can just pass --auto-terminate in submit-to-amazon.sh and TERMINATE_CLUSTER to submit-steps.sh, for example.
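
For example, the create-cluster call above only needs the extra flag, and the step submission just swaps its last argument; everything else stays the same:

# Same call as submit-to-amazon.sh, with --auto-terminate added so the
# cluster shuts down once its steps finish.
cluster_id=`aws emr create-cluster --name "$CLUSTER_NAME-$date_string" \
  --log-uri s3://$BUCKET_NAME/logs/ \
  --ami-version 3.8.0 \
  --applications Name=Hue Name=Hive Name=Pig \
  --use-default-roles --ec2-attributes KeyName=$KEY_NAME \
  --instance-type m3.xlarge --instance-count 3 \
  --auto-terminate \
  --bootstrap-action Path=s3://$BUCKET_NAME/bootstrap.sh | awk '$1=$1' ORS='' | grep ClusterId | awk '{ print $2 }' | sed s/\"//g | sed s/}//g`

# Tear the cluster down if the Pig step fails.
sh submit-steps.sh $cluster_id $date_string TERMINATE_CLUSTER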

README

Very important. You have to make the README as useful and as self-explanatory as possible.
If another engineer has to ask a question or can't figure out the whole flow from it, you have failed.

bootstrap.sh

bootstrap.sh is a shell script to bootstrap the cluster.

Usually, this only includes downloading the MySQL connector jar to the cluster so Pig can insert data into MySQL when it's done processing.

jar_filename=mysql-connector-java-5.1.35-bin.jar
# Pull the MySQL connector into Pig's classpath so the script can write its
# results to MySQL when processing is done.
cd $PIG_CLASSPATH
wget https://github.com/gogobot/hadoop-jars/raw/master/$jar_filename
chmod +x $jar_filename

We have a GitHub repo with the public jars we want to download. It's a really convenient way to distribute the public jars you need.

upload.sh

Upload everything you need to Amazon.

This step depends on having s3cmd installed and configured.

echo "Uploading Bootstrap actions"...
s3cmd put bootstrap.sh s3://$BUCKET_NAME/ >> /dev/null
echo "Uploading Pig Script"...
s3cmd put script.pig s3://$BUCKET_NAME/ >> /dev/null
echo "Uploading Jars..."
s3cmd put jars/pig-herder.jar s3://$BUCKET_NAME/ >> /dev/null
echo "Finished!"

Some projects include many jars, but usually I try to keep it simple: just my UDF and, often, piggybank.

start_pig.sh

Starts Pig locally with the exact same params that submit-steps.sh submits to production.
This means you can work with the same script both locally and in production, which makes testing much easier.
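
The shipped start_pig.sh has the real values; as a sketch, assuming local folders for the input, output and jars, it boils down to something like:

# Run the same script.pig locally, in Pig's local mode, with the same
# parameter names the EMR step uses in submit-steps.sh. The local paths
# here are assumptions; point them at your own sample data and jars.
pig -x local \
  -p lib_folder=./jars/ \
  -p input_location=./data \
  -p output_location=./output \
  -p jar_location=./jars \
  -p output_file_name=output-local \
  -f script.pig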

UDF Directory Structure

The directory structure for the Java project is also pretty simple. It includes two modules: main and tests. But after a lot of experimenting, I found that the dependency and artifact settings are crucial to making the project work in production.

Here's how I set the project up in IntelliJ. I believe the screenshots will be enough to convey most of it.

I start the project with just the hello world template (how fitting).

Starting the project

I then add the two modules, main and tests.

Adding modules

I have two library folders: libs, which does not get exported into the jar file, and internal_libs, which does.

tests and main both depend on both of them.

tests also depends on main, obviously, for the classes it tests.

The artifact only outputs the main module and all the extracted classes from internal_libs. Do not export or output the libs folder, or it will simply not work on EMR.

Artifacts
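
A quick way to sanity-check the artifact (assuming a JDK on your PATH) is to list what actually ended up inside the jar:

# You should see your io/avi/... classes plus the classes extracted from
# internal_libs, and nothing from libs/ (pig, hadoop-core, junit, hamcrest).
jar tf pig-herder-udf/out/artifacts/pig_herder/pig-herder.jar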

Get going

Once you familiarize yourself with the structure, it’s really easy to get going.

pig-herder includes a full working sample that analyzes logs for impression counts. Nothing crazy, but really useful for getting going.

Enjoy (Open Source)

pig-herder is open source on GitHub: https://github.com/kensodev/pig-herder

Feel free to provide feedback/ask questions.