Pig is nothing but a "high-level data-flow scripting language." What I mean by high-level is that Pig takes care of all the low-level programming you would otherwise have to write yourself when using MapReduce to run your jobs on a Hadoop cluster; with raw MapReduce you need to write a lot of Java code just to talk to your cluster and retrieve data.

Think of it like assembly language: when you write a program in assembly, you talk to the machine directly and have to take care of all the small details yourself (memory management, compiling, and so on). But when you write a program in Visual Basic or any other modern language, you talk to a library, and that library talks to the machine for you.

The language that Pig uses when talking to your data is called "Pig Latin." It is similar to SQL, but the scripts are written in a different style. As far as tooling is concerned, you are going to use the Grunt shell, Pig's command line, to write your ad-hoc queries or run your batches.
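To give you a feel for it, here is a minimal sketch of a Pig Latin session you could type into the Grunt shell. The file name and fields are hypothetical, assuming a tab-delimited file of names and ages:

```
-- Load a tab-delimited file into a relation (file and schema are made up)
users  = LOAD 'users.txt' AS (name:chararray, age:int);
-- Keep only the adults
adults = FILTER users BY age >= 18;
-- Print the result to the console
DUMP adults;
```

Each line defines a named relation that the next line can build on, which is exactly the step-by-step style we will compare with SQL below.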


Let's focus on comparing Pig and SQL. If you come from the SQL world, you may know that SQL is a declarative language: you write your SQL query as a whole and you expect the result as a whole. Pig, on the other hand, is a procedural language: you do things in Pig step by step. Below is a picture that shows the difference between Pig and SQL, with the same query written once in Pig and once in SQL.

[Figure: comparison between Pig and SQL, showing the same query written in both languages]

As you can see in the example above, a SQL query usually runs from the inside out: when your query contains one or more sub-queries, the sub-queries always run first and then the main query runs. In the Pig world, by contrast, execution always flows from top to bottom.
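Here is my own sketch of that difference (not the exact queries from the chart): a hypothetical report that counts orders per customer and keeps only the busy customers. The SQL version nests the aggregation inside one statement; the Pig Latin version spells out each step in order:

```
-- Equivalent SQL (the whole answer in one nested statement):
--   SELECT customer, COUNT(*) AS cnt
--   FROM orders
--   GROUP BY customer
--   HAVING COUNT(*) > 10;

-- Pig Latin version, top to bottom (file name and fields are made up):
orders  = LOAD 'orders.txt' AS (customer:chararray, amount:double);
grouped = GROUP orders BY customer;
counts  = FOREACH grouped GENERATE group AS customer, COUNT(orders) AS cnt;
busy    = FILTER counts BY cnt > 10;
DUMP busy;
```

Notice that in Pig every intermediate relation (grouped, counts) has a name you can inspect or reuse, while in SQL those intermediates are hidden inside the single statement.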

In Pig you always focus on three parts rather than on the result alone: Load, Transform, and Result. It is very similar to ETL (Extract, Transform, Load) in the relational-database world. There are a lot of great ETL tools out there for relational databases; I usually use Pentaho, but you can find plenty of other interesting alternatives.

But when it comes to the Pig world, you always focus on LTR ("Load, Transform, Result"), and that is why Pig scripts are clean to read and write: instead of focusing only on the final result, you work step by step. Because the data pipeline is open to you at every point, you can store your data at any stage and then continue working with just the data you stored, across multiple stages of the LTR process.
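To show what I mean by storing data at multiple stages, here is a sketch (all file and field names are hypothetical) that checkpoints the pipeline with STORE after each LTR stage:

```
-- Load: read the raw data (made-up log file)
raw     = LOAD 'logs.txt' AS (user:chararray, url:chararray);
STORE raw INTO 'stage1_raw';            -- checkpoint the loaded data

-- Transform: count hits per user
by_user = GROUP raw BY user;
hits    = FOREACH by_user GENERATE group AS user, COUNT(raw) AS n;
STORE hits INTO 'stage2_hits';          -- checkpoint the transformed data

-- Result: keep only the heavy users
top     = FILTER hits BY n > 100;
STORE top INTO 'final_result';          -- the final output
```

If a later step fails, or you want to explore the data differently tomorrow, you can LOAD any of those stored stages and pick up from there instead of rerunning the whole pipeline.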