Big Data? Hadoop? Sure I get it! What else?
Christmas holidays are over; back in business. Here we are, Eduardo Cereto and me, ready to move truckloads of big data :)
Raise your hands if you have a decent understanding of what Big Data is in technical terms. Or let's say 'Hadoop', which has lately become a collateral buzzword almost equivalent to Big Data.
Raise your hands if you have implemented and played around a bit with Hadoop and such.
Lastly, how many of you work on a daily basis with HiveQL or Pig Latin?
Well, I could mention NoSQL instead to lower the bar and it would probably make no difference.
So, why are so many people using those words in their daily chatter without a clue what they mean? Why do so many job offers for data analysts list guru-level mastery of those technologies as a requisite, when the people writing them barely understand them and have no idea whether they really need them?
Let's be honest: most of us manage to pay the bills working with just the basics, a bit of SQL to query the usual databases (Microsoft SQL Server or MySQL), Google Analytics to grab data from sites, and Excel, or Tableau if you are lucky, for reports, dashboards and visual analysis.
Don't get me wrong, I'm not an expert at all, far from it. I'm just trying to do my homework and share it with you. I'm not interested in discussing what Big Data is, a boring affair, but in explaining the technicalities around it as simply as possible.
In the beginning there was MapReduce
It's a programming model/framework for processing huge data sets using a large number of computers ('nodes'), commonly referred to as a 'cluster'. This distributed processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
In plain English: many computers working together to handle tons of data, no matter what kind.
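To make the model a bit more concrete, here is a minimal sketch of the classic MapReduce example, counting words, simulated in plain Python. In a real cluster the map and reduce phases run in parallel across the nodes; the function names and sample documents here are just my own illustration:

```python
from collections import defaultdict

def map_phase(document):
    # Each node emits a (word, 1) pair for every word in its chunk of data
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    # The pairs are grouped by key (the word) and the counts are summed
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "hadoop handles big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```

The trick is that nothing in the map or reduce steps needs to see the whole data set at once, which is exactly what lets a cluster split the work.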
Google Datacenter. Pic by Connie Zhou
Then Hadoop was born
Hadoop, an open-source project, was derived from Google's MapReduce and Google File System. It has become the most popular framework for data-intensive distributed applications.
On June 13 Facebook claimed they had the largest Hadoop cluster in the world, with 100 PB of storage. In November 2012 they said it was growing by half a PB per day. Crazy, huh?
Nice, but I still have no clue how to run a simple SELECT query on Hadoop.
Hives and Pigs
Hadoop excels as a processing platform but apparently sucks as a query tool, because it was designed as a batch-oriented system, nothing like the real-time query engine we all have in mind. The Facebook guys started developing Hive to get an easy-to-use, SQL-like query language called HiveQL.
In short, Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. How it works: a compiler translates HiveQL statements (queries) into MapReduce jobs, which are submitted to Hadoop for execution.
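To give you a feel for it, a HiveQL query looks almost exactly like plain SQL. The table and column names below are made up for illustration only:

```sql
-- Hypothetical pageviews table, just to show HiveQL's SQL-like flavor;
-- Hive's compiler turns this statement into one or more MapReduce jobs
SELECT country, COUNT(*) AS visits
FROM pageviews
GROUP BY country
ORDER BY visits DESC;
```

Behind that familiar syntax, Hadoop may be chewing through terabytes spread over hundreds of nodes.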
In Michael Driscoll's words: "Thus Hadoop has solved the challenge of economically processing data at scale. Hive has solved the challenge of hand-writing Hadoop queries".
Hive is not alone: Pig, from Yahoo! and now incubating at Apache, is another platform that accomplishes the same tasks and has its own query language, named Pig Latin.
- Query language: HiveQL, Pig Latin
- Translates queries into MapReduce jobs: Hive, Pig
- MapReduce jobs execution & data handling: Hadoop
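For comparison, the same hypothetical count written in Pig Latin reads more like a step-by-step data-flow script than a single SQL statement (again, the file and field names are invented):

```pig
-- Same idea in Pig Latin: load, group, count; Pig also compiles this to MapReduce
views = LOAD 'pageviews' AS (country:chararray, view_date:chararray);
grouped = GROUP views BY country;
counts = FOREACH grouped GENERATE group AS country, COUNT(views) AS visits;
DUMP counts;
```

Both end up as MapReduce jobs on Hadoop; picking one or the other is mostly a matter of taste and background.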
No SELECTs yet? Patience, let me finish; this evolves every day.
Next: Need for speed
Querying terabytes of data for some analysis and expecting results in the blink of an eye is not realistic. Technology is quickly getting very good at it, but the problem lies more in the amount of data to process than at the technology level, so there are still some challenges to solve.
Solutions to improve response time point, for example, to pre-processing the data into summarized chunks allocated in faster-access containers, which lets you load them comfortably into desktop analytics tools like Tableau, QlikView or such. Metamarkets is working in that direction.
It is not only about solving the size issue but the speed one too.
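In Hive terms, that pre-processing could be as simple as materializing a summary table in a nightly batch job and pointing your desktop tool at the small result instead of the raw data. The table names below are, once more, hypothetical:

```sql
-- Hypothetical nightly rollup: the heavy MapReduce work runs once, in batch,
-- and the compact daily_summary table is what Tableau or QlikView actually reads
CREATE TABLE daily_summary AS
SELECT view_date, country, COUNT(*) AS visits
FROM pageviews
GROUP BY view_date, country;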
Ok but how do I query Hive, Hadoop or whatever the heck it is?
All that sounds fantastic, but you have data to analyze and work to do, so please make my day and tell me how I extract the data I need by executing some SQL queries through something with a human-friendly interface.
Hive has a JDBC driver that integrates with SQuirreL SQL Client, which is more the classic database administration/query tool we all know, along the lines of Microsoft SQL Query Analyzer or phpMyAdmin, to name a couple.
It also integrates with Pentaho, a business analytics tool I just discovered while researching this post.
Not the most scientifically accurate explanation, but now you get it, right? For you and me, who don't make a living implementing complex frameworks for processing huge data sets on clusters, all we need to know is the same old SQL with a few new, specific additions.
Next time an HR person interviewing you for a data analyst position asks how much you know about Big Data and Hadoop, as the coolest question around the block, don't chicken out: if you are familiar with SELECTs, FROMs, GROUP BYs, JOINs and such, you are safe.