Big Data? Hadoop? Sure I get it! What else?

Christmas holidays are over, back in business. Here we are, Eduardo Cereto and me, ready to move truckloads of big data :)

Truckloads of Big Data

Raise your hand if you have a decent understanding of what Big Data is in technical terms. Or let's say 'Hadoop', which has lately become a collateral buzzword, almost a synonym for Big Data.

Raise your hand if you have actually implemented Hadoop and played around with it a bit.

Lastly, how many of you work on a daily basis with HiveQL or Pig Latin?

Well, I could mention NoSQL instead to lower the bar and it would probably make no difference.

So, why do so many people use those words in their daily chatter without a clue of what they mean? Why do so many job offers for data analysts list guru-level knowledge of those technologies as a requisite, when the recruiters barely understand them and have no idea if they really need them?

Let's be honest: most of us manage to pay the invoices working with just the basics, a bit of SQL to query the usual databases (Microsoft or MySQL), Google Analytics to grab data from sites, and Excel (Tableau if you are lucky) for reports, dashboards and visual analysis.

Don't get me wrong, I'm no expert at all, far from it. I'm just trying to do my homework and share it with you. I'm not interested in discussing what Big Data is, boring thing, but in explaining the technicalities around it in the simplest way possible.

In the beginning there was MapReduce

It's a programming model/framework for processing huge data sets using a large number of computers, 'nodes', commonly referred to as a 'cluster'. This distributed processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

In plain English, we have many computers working together to handle tons of data, no matter what kind.
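The model is easier to grasp with a toy example. Here's a minimal, single-machine sketch in Python of the classic MapReduce demo, word counting; real Hadoop distributes the map and reduce phases across the nodes of a cluster, but the phases themselves are the same:

```python
# Toy word-count job in the MapReduce style (single machine, for illustration).
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: emit a (key, value) pair for every word found.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key, then reduce each group to one value.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

docs = ["big data big clusters", "big data everywhere"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(dict(reduce_phase(pairs)))
# {'big': 3, 'clusters': 1, 'data': 2, 'everywhere': 1}
```

The trick is that map and reduce tasks are independent of each other, so Hadoop can scatter them over hundreds of nodes and just merge the results.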

Google Datacenter. Pic by Connie Zhou

MapReduce usually refers to Google's implementation of that model, although NoSQL databases, like the popular MongoDB, can also provide commands to perform map-reduce operations.

Then Hadoop was born

Hadoop, open source, was derived from Google's MapReduce and Google File System. It has become the most popular framework for data-intensive distributed applications.


On June 13 Facebook claimed they had the largest Hadoop cluster in the world, with 100 PB of storage. In November 2012 they said it grows by half a PB per day. Crazy, huh?

Nice, but I still have no clue how to run a simple SELECT query on Hadoop.

Hives and Pigs

Hadoop excels as a processing platform, but apparently it sucks as a query tool because it was designed as a batch-oriented system, nothing like the real-time query engine we all have in mind. The Facebook guys started developing Hive to get an easy-to-use, SQL-like query language called HiveQL.

In short, Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query and analysis. How it works: a compiler translates HiveQL statements (queries) into MapReduce jobs, which are submitted to Hadoop for execution.
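To make that translation less abstract, here's a hypothetical sketch in Python of what a simple HiveQL GROUP BY boils down to once Hive turns it into a MapReduce job (the table and column names are made up for illustration):

```python
# The query  SELECT country, COUNT(*) FROM visits GROUP BY country;
# becomes, roughly, one MapReduce job: map emits (country, 1),
# reduce sums the counts for each country.
from collections import defaultdict

visits = [{"country": "ES"}, {"country": "US"}, {"country": "ES"}]

def map_task(row):
    yield (row["country"], 1)

def reduce_task(key, values):
    return (key, sum(values))

grouped = defaultdict(list)
for row in visits:                       # map phase
    for key, value in map_task(row):
        grouped[key].append(value)       # shuffle: collect values per key

result = dict(reduce_task(k, v) for k, v in grouped.items())  # reduce phase
print(result)
# {'ES': 2, 'US': 1}
```

You write the SELECT; Hive writes the map and reduce code for you.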

In Michael Driscoll's words: "Thus Hadoop has solved the challenge of economically processing data at scale. Hive has solved the challenge of hand-writing Hadoop queries".

Hive is not alone: Pig, from Yahoo!, now incubating at Apache, is another platform for accomplishing the same tasks, with its own query language named Pig Latin.


  • Query language: HiveQL, Pig Latin
  • Translates queries into MapReduce jobs: Hive, Pig
  • MapReduce jobs execution & data handling: Hadoop

No SELECTs yet? Patience, let me finish; this evolves every day.

Next: Need for speed

Querying terabytes of data and expecting results in the blink of an eye is not realistic. Technology is quickly getting very good at it, but the problem lies more in the sheer amount of data to process than in the technology itself, so there are still challenges to solve.

Solutions to improve response time point, for example, to pre-processing the data into summarized chunks stored in faster-access containers, which lets you load them comfortably into desktop analytics tools like Tableau, QlikView and such. Metamarkets is working in that direction.
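The pre-aggregation idea in a nutshell, sketched in Python with made-up data: roll up the raw events once, then let the desktop tool query the small summary instead of the huge raw log:

```python
# Hypothetical raw event log: one (day, page) tuple per page view.
from collections import Counter

raw_events = [
    ("2013-01-10", "home"), ("2013-01-10", "blog"),
    ("2013-01-10", "home"), ("2013-01-11", "home"),
]

# Pre-processing step: summarize page views per (day, page) once,
# so later lookups read the tiny summary, not millions of raw rows.
summary = Counter(raw_events)
print(summary[("2013-01-10", "home")])
# 2
```

With billions of events the raw table is cluster territory, but the daily summary fits on a laptop.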

Other solutions, like the Cloudera Impala project, try to accelerate things by accessing the data directly through a specialized distributed query engine, similar to Google's Dremel or Apache Drill.

It is not only about solving the size problem, but the speed problem too.

Ok, but how do I query Hive, Hadoop or whatever the heck it is?

All that sounds fantastic, but you have data to analyze and work to do, so please, make my day and tell me how I extract the data I need by executing some SQL queries through something with a human-friendly interface.

Hive has a JDBC driver that integrates with SQuirreL SQL Client, which is more the classic database administration/query tool we all know, in the lines of Microsoft SQL Query Analyzer or phpMyAdmin, to name a couple.

SQL Query Analyzer Tool

It also integrates with Pentaho, a business analytics tool I just discovered while researching this post.

Not the most scientifically accurate explanation, but now you get it, right? For those of us who don't make a living implementing complex frameworks for processing huge data sets on clusters, all we need to know is the same old SQL, plus a few new specific additions.

Next time an HR person interviews you for a data analyst position and asks how much you know about Big Data and Hadoop as the coolest question around the block, don't chicken out: if you are familiar with SELECTs, FROMs, GROUP BYs, JOINs and such, you are safe.

Jan 11, 2013
Written by:
Filed under: Analytics
