Sunday, June 29, 2014

Big data: It all started with Google and continues that way...

My 40th birthday is approaching, and I have started to think about what has changed in the last decade, especially in IT; these are my reflections on Big Data.

A decade ago, engineers at Google published papers detailing a new way to analyze huge stores of information. Data was spread in "small" chunks across thousands of servers. When you asked a question, the query was processed by all those servers in parallel, and you got an answer, usually fast enough. They called this method MapReduce.

Then engineers at Yahoo helped turn MapReduce into an open source project called Apache Hadoop. Now everything related to Big Data is somehow related to Hadoop, which has been the hype term for Big Data for some years. What does Hadoop do? It does MapReduce!

For Hadoop to make sense, you need several interconnected nodes, so that when you "ask" something, your query is distributed among them.

I think IBM has done a superb job explaining what MapReduce is; you can find it here. It takes 2 minutes to read, and at your next nerd cocktail party you will be able to talk about Hadoop with everyone.

Here is an extract of IBM's explanation:

"As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion."

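To make the census analogy concrete, here is a minimal sketch in plain Python. It is not Hadoop code and runs on a single machine; the city names, the residents and the helper names (count_city, add_counts, census) are all made up for illustration. The point is only the shape of the idea: count each city independently (map), then add the counts together (reduce).

```python
from functools import reduce

# Hypothetical census data: city -> list of residents (invented for this example).
CITIES = {
    "Roma": ["Gaius", "Livia", "Marcus"],
    "Ostia": ["Quintus", "Julia"],
    "Capua": ["Titus", "Claudia", "Lucius", "Octavia"],
}


def count_city(residents):
    """Map step: one census taker counts the people in a single city."""
    return len(residents)


def add_counts(total, city_count):
    """Reduce step: the capital adds one city's count to the running total."""
    return total + city_count


def census(cities):
    # Map: each city is counted independently of the others.
    # On a real cluster these counts would run in parallel on different nodes.
    per_city = map(count_city, cities.values())
    # Reduce: the per-city counts are combined into a single number.
    return reduce(add_counts, per_city, 0)


empire_population = census(CITIES)
print(empire_population)
```

Because no city's count depends on any other city's, the map step can be handed to as many census takers (nodes) as you have; only the final sum needs everything in one place.
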
It seems logical that Google processes more information than any other company in the world and, because of that, creates tools to handle huge volumes of data; those tools seem to be 2-5 years ahead of all others. On June 25, 2014, they announced a new technology called Google Cloud Dataflow; in short, Cloud Dataflow is a successor to MapReduce...

It is still too soon (at least for me) to understand the main differences, but so far I think Big Data started with Google, and they are still setting the pace.