Sunday, 16 November 2014

Data warehouse, this is Big Data: Hauska tutustua!

In case you are wondering, "Hauska tutustua" means "nice to meet you" in Finnish. Although I've been living in Finland for 3 years, I have to admit that my Finnish language skills are close to zero, but I have known this expression since my third day here.

Going back to the topic:

One of my favourite definitions of a data warehouse is the one from Bill Inmon:
"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."


One great definition for Big Data is this one from IBM:
"Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data."


I can imagine some uses of pure Big Data like:

Security: Intelligent algorithms crawling over millions of logs from the devices in our networks (routers, firewalls, etc.), trying to detect anomalies (possible hacking attempts) and alerting the digital security officers.

Online patterns: Analysing every aspect of the customers' behaviour: where they click, how much time they spend looking at a specific product before they click "buy", etc.

And many others... 

But what about a relationship between Big Data and the data warehouse: should one exist? Or should big data replace the data warehouse?


My answers are yes (for the relationship) and no (for the replacement).

The yes comes from personal ideas like this:
Big data can preprocess tons of data, and at the end provide simple KPIs that can be loaded into the subject-oriented data warehouse.
Imagine a sales warehouse where we have data like: what has been sold, to whom, for what amount, etc. We can easily add a new KPI to the data warehouse, such as the number of positive and negative reviews of those products on social media.


In this table, the yellow KPIs come from our transactional sales system and are loaded directly into our data warehouse, while the green ones were first processed by our big data solution, and the results were then loaded into the data warehouse as well.

Let's put some numbers on it: from the transactional system we loaded 100,000 transactions for the yellow columns, and for the green columns big data processed 10,000,000 posts from Facebook and Twitter about the products in different parts of the world, and provided us with just 4 records that are then loaded into the data warehouse.
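To make that reduction step a bit more concrete, here is a minimal Python sketch. It assumes the big data job has already tagged every post with a product and a sentiment (that tagging is the hard part and is not shown here). The sample posts, the function name summarise_reviews and the column names are all hypothetical, invented only for illustration.

```python
from collections import Counter

# Hypothetical input: (product, sentiment) pairs, standing in for the
# millions of already-classified social media posts.
posts = [
    ("Product A", "positive"),
    ("Product A", "positive"),
    ("Product A", "negative"),
    ("Product B", "negative"),
    ("Product B", "positive"),
    ("Product A", "positive"),
]


def summarise_reviews(tagged_posts):
    """Collapse (product, sentiment) pairs into one KPI row per product."""
    counts = Counter(tagged_posts)
    products = sorted({product for product, _ in tagged_posts})
    return [
        {
            "product": product,
            "positive_reviews": counts[(product, "positive")],
            "negative_reviews": counts[(product, "negative")],
        }
        for product in products
    ]


# One small row per product, ready to be loaded next to the sales KPIs.
review_kpis = summarise_reviews(posts)
for row in review_kpis:
    print(row)
```

However many posts go in, only one small row per product comes out, and that row is the only thing the data warehouse needs to store.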

Since companies have invested a lot of time in building and connecting their data warehouses to all their transactional systems, replacing them with new systems powered by big data is not a trivial task; at least for some years, I think both technologies will co-exist and need to be integrated.

Have a great Sunday!