Sunday, 16 November 2014

Data warehouse, this is Big Data: Hauska tutustua!

In case you are wondering, "Hauska tutustua" means "nice to meet you". Although I've been living in Finland for 3 years, I have to admit that my Finnish language skills are close to zero, but I have known this expression since my third day here.

Going back to the topic:

One of my favourite definitions of a data warehouse is the one from Bill Inmon:
"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."


One great definition for Big Data is this one from IBM:
"Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data."


I can imagine some uses of pure Big Data like:

Security: Intelligent algorithms crawling over millions of logs from the devices in our networks (routers, firewalls, etc.), trying to detect anomalies (possible hacking attempts) and alerting the digital security officers.

Online patterns: Analysing every aspect of customer behaviour: where they click, how much time they spend looking at a specific product before they click "buy", etc.

And many others... 

But what about the relation between Big Data and the data warehouse: should it exist? Or should big data replace the data warehouse?


My answers are yes (for the relation) and no (for replacing it).

The yes comes from personal ideas like this:
Big data can preprocess tons of data and, at the end, provide simple KPIs that can be loaded into the subject-oriented data warehouse.
Imagine a sales warehouse where we have data like what has been sold, to whom, for what amount, etc. We can easily add a new KPI to the data warehouse, like the number of positive and negative reviews of those products on social media.


In this table, the yellow-coloured KPIs come from our transactional sales system and were loaded into our data warehouse; the green-coloured ones were first processed by our big data solution, and then the results were also loaded into the data warehouse.

Let's put some numbers on this: from the transactional system we loaded 100,000 transactions for the yellow columns, and for the green column big data processed 10,000,000 posts from Facebook and Twitter about the products in different parts of the world and provided us with 4 records, which were then loaded into the data warehouse.
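The roll-up from many posts to a handful of KPI records can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the sample posts, the field names (`product`, `sentiment`) and the helper `review_kpis` are invented for illustration, the sentiment is assumed to be already classified, and a real big data solution would run this kind of aggregation distributed over millions of posts.

```python
from collections import Counter

# Hypothetical, already-classified social media posts (invented sample data).
posts = [
    {"product": "Phone A", "sentiment": "positive"},
    {"product": "Phone A", "sentiment": "negative"},
    {"product": "Phone A", "sentiment": "positive"},
    {"product": "Tablet B", "sentiment": "negative"},
    {"product": "Tablet B", "sentiment": "positive"},
]


def review_kpis(posts):
    """Collapse individual posts into one KPI row per product."""
    counts = Counter((p["product"], p["sentiment"]) for p in posts)
    products = sorted({p["product"] for p in posts})
    return [
        {
            "product": name,
            "positive_reviews": counts[(name, "positive")],
            "negative_reviews": counts[(name, "negative")],
        }
        for name in products
    ]


kpi_rows = review_kpis(posts)
for row in kpi_rows:
    print(row)
```

Only the small `kpi_rows` result would be loaded into the data warehouse; the raw posts stay in the big data platform.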

Since companies have invested a lot of time in building and connecting their data warehouses to all their transactional systems, replacing them with new systems powered by big data is not a trivial task; at least for some years, I think both technologies will co-exist and need to be integrated.

Have a great Sunday!





Friday, 3 October 2014

Watch out SAP HANA! IBM BLU is here; now with support for DSOs, PSAs, and Characteristics!




Update 22-Nov-2014: The post was removed from SDN by the moderators.

Update 7-Oct-2014: Follow the discussion on SDN:
Watch out SAP HANA! IBM BLU is here; now with support for DSOs, PSAs, and Characteristics!


In December 2013, IBM and SAP announced that you can use IBM BLU acceleration for SAP BW InfoCubes. This was announced through the following SAP note: 1889656 - DB6: Mandatory SAP NW BW corrections for BLU Acceleration, dated 04.12.2013.

In September 2014 they announced that, thanks to the DB2 Cancún release, you can also use IBM BLU with SAP BW Characteristics (master data), DSOs and PSAs. This was announced via SAP note: 1997314 - DB6: Enablement of BLU Acceleration for PSA, DSOs, and Characteristics InfoObjects, dated 29.09.2014.

I know that SAP HANA and IBM BLU are not directly comparable, but they share some interesting things like in-memory and columnar storage.

I've experimented with BLU for SAP BW only for InfoCubes, and you can notice the difference straight away.

You can find advantages and disadvantages in both approaches (HANA and BLU). In these times, when the economy is not at its best, I think IBM BLU is a really interesting option, especially since you can go live gradually, meaning one cube at a time, and maybe on the same hardware or with just a small upgrade.

What are your thoughts?


Interested in BLU? Reading this material from IBM is a great start: http://www.redbooks.ibm.com/redbooks/pdfs/sg248212.pdf (chapter 6)


Tuesday, 9 September 2014

Privacy in my email!

Yesterday, 8 September 2014, I discovered a tool that can be very useful. It is called Signals and is developed by HubSpot.

This tool lets the sender of an email know when and how many times the recipient has opened or read it. Let's look at an example: Juan has Signals installed (you can install it too) and sends an email to Pedro. Juan is a salesman; the moment Pedro reads Juan's message, possibly a commercial offer, Signals notifies Juan that Pedro has just read it. Juan can then, for example, decide to call Pedro right away to close the deal. As you can see, it seems very useful, but if we put ourselves in Pedro's place, it may feel "invasive" that someone can know every time we read our email. In addition, Signals keeps a log of every time Pedro read Juan's email.

Like many things in life, this has two sides, a good one and a bad one, depending on the role we play in the story.

I ended up on Pedro's side and felt invaded. Luckily, we can protect ourselves easily. Below I show how to protect yourself if you use Gmail, and then I explain how Signals works, so that you can also protect yourself if you use another email platform.

How to stop Signals users when our email is Gmail

To keep Gmail from notifying Signals users that we have opened or read their emails, we can do the following:



On the Gmail main page, click the gear icon (1) and then click "Settings" (2), that is, the points marked 1 and 2.

Then this screen will appear:


Choose the option shown, and don't forget to press the save button at the bottom of the page.

Done! Now users of Signals or similar services will not be able to tell when you read or opened the emails they sent you, unless you select the option to display images in the received message, marked in green in the following image:



How services like Signals work

The previous step showed how to stop Signals if you use Gmail. If you use another email platform, the options may be different, so here is how Signals works.

When the sender installs Signals, every email they send contains a hidden (invisible) image. When the recipient opens (reads) the message, this image is requested from one of the Signals servers, so Signals knows that the email was opened or read. Although Gmail anonymises the request for that image to protect your privacy, in this case it does not help, because Signals makes each image unique per message, so that if an image is requested (by anyone) Signals knows which message that image belongs to.

This is an example of the code that Signals inserts into messages:

<img src=3D"http://t.signauxdix.com/img.gif?ukey=3DagxzfnNpZ25hbHNjcnhyGAsS=
C1VzZXJQcm9maWxlGICAwKO01PEJDA&amp;key=3Dbfc14f6a-f380-4332-cb49-6aa3d1a272=
4c" width=3D"1" height=3D"1" style=3D"display:none">
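The snippet above is shown as it appears in the raw message source, in quoted-printable encoding: `=3D` stands for `=`, and a trailing `=` is a soft line break. As a rough sketch of how such hidden images could be spotted on any email platform, the Python code below decodes a raw body and lists images that are 1x1 pixels or styled with `display:none`. The class and function names (`PixelFinder`, `find_tracking_pixels`) and the `example.com` sample are invented for illustration, and the heuristic is an assumption: it will miss trackers that use other tricks.

```python
import quopri
from html.parser import HTMLParser


class PixelFinder(HTMLParser):
    """Collects the src of <img> tags that look like tracking pixels."""

    def __init__(self):
        super().__init__()
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = {k: (v or "") for k, v in attrs}
        tiny = a.get("width") == "1" and a.get("height") == "1"
        hidden = "display:none" in a.get("style", "").replace(" ", "")
        if tiny or hidden:
            self.suspects.append(a.get("src", ""))


def find_tracking_pixels(raw_body):
    """Decode a quoted-printable HTML body and return suspicious image URLs."""
    html = quopri.decodestring(raw_body).decode("utf-8", errors="replace")
    finder = PixelFinder()
    finder.feed(html)
    return finder.suspects


# Invented sample in the same encoding as the snippet above.
sample = (
    b'<img src=3D"http://example.com/img.gif?key=3Dabc=\n'
    b'def" width=3D"1" height=3D"1" style=3D"display:none">'
    b'<img src=3D"logo.png" width=3D"200" height=3D"50">'
)
print(find_tracking_pixels(sample))
```

A visible logo is ignored, while the invisible image is reported together with the unique key that identifies the message.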




Wednesday, 9 July 2014

Is it time to look for alternatives to Google?

First of all, let's make it clear that I am a Google fan. I use most of its products and I am very satisfied with them.

The largest share of Google's revenue comes from advertising, which obviously means that Google invests a huge amount of money in making advertising ever more effective, basically by building a profile based on our searches. I have nothing against that, but the order in which Google presents my search results does bother me. That is, if I help a colleague search for an Android phone, it does not mean that I personally have an interest in those phones (I use a Nokia Lumia!). But Google's advanced algorithms may infer that I am interested in that topic and give it priority in my future searches.

If you are worried about your privacy, or tired of searches being too "personalised", I suggest you try this search engine for a week: https://duckduckgo.com. It does not personalise your searches, nor does it store them.

On this site (it is worth a look) you can see some examples of how Google builds a profile based on your searches.

Share your experience.


Monday, 7 July 2014

From virtualisation to containers, better than electric cars?

Spending a week in Berlin, after my last visit 23 years ago, made me think of the Commodore Amiga 500+. It was 1991 and I was an exchange student in Germany. The Commodore Amiga was pretty popular there, and I got to try awesome games with my classmates; of course, I ended up buying my favourite ones, like Lemmings. Years passed and my Commodore stopped working. I still wanted to play some of those great games, but it was no longer possible to buy an Amiga (it was discontinued in 1992). Then I discovered something new: I could run Amiga games on a PC using something called an emulator. Although emulators and virtualisation are not the same, as the guys from Computer World explain here, for me it was the beginning of a journey into emulators and virtualisation.

Some years later a friend of mine had a specialised piece of software that was really hard to configure, and every time his PC or hard disk crashed (which happened very often) he had to spend a lot of time and money configuring it all over again. It was then that the curse which haunts all of us who study software engineering (I was still at university at the time), or anything related to information technology, descended upon me: "Hey! You're studying something about computers, solve my problem!" my friend said.
The problem was straightforward: he wanted to configure the operating system and his software one last time, and then move this "package" (his specialised software and the operating system, already configured) to a new PC whenever his old one crashed, all in an easy and practical way. Searching the internet, I learned about virtual PCs (VPs). To use a virtual PC you need to install virtualisation software, and that's where the magic starts. You run the virtualisation software and, in a window on your desktop, you see what looks like a new computer booting up. On this brand-new computer you install an operating system, software, etc., exactly as you would on a new physical computer. So we installed my friend's software in a virtual PC, and he could now copy the VP (usually a huge folder) to a new PC every time his old one died. Then he could start the virtual machine that contained his software and be ready to continue working. Sounds like problem solved, right? Well, almost: now he was complaining that his software ran slower. The solution was to buy more RAM for the PC, because the hardware was now running two operating systems: the base (host) operating system, which consumes a lot of memory, and, inside the virtualisation software (which does not need a significant amount of memory by itself), another operating system, called the guest operating system, with memory requirements similar to the host's.

This is the idea behind virtualisation: multiple virtual computers running on top of one piece of hardware, all sharing and consuming the same physical resources such as RAM, processor, etc.
It is very practical to have virtual machines that are hardware agnostic, since they run on top of any hardware. Virtualisation also lets you make better use of your hardware by running multiple machines on it. For example, you can have one virtual server for your financial operations, which are heavy at month's end, and another virtual server for your logistics operations, which are intense in the middle of the month; this way you take full advantage of your hardware during the whole month.
I recommend reading this article to understand all aspects of virtualisation.

So far I have learned that one of the positive points of virtualisation is that software runs independently of the type of hardware. On the downside, every virtual machine needs its own instance of an operating system, which consumes resources. In clouds, where the number of virtual machines is really big, the quantity of resources needed by the guest operating systems also becomes considerable.

Around 2006 Linux introduced a very interesting solution to this problem: containers.
The idea behind containers is this: on top of one physical host there is only one operating system (no more resources wasted on a separate operating system for each virtual PC), which can run multiple instances of a program with a certain level of isolation, meaning that each instance believes it is running on a different machine, even with a different network address. Recently an implementation of containers called Docker (http://www.docker.com/) has been in the spotlight because companies like Google and Amazon are contributing to the project and support this container technology in their own clouds.
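The memory argument can be put into a back-of-the-envelope calculation. The Python sketch below is only an illustration under stated assumptions: the figures (10 instances, 1 GB per application, 2 GB per operating system) are invented, the function names are hypothetical, and the model ignores the overhead of the virtualisation software and of the container runtime itself.

```python
def vm_memory_gb(instances, app_gb, guest_os_gb, host_os_gb):
    """Host OS once, plus a guest OS and the application per virtual machine."""
    return host_os_gb + instances * (guest_os_gb + app_gb)


def container_memory_gb(instances, app_gb, host_os_gb):
    """One shared OS, plus only the application per container."""
    return host_os_gb + instances * app_gb


# Invented numbers, purely for illustration.
instances, app_gb, os_gb = 10, 1, 2
vms = vm_memory_gb(instances, app_gb, guest_os_gb=os_gb, host_os_gb=os_gb)
containers = container_memory_gb(instances, app_gb, host_os_gb=os_gb)
print(vms, containers, vms - containers)
```

With these made-up values the virtual machines need 32 GB and the containers 12 GB; the 20 GB difference is exactly the memory of the ten guest operating systems.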

The switch from virtualisation to containers can save the world more energy than switching to electric cars, according to this article published by Wired.

Now that you have had a glimpse of the difference between these two technologies, what is your opinion?

Image 1
A memory consumption comparison. Note that when using containers you don't have the guest OS (blue).







Sunday, 29 June 2014

Big data: It all started with Google and continues that way...

My 40th birthday is approaching and I have started to think about what has changed in the last decade, especially in IT; these are my reflections on Big Data.

A decade ago, people at Google published some papers detailing a new way to analyse huge stores of information. Data was spread in "small" chunks across thousands of servers. When you asked a question, the query was processed by all those servers in parallel, and you got an answer, usually fast enough. They described this method as MapReduce.
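The idea can be sketched on a single machine. This is a toy illustration, not Google's or Hadoop's implementation: the chunks, the sample log lines and the helper names `count_matches` and `combine` are invented, and the "servers" are simply worker threads.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Invented sample data: each inner list stands for the chunk held by one server.
chunks = [
    ["error disk full", "login ok", "error timeout"],
    ["login ok", "login ok"],
    ["error disk full"],
]


def count_matches(chunk, word="error"):
    """Map step: each server answers the question for its own chunk only."""
    return sum(1 for line in chunk if word in line.split())


def combine(total, partial):
    """Reduce step: merge the partial answers into one result."""
    return total + partial


# The chunks are processed in parallel; pool.map keeps the results in order.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_matches, chunks))

total_errors = reduce(combine, partial_counts, 0)
print(partial_counts, total_errors)
```

The question ("how many lines mention an error?") is answered per chunk first, and only the small partial counts travel to the place where they are added up.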

Then people at Yahoo decided to implement MapReduce as an open source project called Apache Hadoop. Now everything related to Big Data is somehow related to Hadoop, which has been the hype term for Big Data for some years. What does Hadoop do? It does MapReduce!

For Hadoop to make sense you need a number of interconnected nodes, so that when you "ask" something your query is distributed among them.

I think IBM has done a superb job of explaining what MapReduce is; you can find it here. It takes 2 minutes to read, and at your next nerd cocktail party you will be able to talk about Hadoop with everyone.

Here is an extract of IBM's explanation:

"As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion."

It seems logical to think that Google processes more information than any other company in the world and, because of that, creates tools to handle huge volumes of data; those tools seem to be 2-5 years ahead of all others. On June 25, 2014 they announced a new technology called Google Cloud Dataflow; in short, Cloud Dataflow is a successor to MapReduce...

It is still too soon (at least for me) to understand the main differences, but so far I think Big Data started with Google and they are still setting the pace.

Saturday, 12 April 2014

Why is Heartbleed different?

Heartbleed is not a virus, nor a worm, which are the computer security threats we are used to. We protect ourselves from those simply by having an antivirus and an up-to-date operating system (Windows, for example). In other words, normally it is our own computer that has the security problem. But this case is completely different: it is the computers (servers) of the big companies that have the problem, companies like Google, for example.

It is not known how many sites, or which ones, have been affected, but there have been plenty. Personally, I think more than one is already too many. Here is a list of the main affected sites.

Now let's understand how so many companies were affected. The cause is a free, open-source computer program (created and maintained by a group of independent people) called OpenSSL.

Since it was free and also considered among the most secure, thousands or even millions of companies decided to use it to encrypt the connections between our computers and theirs (the RAE defines "cifrar", to encrypt, as: to transcribe into digits, letters or symbols, according to a key, a message whose content is to be hidden). Every time we typed www.gmail.com into our web browser and then entered our password, the OpenSSL program on Gmail's servers took care of encrypting the dialogue between our computer and Google's, so that nobody else (other computers or people connected to the Internet, for example) could see our password or the content of our emails. But, as often happens with computer programs, OpenSSL had a bug. This recently discovered bug, named Heartbleed, allowed other people to see the content of the supposedly encrypted conversation between our computer and, for example, Gmail's. In this case that conversation contains our password and our emails.

As you can see, we as individuals cannot do anything on our personal computers to fix this vulnerability, but what we can do is the following:

  • Once the service provider (bank, email, shop, etc.) confirms that it has installed on its servers the latest version of OpenSSL, which does not contain the bug, change our passwords as soon as possible.
Some companies are very open and honest, admitting they have the problem and informing all their customers or subscribers by email about the risks; they also confirm when they have fixed the problem and that you should change your password. Let's not ignore these messages, and let's change our passwords.
  • We can also check whether a site has the problem using this tool and, if it does, contact the site and demand that they fix it, or close our account.

As you can see, this security incident is unique, and I believe it will change the history of computer security.

Have a happy weekend changing your passwords!


Thursday, 9 January 2014

Google's search capabilities! Looking for a new job, for example.

Don't forget about Google's search capabilities! Looking for a new job, for example?


Google has a lot of powerful search operators. For example, if you are looking for a new job and you have noticed that a lot of companies post those opportunities on taleo.net, you can run this query in Google: site:taleo.net BW. This will find all BW jobs on taleo.net. Spend some time improving your Google search skills for free with this course made by Google: http://www.google.com/insidesearch/landing/powersearching.html
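If you run this kind of query often, you can also build the search link yourself. The Python sketch below is just an illustration: the helper name `google_search_url` is invented, and it assumes Google's usual `https://www.google.com/search` address with the query passed in the `q` parameter.

```python
from urllib.parse import urlencode


def google_search_url(site, keywords):
    """Build a Google search URL restricted to one site via the site: operator."""
    query = f"site:{site} {keywords}"
    return "https://www.google.com/search?" + urlencode({"q": query})


url = google_search_url("taleo.net", "BW")
print(url)
```

Pasting the printed address into a browser should run the same search as typing site:taleo.net BW into the search box.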