Tuesday, May 7, 2013

Big Data - Hadoop Overview

Starting to look into the Big Data area, and I feel that I had better look into Hadoop.

Big Data

Recently, everyone talks about big data, and I am asking myself: what is big data? Big databases and file systems?

Gartner defines Big Data as "high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

High volume refers to the ever-growing volume of transaction data. Data sets are in the petabytes nowadays.

High velocity refers to the speed at which the data must be processed. Time-sensitive analysis may require the data to be processed within seconds.

High variety refers to the various types of data. Nowadays, data is not necessarily structured. Unstructured data such as video streams, mouse clicks, sensor readings, etc. are very common.

So, my understanding is that big data is any data under the sun.

Why Hadoop?

This is the next question in my mind. Why Hadoop? Organizations have been happily using traditional data warehouses and RDBMSs to handle "big data" for decades, and the RDBMS is proven to be ACID. After some reading, my conclusion is that an RDBMS is very well designed for structured data and for analyzing a reasonably sized, normalized dataset. Such structured data is often stored on a single server with one big hard disk and multiple CPU cores.

However, in the current world, 80% of the data is unstructured and distributed globally. In such an environment, an RDBMS becomes very inefficient. Some reasons are:

1. Moving terabytes of data across systems to a centralized RDBMS for analysis is infeasible.
2. Processing this massive data with the limited bandwidth and hardware resources of a centralized RDBMS is very time consuming.

So, if you encounter the above headaches with your RDBMS, it is time for you to consider Hadoop.

Hadoop is
  • Cheap, as it runs on commodity hardware
  • Scalable, as servers can be added as required
  • Distributed, as no single node stores all the data
  • Parallel, as each node runs jobs on its own subset of the data in parallel
  • Suitable for handling massive amounts of unstructured and semi-structured data
  • Reliable, as it replicates data across multiple nodes
  • Code-shipping, as each Hadoop server can run the job code against its own specific set of data (the code moves to the data, not the other way around)
  • Network efficient, as only the processed results at each slave node are sent to the master node for compilation into the final result
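The "network efficient" point above can be sketched in plain Python (a conceptual illustration only; real Hadoop jobs are written in Java against the MapReduce API, and the node/shard setup here is hypothetical):

```python
from collections import Counter

# Hypothetical setup: three "slave nodes", each holding its own shard
# of a large log. Instead of shipping raw records to the master, each
# node aggregates locally and sends back only a compact partial result.
shards = [
    ["error", "ok", "ok"],        # data local to node 1
    ["error", "error", "ok"],     # data local to node 2
    ["ok", "ok", "ok", "error"],  # data local to node 3
]

def process_locally(shard):
    """Runs on each slave node: reduce the shard to a small summary."""
    return Counter(shard)

# Only the small per-node Counters travel over the network; the master
# merges them into the final result.
partials = [process_locally(s) for s in shards]
total = sum(partials, Counter())
print(total["error"])  # 4
print(total["ok"])     # 6
```

The key design choice is that the large raw data never leaves its node; only the tiny summaries do.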

Below are some architecture diagrams that I found on Google that may help in understanding more about Hadoop.

To achieve the above, Hadoop has 2 main components: HDFS and MapReduce.
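To get a feel for the MapReduce half, here is a minimal sketch of the map/shuffle/reduce phases in plain Python (again only an illustration; real Hadoop jobs implement Mapper and Reducer classes in Java, and the function names here are my own):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (key, value) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key across mapper outputs.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values for one key into the final count.
    return key, sum(values)

lines = ["big data big hadoop", "hadoop big"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real cluster the map and reduce calls run on different nodes and HDFS supplies the input splits, but the data flow is the same.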

I need to read more before I can post with more insight about HDFS and MapReduce.

