Big Data - Hadoop Overview

May 07, 2013

I am starting to look into the Big Data area and feel that I had better look into Hadoop.

Big Data

Recently, everyone has been talking about big data, and I keep asking myself: what is big data? Big databases and file systems?

Gartner defines big data as follows: "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

High volume refers to the ever-growing volume of data and transactions. Data is measured in petabytes nowadays.

High velocity refers to the speed at which the data must be processed. Time-sensitive data analysis may require results within seconds.

High variety refers to the various types of data. Nowadays, data does not have to be structured. Unstructured data such as video streams, mouse clicks and sensor readings is very common.

So, my understanding is that big data is any data under the sun.

Why Hadoop?

This is the next question in my mind. Why Hadoop? Organizations have been happily using traditional data warehouses and RDBMSs to handle "big data" for decades, and RDBMSs are proven to be ACID compliant. After some reading, my conclusion is that an RDBMS is very well designed for structured data and for analyzing normalized datasets of a reasonable size. This structured data is often stored on a single server that has one big hard disk and multiple CPU cores.

However, in the current world, about 80% of data is said to be unstructured, and it is distributed globally. In such an environment, an RDBMS becomes very inefficient. Some reasons are:

1. Moving terabytes of data across systems to a centralized RDBMS for analysis is infeasible.
2. Processing this massive amount of data with the limited bandwidth and hardware resources of a centralized RDBMS is very time consuming.
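
To put a rough number on the first point, here is a back-of-the-envelope sketch. The data size (10 TB) and the link speed (1 Gbit/s) are my own assumptions picked for illustration, not measurements, and the calculation ignores protocol overhead and disk speed.

```python
def transfer_hours(data_terabytes: float, link_megabits_per_sec: float) -> float:
    """Hours needed to push the data through the link, ignoring all overhead."""
    data_bits = data_terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = data_bits / (link_megabits_per_sec * 1e6)
    return seconds / 3600


# Assumed figures: 10 TB of data over a fully used 1 Gbit/s link.
hours = transfer_hours(10, 1000)
print(f"{hours:.1f} hours")  # prints "22.2 hours"
```

Even in this best case, the data spends most of a day on the wire before any analysis starts, which is why moving the computation to the data looks attractive.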

So, if you encounter the above headaches with your RDBMS, it is time for you to consider Hadoop.

Hadoop is

Cheap, as it can run on commodity hardware
Scalable, as servers can be added as required
Distributed, as no single node stores all the data
Parallel, as each node runs jobs on its own set of data in parallel
Suitable for handling massive amounts of unstructured and semi-structured data
Reliable, as it replicates data across multiple nodes
Code based, as each Hadoop server can run different job code on its own set of data
Network efficient, as only the processed results from each slave node are sent to the master node, which compiles the overall result
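
The "parallel" and "network efficient" points above can be pictured with a small sketch. This is plain Python, not the Hadoop API: the node names, the log lines and the two functions are all hypothetical, and the "nodes" are simply processed one after another in a single process.

```python
from collections import Counter

# Hypothetical data: each "slave node" holds only its own slice of the log lines.
node_data = {
    "node1": ["error disk full", "info job started"],
    "node2": ["error network down", "error disk full"],
    "node3": ["info job finished"],
}


def process_locally(lines):
    """Would run on a slave node: count the words in the local slice only."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def compile_results(partial_counts):
    """Would run on the master node: merge the small per-node summaries."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Only the per-node summaries travel to the master, never the raw lines.
partials = [process_locally(lines) for lines in node_data.values()]
total = compile_results(partials)
print(total["error"])  # prints 3
```

The raw lines never leave their node in this picture. What reaches the master is one small word-count table per node, which is the idea behind the network efficiency claim.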

Below are some architecture diagrams that I found through Google that may help in understanding more about Hadoop.

To achieve the above, Hadoop has 2 main components: HDFS and MapReduce.
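
As a first mental model of the storage side, here is a toy sketch of how a file could be split into blocks and each block copied to several nodes. It is not how HDFS actually chooses nodes: the round-robin placement, the function name `place_blocks`, and the block size, file size and replication factor are all assumptions made up for illustration.

```python
import math


def place_blocks(file_size_mb, block_size_mb, nodes, replication):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Round-robin choice, purely for illustration (not the real HDFS policy).
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


# Assumed figures: a 640 MB file, 128 MB blocks, 5 nodes, 3 copies of each block.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(640, 128, nodes, replication=3)
for block, holders in placement.items():
    print(block, holders)
```

With these numbers the file becomes 5 blocks, every block lives on 3 different nodes, and no node holds the whole file, so losing one node does not lose any block.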

I need to read more before I have enough insight to post about HDFS and MapReduce.

Reference:
http://en.wikipedia.org/wiki/Apache_Hadoop
http://www.ebizq.net/blogs/enterprise/2009/09/10_ways_to_complement_the_ente.php
http://www-01.ibm.com/software/ebusiness/jstart/hadoop/

Thompson's Technological Insight
