Showing posts from August, 2016

Hadoop - How to setup a Hadoop Cluster

Below is the step-by-step guide I used to set up a Hadoop cluster.
Three VMs are involved:

1) NameNode, ResourceManager - Host name:
2) DataNode 1 - Host name:
3) DataNode 2 - Host name:

1) Create a new Hadoop user or use an existing user, but make sure the user has access to the Hadoop installation on ALL nodes.

2) Install Java. Refer here for a good version. In this guide, Java is installed at /usr/java/latest.

3) Download a stable version of Hadoop from Apache Mirrors

This guide is based on Hadoop 2.7.1 and assumes that we have created a user called hadoop.

Set up passphraseless SSH from the NameNode to all nodes.
1) Run the command

This command will ask you a series of questions; accepting the defaults is fine. It will then create a private key (id_rsa) and a public key (id_rsa.pub) in the user's .ssh directory (/home/hadoop/.ssh).

2) Copy the public key to all nodes with the following command:

ssh-copy-id -i /home/h…
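
As a rough illustration of the two SSH steps above, here is a minimal dry-run sketch. It only prints the commands so they can be reviewed first. The node names (namenode, datanode1, datanode2) are placeholders, since the guide does not show the real host names, and the key path is assumed from the /home/hadoop/.ssh directory mentioned earlier.

```shell
#!/bin/sh
# Dry-run sketch: prints the SSH setup commands instead of running them.
# Assumption: the key pair lives at the default location for the hadoop user.
KEY_FILE="/home/hadoop/.ssh/id_rsa"

print_ssh_setup() {
    # Step 1: generate an RSA key pair with an empty passphrase.
    echo "ssh-keygen -t rsa -N '' -f $KEY_FILE"
    # Step 2: copy the public key to every node given as an argument.
    for node in "$@"; do
        echo "ssh-copy-id -i $KEY_FILE.pub hadoop@$node"
    done
}

# Placeholder host names; replace them with the real ones.
print_ssh_setup namenode datanode1 datanode2
```

Once the printed commands look right, they can be run by hand on the NameNode as the hadoop user. ssh-copy-id will ask for the remote password once per node.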

JAVA - _JAVA_OPTIONS and JAVA_TOOL_OPTIONS environment variable

JAVA_TOOL_OPTIONS and _JAVA_OPTIONS are two useful environment variables that let users set JVM options through the environment rather than on the command line. However, they differ slightly.

1. Precedence - From my testing, the order of precedence (highest first) is

_JAVA_OPTIONS > command line > JAVA_TOOL_OPTIONS

Given this, _JAVA_OPTIONS and JAVA_TOOL_OPTIONS suit different use cases.

Use _JAVA_OPTIONS to override JVM options that have already been defined on the command line.

Use JAVA_TOOL_OPTIONS to supply additional JVM options alongside a predefined command line.
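
To try the two use cases on your own machine, a small sketch along these lines may help. The heap sizes are arbitrary illustrative values, and -XX:+PrintFlagsFinal is a HotSpot flag, so the check is skipped quietly if no java binary is on the PATH or the JVM does not understand it.

```shell
#!/bin/sh
# Illustrative values only; the heap sizes are not from the original post.
export JAVA_TOOL_OPTIONS="-Xmx256m"   # additional option for any JVM started from this shell
export _JAVA_OPTIONS="-Xmx1g"         # meant to override what the command line sets

# Start a JVM with a third -Xmx on the command line and print the effective
# maximum heap size, so you can see which of the three settings wins.
if command -v java >/dev/null 2>&1; then
    java -Xmx512m -XX:+PrintFlagsFinal -version 2>/dev/null | grep MaxHeapSize || true
fi
```

Unsetting one variable at a time (unset _JAVA_OPTIONS) and re-running the java line shows how each setting affects the result.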

2. Documentation - JAVA_TOOL_OPTIONS is well documented, but _JAVA_OPTIONS is not. So _JAVA_OPTIONS may not be officially supported.

3. Support - _JAVA_OPTIONS is Oracle-specific. The IBM Java equivalent is IBM_JAVA_OPTIONS. JAVA_TOOL_OPTIONS is platform independent.