Hadoop Cluster Setup Using Ansible

Published by Anubhav Singh

We produce 2.5 quintillion bytes of data every day, and a Hadoop cluster is one solution for managing data at that scale.

A Hadoop cluster consists of a NameNode and DataNodes working together on the Hadoop Distributed File System (HDFS).

In this article, we are going to learn how to set up a Hadoop cluster using Ansible. Let me first give you a brief overview of what Hadoop and Ansible are.

What is Hadoop?

Apache Hadoop is an open-source software framework that allows for the distributed processing of large amounts of data. Hadoop uses the MapReduce programming model for processing, and its storage layer is the Hadoop Distributed File System (HDFS).

In a Hadoop cluster, there is one NameNode, which manages the filesystem metadata, while all the other nodes, the DataNodes, contribute their storage.

For more details about Hadoop and Big Data, read this.

What is Ansible?

Ansible is an open-source software provisioning, configuration management, and application-deployment tool that provides infrastructure as code.

Task Overview

Hadoop cluster setup using Ansible

  • Setup inventory file
  • Create required files
  • Create roles
  • Configure NameNode
  • Configure DataNode
  • Run Final Playbook

Setup inventory file

In Ansible, the inventory file holds the information about all the nodes to be configured, including the IP address, username, and password of each node.

In our case, we have two nodes: the NameNode and the DataNode.

[namenode]
<ip> ansible_ssh_user=<username> ansible_ssh_pass=<password>

[datanode]
<ip> ansible_ssh_user=<username> ansible_ssh_pass=<password>

e.g.

[datanode]
192.168.43.12 ansible_ssh_user=root ansible_ssh_pass=redhat 
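Ansible also needs to know where this inventory file lives. A minimal way to wire it up (assuming you saved the inventory as /etc/myinventory, a path I am making up for illustration) is in /etc/ansible/ansible.cfg:

[defaults]
inventory = /etc/myinventory
host_key_checking = False

Since we are logging in with passwords rather than SSH keys, the control node also needs the sshpass package installed for ansible_ssh_pass to work; disabling host_key_checking avoids the interactive host-key prompt on the first connection.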

Create required files

Namenode files

To configure the NameNode, we need two files configured on our system: core-site.xml and hdfs.xml.

Code for core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn-ip:port</value>
  </property>
</configuration>

Replace nn-ip with your NameNode IP and provide a port number (I used 9001 in my case).
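For example, if the NameNode's IP were 192.168.43.11 (a made-up address, just for illustration) and you chose port 9001, the value line would read:

<value>hdfs://192.168.43.11:9001</value>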

Code for hdfs.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mynn</value>
  </property>
</configuration>

Inside the <value> tag, replace /mynn with your directory name; in my case it is /mynn.

Datanode files

Code for core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn-ip:port</value>
  </property>
</configuration>

Use the NameNode IP and the same port number you configured in the NameNode file.

Code for hdfs.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/mynn</value>
  </property>
</configuration>

It is similar to the NameNode file, except that the property here is dfs.data.dir, the directory where this DataNode stores its blocks. Make sure this directory exists on the DataNode (see the note after the DataNode task file below).

Save all these files locally on your system. Later, we will copy them to our NameNode and DataNode.
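For reference, the four files on my local system end up laid out like this, matching the src paths used in the vars files later in this article:

/ansible/task11/
├── NNcore-site.xml    (NameNode core-site.xml)
├── NNhdfs.xml         (NameNode hdfs.xml)
├── DNcore-site.xml    (DataNode core-site.xml)
└── DNhdfs.xml         (DataNode hdfs.xml)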

Create roles

Roles are a better way to manage playbook code: tasks, variables, and templates each live in their own directory, which is the recommended structure.

In our case, we are going to create two roles-

  • NameNode
  • DataNode

To create a role, we use the command:

ansible-galaxy init <role_name>

e.g.

ansible-galaxy init NameNode

Create two roles: one NameNode and one DataNode.
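This generates the standard role skeleton (the exact layout may vary slightly with your Ansible version). The only files we will edit are tasks/main.yml and vars/main.yml:

NameNode/
├── defaults/
├── files/
├── handlers/
├── meta/
├── tasks/          <- tasks/main.yml lives here
├── templates/
├── tests/
└── vars/           <- vars/main.yml lives here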

Configure NameNode

Steps for the task file:

  • Upload the Hadoop and JDK software
  • Install JDK and Hadoop
  • Create a directory for the NameNode
  • Copy the core-site and HDFS files
  • Format the directory
  • Initialize the NameNode
  • Print the result

Open the file mentioned below in your editor of choice:

NameNode/tasks/main.yml

I am using vim as my text editor. Below is the code for all the steps above.

- name: 'To Upload Hadoop Software'
  copy:
    src: "{{ hadoop_software }}"
    dest: "{{ software_dest }}"

- name: 'To Upload JDK Software'
  copy:
    src: "{{ jdk_software }}"
    dest: "{{ software_dest }}"

- name: 'JDK Installation'
  shell: rpm -ivh "{{ jdk_software }}"

- name: 'Hadoop Installation'
  shell: rpm -ivh "{{ hadoop_software }}" --force

- name: 'Creation of a Directory for Namenode'
  file:
    state: directory
    path: "{{ namenode_folder }}"

- name: 'Copying Core-file'
  template:
    src: "{{ core_site_src }}"
    dest: "{{ core_site_dest }}"

- name: 'Copying HDFS file'
  template:
    src: "{{ hdfs_src }}"
    dest: "{{ hdfs_dest }}"

- name: "Formatting the Directory"
  shell: "echo Y | hadoop namenode -format"

- name: "Initializing namenode"
  shell: "hadoop-daemon.sh start namenode"

- name: "Result"
  shell: "jps"
  register: output

- debug:
    var: output

JDK is a prerequisite for installing Hadoop, which is why we first upload both packages to the node and then install them one by one.

The NameNode needs a dedicated directory of its own (the path we set in hdfs.xml) to keep track of the storage the DataNodes contribute, so we create it next.

To configure Hadoop, we have already prepared the two files; we just need to copy them onto the node.

Before we can use the directory, we need to format it.

Finally, we start the NameNode service and print the output of jps.
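One caveat: shell with rpm -ivh is not idempotent, so rerunning the playbook will fail once the packages are already installed. One possible workaround is the shell module's creates argument, shown here for the JDK task and assuming the Oracle JDK rpm unpacks under /usr/java (which jdk-8u171 does):

- name: 'JDK Installation'
  shell: rpm -ivh "{{ jdk_software }}"
  args:
    creates: /usr/java    # skip this task once the JDK install directory exists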

Steps for the vars file:

Set these according to your environment. Open the file at the location mentioned below:

NameNode/vars/main.yml

Paste all your variables here. In my case:

hadoop_software: "/root/hadoop-1.2.1-1.x86_64.rpm"
software_dest: "/root"
jdk_software: "/root/jdk-8u171-linux-x64.rpm"
core_site_src: "/ansible/task11/NNcore-site.xml"
core_site_dest: "/etc/hadoop/core-site.xml"
hdfs_src: "/ansible/task11/NNhdfs.xml"
hdfs_dest: "/etc/hadoop/hdfs-site.xml"
namenode_folder: "/mynn"

Configure DataNode

Steps for the task file:

  • Upload the Hadoop and JDK software
  • Install JDK and Hadoop
  • Copy the core-site and HDFS files
  • Initialize the DataNode
  • Print the result

Open the file mentioned below in your editor of choice:

DataNode/tasks/main.yml

- name: 'Hadoop Software'
  copy:
    src: "{{ hadoop_software }}"
    dest: "{{ software_dest }}"

- name: 'JDK Software'
  copy:
    src: "{{ jdk_software }}"
    dest: "{{ software_dest }}"

- name: 'JDK Installation'
  shell: rpm -ivh "{{ jdk_software }}"

- name: 'Hadoop Installation'
  shell: rpm -ivh "{{ hadoop_software }}" --force

- name: 'Copying core-file'
  template:
    src: "{{ core_site_src }}"
    dest: "{{ core_site_dest }}"

- name: 'Copying HDFS file'
  template:
    src: "{{ hdfs_src }}"
    dest: "{{ hdfs_dest }}"

- name: "Initializing Datanode"
  shell: "hadoop-daemon.sh start datanode"

- name: "Result"
  shell: "jps"
  register: output

- debug:
    var: output

Most of the tasks are quite similar to the NameNode ones; the DataNode just skips the directory creation and formatting steps.
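One difference worth noting: since our hdfs.xml points dfs.data.dir at /mynn, that directory must exist on the DataNode before the daemon starts. The role above never creates it, so you may want an extra task like this (the datanode_folder variable is my own addition; define it in DataNode/vars/main.yml):

- name: 'Creation of a Directory for Datanode'
  file:
    state: directory
    path: "{{ datanode_folder }}"    # e.g. /mynn, the dfs.data.dir value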

Now create the vars file for this role as well. Open the file:

DataNode/vars/main.yml

Set these according to your environment. In my case, here is the file:

hadoop_software: "/root/hadoop-1.2.1-1.x86_64.rpm"
software_dest: "/root"
jdk_software: "/root/jdk-8u171-linux-x64.rpm"
core_site_src: "/ansible/task11/DNcore-site.xml"
core_site_dest: "/etc/hadoop/core-site.xml"
hdfs_src: "/ansible/task11/DNhdfs.xml"
hdfs_dest: "/etc/hadoop/hdfs-site.xml"

Now the complete coding part is done. Before setting up the Hadoop cluster, check whether both systems are reachable with the command:

ansible all -m ping

If the output is green and reports SUCCESS, all the systems are connected.
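The output should look roughly like this, one block per node (the IP is from the inventory example above):

192.168.43.12 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}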

Now we need a final playbook to run both roles one by one, and here it is:

- hosts: namenode
  roles:
    - NameNode

- hosts: datanode
  roles:
    - DataNode

Now run this final playbook:

ansible-playbook <file-name>

e.g.

ansible-playbook task.yml
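If the inventory file is not set in ansible.cfg, pass it explicitly with the -i flag (the inventory path here is the made-up example from earlier):

ansible-playbook -i /etc/myinventory task.yml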

This will give output like the following:

[Screenshot: NameNode setup of the Hadoop cluster]
[Screenshot: DataNode setup of the Hadoop cluster]

That's all! Our Hadoop cluster setup is done.
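To double-check from the NameNode that the DataNode has really joined the cluster, Hadoop 1.x also offers a report command that lists every live DataNode and the storage it contributes:

hadoop dfsadmin -report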

Final words

In this article, we have set up a Hadoop cluster using Ansible. All the configuration of the DataNode and the NameNode is fully automated. Try this out on your system and comment down below how useful it was for you. I have also covered some more amazing projects using Ansible, don't forget to check them out – Ansible Projects.

To get the complete code, visit – GitHub

You can connect with the author on LinkedIn.

You can also subscribe to our YouTube channel for further videos, and visit our website for more such interesting information and articles.

If you want to learn about the Linux operating system, you can read about it on this link.

Thank you for reading this post. If you have any queries, just comment down below; we will surely get back to you. Stay connected with us for more such interesting posts.

