Hadoop

About

Hadoop is a scalable, distributed computing solution provided by Apache. Similar to a queuing system, Hadoop distributes work across nodes, allowing large data sets to be processed in parallel.

Workflow

Installing Hadoop Manually to Shared Filesystem

  • Install dependencies for Hadoop (press ‘y’ to confirm the installation when prompted):

    [flight@gateway1 (scooby) ~]$ sudo yum install java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel.x86_64
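
  • Optionally, confirm the install succeeded; java -version should report an OpenJDK 1.8.0 build:

    [flight@gateway1 (scooby) ~]$ java -version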
    
  • Download Hadoop v3.2.1:

    [flight@gateway1 (scooby) ~]$ wget -O /tmp/hadoop.tgz http://tiny.cc/hadoop321
    
  • Decompress the Hadoop installation to shared storage:

    [flight@gateway1 (scooby) ~]$ cd /opt/apps
    [flight@gateway1 (scooby) apps]$ tar xzf /tmp/hadoop.tgz
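
  • The archive unpacks into a versioned directory, so the install should now be visible at /opt/apps/hadoop-3.2.1:

    [flight@gateway1 (scooby) apps]$ ls -d /opt/apps/hadoop-3.2.1
    /opt/apps/hadoop-3.2.1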
    
  • Edit line 54 in /opt/apps/hadoop-3.2.1/etc/hadoop/hadoop-env.sh to point to the Java installation as follows:

    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre
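
  • The exact path varies with the OpenJDK update installed; one way to confirm the correct value for your system is to fully resolve the java binary and drop the trailing /bin/java:

    [flight@gateway1 (scooby) ~]$ readlink -f $(which java)
    /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre/bin/java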
    

Downloading the Hadoop Job

These steps help set up the Hadoop environment and download a spreadsheet of data which Hadoop will sort into sales units per region.

  • Download and source Hadoop environment variables:

    [flight@gateway1 (scooby) ~]$ wget https://tinyurl.com/hadoopenv
    [flight@gateway1 (scooby) ~]$ source hadoopenv
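
  • The contents of hadoopenv are not reproduced here, but the later steps rely on it pointing at the shared install; a minimal sketch (assuming the /opt/apps/hadoop-3.2.1 install from above) would be:

    export HADOOP_HOME=/opt/apps/hadoop-3.2.1
    export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
    export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)

    The CLASSPATH line puts the Hadoop jars on the Java class path, which the compile step further down needs in order to build against the Hadoop API.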
    
  • Create job directory:

    [flight@gateway1 (scooby) ~]$ mkdir MapReduceTutorial
    [flight@gateway1 (scooby) ~]$ chmod 777 MapReduceTutorial
    
  • Download job data:

    [flight@gateway1 (scooby) ~]$ cd MapReduceTutorial
    [flight@gateway1 (scooby) MapReduceTutorial]$ wget -O hdfiles.zip https://tinyurl.com/hdinput1
    [flight@gateway1 (scooby) MapReduceTutorial]$ unzip -j hdfiles.zip
    
  • Check that job data files are present:

    [flight@gateway1 (scooby) MapReduceTutorial]$ ls
    desktop.ini  hdfiles.zip  SalesCountryDriver.java  SalesCountryReducer.java  SalesJan2009.csv  SalesMapper.java
    

Preparing the Hadoop Job

  • Compile the Java sources for the job:

    [flight@gateway1 (scooby) MapReduceTutorial]$ javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java
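
  • The classes compile into a SalesCountry/ package directory, which should now contain at least:

    [flight@gateway1 (scooby) MapReduceTutorial]$ ls SalesCountry
    SalesCountryDriver.class  SalesCountryReducer.class  SalesMapper.class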
    
  • Create a manifest file:

    [flight@gateway1 (scooby) MapReduceTutorial]$ echo "Main-Class: SalesCountry.SalesCountryDriver" >> Manifest.txt
    
  • Package the compiled classes and manifest into a jar file for the job:

    [flight@gateway1 (scooby) MapReduceTutorial]$ jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class
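
  • Optionally, list the archive contents to verify that the manifest and the SalesCountry classes were packaged:

    [flight@gateway1 (scooby) MapReduceTutorial]$ jar tf ProductSalePerCountry.jar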
    

Starting the Hadoop Environment

  • Start the Hadoop distributed file system service:

    [flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-dfs.sh
    
  • Start the YARN resource manager and node manager services:

    [flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-yarn.sh
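
  • jps can be used to check that the daemons came up; a single-node setup would typically show NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact set depends on the cluster layout):

    [flight@gateway1 (scooby) MapReduceTutorial]$ jps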
    
  • Create a directory for the input data and copy the sales data into it:

    [flight@gateway1 (scooby) MapReduceTutorial]$ mkdir ~/inputMapReduce
    [flight@gateway1 (scooby) MapReduceTutorial]$ cp SalesJan2009.csv ~/inputMapReduce/
    
  • Check that the data is visible to the Hadoop file system:

    [flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -ls ~/inputMapReduce
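
  • This tutorial assumes Hadoop is configured to work over the shared filesystem, so the -ls above simply confirms the input directory is visible to it. On a cluster with a separate HDFS namespace the data would instead need to be copied in, e.g. with hdfs dfs -put:

    [flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -put ~/inputMapReduce /inputMapReduce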
    

Running the Hadoop Job

  • Execute the MapReduce job (the output directory, ~/mapreduce_output_sales, must not already exist; Hadoop creates it and will fail rather than overwrite it):

    [flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar ~/inputMapReduce ~/mapreduce_output_sales
    
  • View the job results; each line of the output holds a region and its total sales count, separated by a tab:

    [flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -cat ~/mapreduce_output_sales/part-00000 | more