Hadoop
About
Hadoop is a scalable, distributed computing solution from Apache. Similar to a queuing system, Hadoop distributes the processing of large data sets across many nodes.
Workflow
Downloading the Hadoop Job
These steps set up the Hadoop environment and download a spreadsheet of sales data, which Hadoop will sort into sales units per region.
Download and source Hadoop environment variables:
[flight@gateway1 (scooby) ~]$ wget https://tinyurl.com/hadoopenv
[flight@gateway1 (scooby) ~]$ source hadoopenv
Create job directory:
[flight@gateway1 (scooby) ~]$ mkdir MapReduceTutorial
[flight@gateway1 (scooby) ~]$ chmod 777 MapReduceTutorial
Download job data:
[flight@gateway1 (scooby) ~]$ cd MapReduceTutorial
[flight@gateway1 (scooby) MapReduceTutorial]$ wget -O hdfiles.zip https://tinyurl.com/hdinput1
[flight@gateway1 (scooby) MapReduceTutorial]$ unzip -j hdfiles.zip
Check that job data files are present:
[flight@gateway1 (scooby) MapReduceTutorial]$ ls
desktop.ini  hdfiles.zip  SalesCountryDriver.java  SalesCountryReducer.java  SalesJan2009.csv  SalesMapper.java
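The three .java files make up the MapReduce job: SalesMapper emits a (country, 1) pair for every record in SalesJan2009.csv, SalesCountryReducer totals those pairs per country, and SalesCountryDriver configures and submits the job. As a rough sketch of the mapper's shape (the CSV field index and the classic mapred API used here are assumptions, not the verbatim downloaded source):

package SalesCountry;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SalesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);

    // Each input value is one CSV line; the country is assumed to sit in
    // field 7 (a naive comma split, taken as adequate for this data set).
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(",");
        output.collect(new Text(fields[7]), one);
    }
}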
Preparing the Hadoop Job
Compile the Java source files for the job:
[flight@gateway1 (scooby) MapReduceTutorial]$ javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java
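For reference, the reducer's role is to sum the 1s the mapper emits for each country. A minimal sketch, again assuming the classic mapred API rather than quoting the downloaded source:

package SalesCountry;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SalesCountryReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // All values for one country arrive together; add them up to get
    // that country's total number of sales.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int total = 0;
        while (values.hasNext()) {
            total += values.next().get();
        }
        output.collect(key, new IntWritable(total));
    }
}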
Create a manifest file naming the job's entry-point class:
[flight@gateway1 (scooby) MapReduceTutorial]$ echo "Main-Class: SalesCountry.SalesCountryDriver" >> Manifest.txt
Package the compiled classes and manifest into the final jar file for the job:
[flight@gateway1 (scooby) MapReduceTutorial]$ jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class
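The Main-Class recorded in the manifest, SalesCountry.SalesCountryDriver, ties the mapper and reducer together and reads the input and output paths from the command line. A sketch of a typical driver for this job (assumed structure, not the verbatim source):

package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SalesCountryDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SalesCountryDriver.class);
        conf.setJobName("SalePerCountry");

        // Key/value types produced by the mapper and reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(SalesMapper.class);
        conf.setReducerClass(SalesCountryReducer.class);

        // The two positional arguments are the input and output directories.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

The two positional arguments correspond to the input and output directories passed when the job is executed below, and the default TextOutputFormat writes each country and its count as a tab-separated line, which is what appears in the part-00000 results file at the end.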
Starting the Hadoop Environment
Start the Hadoop distributed file system service:
[flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-dfs.sh
Start the YARN resource manager and node manager services:
[flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-yarn.sh
Create a directory for the input data and copy the sales data into it:
[flight@gateway1 (scooby) MapReduceTutorial]$ mkdir ~/inputMapReduce
[flight@gateway1 (scooby) MapReduceTutorial]$ cp SalesJan2009.csv ~/inputMapReduce/
Check that the data is visible to the distributed file system:
[flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -ls ~/inputMapReduce
Running the Hadoop Job
Execute the MapReduce job:
[flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar ~/inputMapReduce ~/mapreduce_output_sales
View the job results:
[flight@gateway1 (scooby) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -cat ~/mapreduce_output_sales/part-00000 | more