Apache Hadoop (CDH 5) - Tutorial I (Overview)
This tutorial shows how to use Hadoop with a CDH 5 cluster on EC2.
We have four EC2 instances: one for the NameNode and three for DataNodes.
To see how to set up CDH5 on EC2, please visit Apache Hadoop Install - CDH5.
Hadoop MapReduce is a software framework for writing applications that process vast amounts of data in parallel on large clusters, which can consist of thousands of commodity-hardware nodes, in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in parallel. The Hadoop MapReduce framework sorts the outputs of the maps, which are then used as input to the reduce tasks. Typically both the input and the output of the job are stored in a distributed filesystem.
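The split/map/sort/reduce pipeline described above can be sketched in plain Python. This is only a local simulation of the data flow for a word count, not the Hadoop framework itself; the chunk contents are made-up sample input:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in an input chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce task: sum all the counts emitted for one key."""
    return (word, sum(counts))

# The "input dataset" split into independent chunks, one per map task.
chunks = ["Hadoop MapReduce framework", "MapReduce sorts map outputs"]

# Run all map tasks, then sort by key (the framework's shuffle/sort step).
mapped = [kv for chunk in chunks for kv in map_phase(chunk)]
mapped.sort(key=itemgetter(0))

# Group the sorted pairs by key and run one reduce task per key.
result = dict(reduce_phase(k, (c for _, c in g))
              for k, g in groupby(mapped, key=itemgetter(0)))
print(result)  # {'framework': 1, 'hadoop': 1, 'map': 1, 'mapreduce': 2, 'outputs': 1, 'sorts': 1}
```

The sort between the map and reduce steps is what guarantees that each reduce task sees all values for its key together, which is the same contract Hadoop's shuffle phase provides.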
Typically, the MapReduce framework and the Hadoop Distributed File System (HDFS) are running on the same set of nodes. In other words, the compute nodes and the storage nodes are the same. Putting them into the same node allows the framework to effectively schedule tasks on the nodes where data is already present. This way we can get very high aggregate bandwidth across the cluster.
The MapReduce runtime framework consists of a single JobTracker and one TaskTracker per cluster node. The JobTracker is responsible for scheduling the jobs' component tasks on the TaskTracker nodes, monitoring the tasks, and re-executing failed tasks. The TaskTracker nodes execute the tasks as directed by the JobTracker.
The applications specify the input and output locations and supply map and reduce functions, which implement the appropriate interfaces and/or abstract classes. These locations, functions, and other job parameters comprise the job configuration. The Hadoop job client then submits the job (such as an executable) and the configuration to the JobTracker, which distributes the software and configuration to the TaskTracker nodes. The JobTracker is also responsible for scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
The Hadoop framework is implemented in Java. So, we may develop MapReduce applications in Java or any JVM-based language, or use one of the following interfaces:
- Hadoop Streaming - a utility that allows you to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes - a SWIG-compatible (not based on JNI) C++ API to implement MapReduce applications.
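For example, a Streaming-style word count can be written as two small filters that read lines on stdin and write tab-separated key/value lines on stdout. This is a hedged local sketch: with real Hadoop Streaming, the mapper and reducer would be separate executables passed on the command line, and the framework (not Python) would perform the sort between them:

```python
import io

def mapper(stdin, stdout):
    """Streaming mapper: emit one "word<TAB>1" line per input word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    """Streaming reducer: input arrives sorted by key, so counts can be
    accumulated in a single pass and flushed on each key change."""
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Local dry run emulating: cat input | mapper | sort | reducer
mapped = io.StringIO()
mapper(io.StringIO("the quick fox\nthe lazy dog\n"), mapped)
shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
reducer(shuffled, reduced)
print(reduced.getvalue())  # dog 1, fox 1, lazy 1, quick 1, the 2 (tab-separated)
```

Because Streaming only requires reading stdin and writing stdout, the same two programs could be shell utilities, Perl scripts, or compiled binaries.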
In this tutorial, we'll use the org.apache.hadoop.mapred Java API. Note that there is a newer Java API, org.apache.hadoop.mapreduce.