Hadoop Introduction
The Hadoop framework runs applications in an environment that provides distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
Hadoop has two major layers, namely:
(a) Processing/Computation layer (MapReduce), and
(b) Storage layer (Hadoop Distributed File System).
MapReduce
MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
MapReduce programs run on Hadoop, which is an Apache open-source framework.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
It provides high throughput access to application data and is suitable for applications having large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also includes the following two modules:
1. Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.
2. Hadoop YARN: This is a framework for job scheduling and cluster resource management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So the first motivation behind Hadoop is that it runs across clustered, low-cost machines. Hadoop runs code across such a cluster of computers.
This process includes the following core tasks that Hadoop performs:
1. Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
2. These files are then distributed across various cluster nodes for further processing.
3. HDFS, being on top of the local file system, supervises the processing.
4. Blocks are replicated for handling hardware failure.
5. Checking that the code was executed successfully.
6. Performing the sort that takes place between the map and reduce stages.
7. Sending the sorted data to a certain computer.
8. Writing the debugging logs for each job.
Advantages of Hadoop
1. The Hadoop framework allows the user to quickly write and test distributed systems.
2. It is efficient: it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
3. Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
4. Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
5. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java-based.
Installing Hadoop
Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system to set up the Hadoop environment. If you have an OS other than Linux, you can install VirtualBox and run Linux inside it.
Pre-installation Setup
Before installing Hadoop in the Linux environment, we need to set up Linux using SSH (Secure Shell).
Follow the steps given below for setting up the Linux environment.
Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands:
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
Modes of Hadoop Operation
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:
1. Local/Standalone Mode: After downloading Hadoop, it is configured in standalone mode by default and can be run as a single Java process.
2. Pseudo-Distributed Mode: This is a distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process. This mode is useful for development.
3. Fully Distributed Mode: This mode is fully distributed, with a minimum of two machines forming a cluster. We will cover this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode
Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Before proceeding further, you need to make sure that Hadoop is working fine.
Just issue the following command:
$ hadoop version
If everything is fine with your setup, then you should see the following result:
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This means your Hadoop standalone-mode setup is working fine. By default, Hadoop is configured to run in non-distributed mode on a single machine.
HDFS Overview
The Hadoop Distributed File System was developed using a distributed file system design. It runs on commodity hardware.
Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines.
These files are stored in a redundant fashion to protect the system from possible data loss in case of failure.
HDFS also makes applications available for parallel processing.
Features of HDFS
HDFS is suitable for distributed storage and processing, and it offers the following features:
1. Hadoop provides a command interface to interact with HDFS.
2. The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
3. It provides streaming access to file system data.
4. HDFS provides file permissions and authentication.
HDFS follows a master-slave architecture and has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can run on commodity hardware. The system hosting the namenode acts as the master server, and it performs the following tasks:
1. Manages the file system namespace.
2. Regulates client’s access to files.
3. Executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware with the GNU/Linux operating system and the datanode software.
For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
1. Datanodes perform read-write operations on the file systems, as per client request.
2. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored on individual datanodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x, but it can be changed as needed in the HDFS configuration.
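As an illustration, a client can choose the block size for the files it writes through the HDFS Java API. The following is a minimal sketch, assuming the Hadoop 2.x client libraries are on the classpath and a configured HDFS is reachable; the class name and file path are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize sets the block size for files created by this client;
        // 128 * 1024 * 1024 bytes = 128 MB.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/sample.txt");   // illustrative path

        // Write a small file; HDFS splits larger files into blocks of the
        // configured size and replicates each block across datanodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());
        fs.close();
    }
}

On a real cluster, the same dfs.blocksize property can also be set cluster-wide in hdfs-site.xml.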
Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent.
Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications with huge datasets.
Hardware at data: A requested task can be done efficiently when the computation takes place near the data.
Especially where huge datasets are involved, this reduces network traffic and increases throughput.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce task takes the output from a Map as input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes.
Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
This simple scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm
1. Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
2. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage (a word-count sketch after this list illustrates these stages).
A) Map stage: The map or mapper’s job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
B) Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which will be stored in the HDFS.
3. During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
4. The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
5. Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
6. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
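To make the map and reduce stages concrete, here is a minimal sketch of the classic word-count job written against the Hadoop 2.x MapReduce API (org.apache.hadoop.mapreduce). It assumes the Hadoop client libraries are on the classpath; the class names and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words and emitted as (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: the framework has already shuffled and sorted the
    // intermediate pairs, so each word arrives with all of its counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job can be submitted with the hadoop jar command, passing an HDFS input directory and a not-yet-existing output directory as the two arguments.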
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
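For example, a custom key type only needs to implement WritableComparable so that the framework can serialize it and sort it between the map and reduce stages. The class below is a minimal illustrative sketch; the name and fields are not part of Hadoop.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTemperatureKey implements WritableComparable<YearTemperatureKey> {
    private int year;
    private int temperature;

    public YearTemperatureKey() { }                            // required no-arg constructor

    public YearTemperatureKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperatureKey other) {           // used for sorting keys
        int cmp = Integer.compare(year, other.year);
        return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
    }
}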
Terminology
1. PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
2. Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
3. NameNode - Node that manages the Hadoop Distributed File System (HDFS).
4. DataNode - Node where data is presented in advance before any processing takes place.
5. MasterNode - Node where JobTracker runs and which accepts job requests from clients.
6. SlaveNode - Node where the Map and Reduce programs run.
7. JobTracker - Schedules jobs and tracks the assigned jobs with the TaskTracker.
8. TaskTracker - Tracks the tasks and reports status to the JobTracker.
9. Job - A program that executes a Mapper and Reducer across a dataset.
10. Task - An execution of a Mapper or a Reducer on a slice of data.
11. Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Hadoop Distributions
Hadoop distributions aim to resolve version incompatibilities.
• A distribution vendor will:
– Integration-test a set of Hadoop products
– Package Hadoop products in various installation formats (Linux packages, tarballs, etc.)
• Distributions may provide additional scripts to execute Hadoop.
• Some vendors may choose to backport features and bug fixes made by Apache.
• Typically, vendors employ Hadoop committers, so the bugs they find make it into Apache’s repository.
Distribution Vendors
• Cloudera Distribution for Hadoop (CDH)
• MapR Distribution
• Hortonworks Data Platform (HDP)
• Apache BigTop Distribution
• Greenplum HD Data Computing Appliance
Cloudera Distribution for Hadoop (CDH)
• Cloudera has taken the lead in providing a Hadoop distribution – Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles.
• Most popular distribution – http://www.cloudera.com/hadoop – 100% open source
• Cloudera employs a large percentage of core Hadoop committers
• CDH is provided in various formats – Linux packages, virtual machine images, and tarballs
• Integrates the majority of popular Hadoop products – HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, ZooKeeper, Flume
• CDH4 is used in the examples here.
Supported Operating Systems
• Each distribution supports its own list of operating systems (OS).
• Common OS supported – Red Hat Enterprise – CentOS – Oracle Linux – Ubuntu – SUSE Linux Enterprise Server
• Please see the vendor’s documentation for supported OS and versions – Supported Operating Systems for CDH4: https://ccp.cloudera.com/display/CDH4DOC/Before+You+Install+CDH4+on+a+Cluster#BeforeYouInstallCDH4onaCluster-SupportedOperatingSystemsforCDH4
Thanks for reading my article. If you have any queries, comment below.