
What is Apache Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.




Before jumping into Apache Hadoop, read about Big Data first.

So let's talk about the history of Apache Hadoop.



  • Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
  • It was originally developed to support distribution for the Nutch search engine project.
  • Doug, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant.
  • Cutting's son was 2 years old at the time and just beginning to talk.
  • He called his beloved stuffed yellow elephant "Hadoop". Now, Doug's son often exclaims, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this!"



Hadoop Architecture


At its core, Hadoop has two major layers:

  1. Processing/Computation layer (MapReduce), and
  2. Storage layer (Hadoop Distributed File System-HDFS).


MapReduce

In simple terms, MapReduce means "map the data, then reduce it."
It has three stages:
  • Map stage
  • Shuffle stage
  • Reduce stage


Map stage - The map or mapper’s job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes each line and emits intermediate key-value pairs.

Shuffle stage - The framework sorts the mapper output and groups all values that belong to the same key, so each reducer receives a key together with its full list of values.

Reduce stage - The Reducer’s job is to process the grouped data that comes out of the shuffle. After processing, it produces a new, smaller set of output, which is stored back in HDFS.

Example of MapReduce 
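A minimal sketch of the classic word-count job (closely following the WordCount example from the official Hadoop MapReduce tutorial) shows all three stages at work: the mapper emits (word, 1) pairs for every word in its input, the shuffle groups the pairs by word, and the reducer sums the counts for each word. The input and output HDFS paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: split each input line into words and emit (word, 1) for every word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);       // e.g. ("hadoop", 1)
      }
    }
  }

  // Reduce stage: the shuffle has already grouped all counts for the same word,
  // so the reducer only has to add them up.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);       // e.g. ("hadoop", 42)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}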




HDFS


HDFS (Hadoop Distributed File System) is Hadoop's storage layer. It is suitable for distributed storage and processing, and it has a few notable features:

  • Hadoop provides a command interface to interact with HDFS.
  • The built-in web servers of the NameNode and DataNodes help users easily check the status of the cluster.
  • It gives streaming access to file system data.
  • HDFS provides file permissions and authentication.
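Besides the command interface, HDFS can also be used from Java. Below is a minimal sketch using the FileSystem API: it writes a small file, streams it back to the console, and prints its permissions and owner. The path /user/demo/hello.txt and the assumption that a core-site.xml on the classpath points at the cluster are for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath (assumed to point at the cluster).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path for illustration

    // Write a small file; the data is streamed out to the DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("Hello, HDFS!\n");
    }

    // Streaming read: copy the file contents to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    // File permissions and ownership work much like on a local file system.
    FileStatus status = fs.getFileStatus(file);
    System.out.println(status.getPermission() + " " + status.getOwner());
  }
}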





Working of HDFS 
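When a client writes a file, HDFS splits it into blocks and replicates each block across several DataNodes, while the NameNode keeps the metadata about where every block lives. As a rough illustration (the file path below is a hypothetical example), the FileSystem API can ask for those block locations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big-input.txt");   // hypothetical file

    // The NameNode tracks which blocks make up the file and
    // which DataNodes hold a replica of each block.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}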







For more info, you can read the official Apache Hadoop site:

https://hadoop.apache.org


