
What is Apache Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.




Before jumping into Apache Hadoop, read about Big Data first.

So let's talk about the history of Apache Hadoop.



  • Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
  • It was originally developed to support distribution for the Nutch search engine project.
  • Doug, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant.
  • Cutting's son was 2 years old at the time and just beginning to talk.
  • He called his beloved stuffed yellow elephant "Hadoop". Now, Doug's son often exclaims, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this!"



Hadoop Architecture


At its core, Hadoop has two major layers:

  1. Processing/Computation layer (MapReduce), and
  2. Storage layer (Hadoop Distributed File System-HDFS).


MapReduce

In simple terms, MapReduce means "map the data, then reduce it."
It has three stages:
  • Map stage
  • Shuffle stage
  • Reduce stage


Map stage - The map or mapper’s job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes each line and emits intermediate key-value pairs.

Shuffle stage - The framework sorts the mapper output and groups all values that belong to the same key, so each reducer receives a key together with its full list of values.

Reduce stage - The Reducer’s job is to process the grouped data that comes out of the shuffle. After processing, it produces a new, smaller set of output, which is stored back in HDFS.

Example of MapReduce 
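A minimal sketch of the classic word-count job (closely following the WordCount example from the official Hadoop MapReduce tutorial) shows all three stages at work: the mapper emits (word, 1) pairs for every word in its input, the shuffle groups the pairs by word, and the reducer sums the counts for each word. The input and output HDFS paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: split each input line into words and emit (word, 1) for every word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);       // e.g. ("hadoop", 1)
      }
    }
  }

  // Reduce stage: the shuffle has already grouped all counts for the same word,
  // so the reducer only has to add them up.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);       // e.g. ("hadoop", 42)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}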




HDFS


HDFS (Hadoop Distributed File System) is Hadoop's storage layer. It is suitable for distributed storage and processing, and it has a few notable features:

  • Hadoop provides a command interface to interact with HDFS.
  • The built-in web servers of the NameNode and DataNodes help users easily check the status of the cluster.
  • It gives streaming access to file system data.
  • HDFS provides file permissions and authentication.
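Besides the command interface, HDFS can also be used from Java. Below is a minimal sketch using the FileSystem API: it writes a small file, streams it back to the console, and prints its permissions and owner. The path /user/demo/hello.txt and the assumption that a core-site.xml on the classpath points at the cluster are for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath (assumed to point at the cluster).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path for illustration

    // Write a small file; the data is streamed out to the DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("Hello, HDFS!\n");
    }

    // Streaming read: copy the file contents to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    // File permissions and ownership work much like on a local file system.
    FileStatus status = fs.getFileStatus(file);
    System.out.println(status.getPermission() + " " + status.getOwner());
  }
}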





Working of HDFS 
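When a client writes a file, HDFS splits it into blocks and replicates each block across several DataNodes, while the NameNode keeps the metadata about where every block lives. As a rough illustration (the file path below is a hypothetical example), the FileSystem API can ask for those block locations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big-input.txt");   // hypothetical file

    // The NameNode tracks which blocks make up the file and
    // which DataNodes hold a replica of each block.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}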







For more info, you can read the official Apache Hadoop site:

https://hadoop.apache.org


