Back in 2003, Google was struggling to index the World Wide Web as it reached the limits of RDBMS technology. This struggle led to a new approach to storing and analysing large quantities of data in the form of the Google File System. Further research led to the MapReduce architecture for processing large files, published in 2004.

Doug Cutting, a developer who later joined Yahoo, was building the open-source search engine Nutch and facing the same problems. Based on Google's research, he started an open-source project named Hadoop (named after his son’s toy elephant).

To process large amounts of data, we need to scale our data storage. There are 2 options available: vertical scaling (adding more power to a single machine) and horizontal scaling (adding more machines).

Since commodity hardware is cheap, horizontal scaling is the way to go for Hadoop.

Hadoop is considered the Swiss Army Knife of the 21st century and comes with a complete ecosystem, a kind of Hadoop App Store. Let’s check out a few of the apps provided by Hadoop.

I want to store my data? Hadoop Distributed File System (HDFS)

  • Distributed File System for redundant storage.
  • Designed to store data on commodity hardware.
  • Built to expect hardware failures.
HDFS Architecture
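
To make the idea of redundant storage concrete, here is a minimal sketch in plain Python (not the HDFS API). It assumes the HDFS defaults of 128 MB blocks and 3 replicas per block. The helper names `place_blocks` and `readable_blocks` and the round-robin placement are hypothetical simplifications; real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default HDFS block size in Hadoop 2 and later
REPLICATION = 3                 # default HDFS replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Hypothetical round-robin placement, for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(1024 * 1024 * 1024, nodes)  # a 1 GB file becomes 8 blocks

# Even with two of the four machines down, every block is still readable.
survivors = readable_blocks(placement, {"node1", "node2"})
print(len(placement), len(survivors))
```

Because every block lives on three different machines, losing any two of them still leaves one copy of each block, which is what "built to expect hardware failures" means in practice.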

I love programming? MapReduce
Programming model for distributed computations at a massive scale.

  1. Iterate over a large number of records.
  2. Create something of interest from each.
  3. Shuffle and sort intermediate results.
  4. Aggregate intermediate results.
  5. Generate final output.

Roughly, the first 3 steps belong to the map phase and the last 2 to the reduce phase, with the framework itself handling the shuffle and sort between the two. You can add additional steps in between if needed.

WordCount Using MapReduce
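
The five steps can be sketched as a single-process simulation in plain Python. This is not Hadoop's Java API, only an illustration of the data flow. The function names `map_phase`, `shuffle_and_sort` and `reduce_phase` are hypothetical, and the shuffle is written as its own function purely for readability.

```python
from collections import defaultdict


def map_phase(records):
    """Iterate over the records and emit a (word, 1) pair for every word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Group all values by key and return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Aggregate each group of counts and build the final output."""
    return {word: sum(counts) for word, counts in grouped}


lines = ["deer bear river", "car car river", "deer car bear"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

On a real cluster the map and reduce functions run in parallel on many machines, and the framework moves the intermediate pairs between them.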

I am a scripting guy? Apache Pig
A high-level data flow language. You write commands much as in shell scripting, and a compiler translates the Pig Latin script into MapReduce jobs.

Apache Pig WorkFlow

I am more of an SQL person? Apache Hive
Allows analysis and queries using an SQL-like language (HiveQL).

Apache Hive Architecture

A large number of other apps are available in the Hadoop ecosystem, serving different purposes.

Hadoop Ecosystem

That’s a short explanation of Hadoop and the types of functionality available in it. As commodity hardware gets cheaper, Hadoop is becoming more popular, even at the enterprise level.