Journey of Hadoop

History of Hadoop:

At the outset of twenty-first century, somewhere 1999-2000, due to increasing popularity of XML and JAVA, internet was evolving faster than ever. This leads to the invention of Hadoop.

Requirement is mother of invention:

As the world wide web grew at dizzying pace, though current search engine technologies were working fine, a better open source search engine was the need of the hour to cater the future growth of the internet.

Project Nutch at Apache Open source organization was a realization of this growing need by Doug Cutting and Mike Cafarella.  Doug was Internet archive search director and Mike was a graduate student at University of Washington.

The hard work:

It took an year of work for both of them before it started working and hundreds of millions of pages had been indexed with the limitations that it could only run across a handful of machines. Further, it requires monitoring round the clock to avoid failure.

Google’s adventure:

While this duo was serious at their work, they get impressed by Google during 2003-2004 when Google come up with two Paper “Google File System (GFS) ” and “MapReduce” which led them add DFS and MapReduce in Nutch.  The best part was they could just set it running on between 20 and 40 machines. But still miles to go.

Reunion at Yahoo!!

Meanwhile, Yahoo!, equally impressed with GFS and MapReduce, want to build an open source technology using these two features.  Yahoo’ Eric Baldeschwieler (E14) hired Doug with many others and a new project got split off Nutch.  Doug Got responsibility for Apache Open Source liaison and E14 was looking engineers, clusters, users etc.

The Name – Hadoop:

This new project was named as Hadoop by Doug on his son’s yellow elephant toy which he used to call Hadoop

Finally with all great leadership and the resources mobilized by Yahoo!, Hadoop reached web scale in 2008.

The ability:

Since its inception, Hadoop has become one the most talked about technologies for its ability to store and process huge amount of structured and unstructured data quickly.

It enables distributed processing for big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks.

  1. Massive data storage
  2. Faster processing


Because of the limits of vertical scalability, Hadoop is designed to horizontally scale up from a single server to thousands of commodity machines and off-course with very high fault tolerance.

The Advantages:

It introduced new dimensions in large scale computing with all of its new features.  These can be listed down:

  1. Scalable: Any cluster of Hadoop can be expanded by just adding new server or commodity machine without even thinking of any complexity like reformat, move and with little administration.
  2. Fault Tolerant:  Data and application processing are protected against hardware failure.  When the cluster looses a node, the system automatically redirects the work of that particular node to another node so that computing does not fail.
  3. Cost Effective:  As it is open source, it is free.  Further, it brings massive parallel computing by just using commodity servers.  This not only reduces the cost of processing but also the cost to per terabyte storage of data.
  4. Storage Flexibility:  Unlike traditional relational database, Hadoop is schema-less and can absorb any type of data whether structured or unstructured from any number of sources.
    It also provides features to join and aggregate data from even different sources which make it a powerful analysis tool.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.