Wednesday, September 26, 2007

What the Hadoop is going on?

I recently started researching Hadoop, which is a framework for running applications with very large data sets and/or heavy computational requirements across large clusters of commodity hardware. Large data sets in this case range from terabytes to petabytes, and the clusters can run to thousands of machines! :) If you are wondering how you could get over 1,000 PCs in one location, have a peek at the pic below:



Commodity hardware means your average PC and the related hardware made for consumers. So basically these types of frameworks give you massive processing capability by networking heaps of regular PCs.

Hadoop is an implementation of MapReduce. MapReduce is a technique for breaking a very large processing job into smaller pieces (the maps) that can be executed in parallel across a large number of machines. The intermediate results produced by the maps are then combined and summarised in a second step, the reduce.
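
To make that a bit more concrete, here is a tiny, framework-free sketch of the two phases using the classic word-count example. This is plain Python purely to illustrate the idea; it is not the Hadoop API (real Hadoop jobs are normally written in Java), and the input lines are just made up for the demo.

    # Minimal word-count sketch of the MapReduce idea (not the Hadoop API).
    from collections import defaultdict

    def map_phase(line):
        """Map: turn one input line into intermediate (key, value) pairs."""
        return [(word.lower(), 1) for word in line.split()]

    def reduce_phase(key, values):
        """Reduce: combine all the values that share the same key."""
        return key, sum(values)

    documents = [
        "Hadoop runs MapReduce jobs",
        "MapReduce splits work into maps",
    ]

    # "Shuffle": group the intermediate pairs emitted by every map by key.
    grouped = defaultdict(list)
    for line in documents:
        for key, value in map_phase(line):
            grouped[key].append(value)

    # Each reduce call is independent, so reduces (like maps) can run in parallel.
    word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(word_counts)  # e.g. {'hadoop': 1, 'mapreduce': 2, ...}

In a real cluster the maps run on whichever machines hold the input data, the framework does the shuffle over the network, and the reduces run in parallel too; the little loop above just shows the shape of the computation.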

Hadoop is modelled on Google's MapReduce implementation. Google's MapReduce runs on top of GFS (the Google File System), which is a distributed file system. Hadoop likewise pairs its own MapReduce implementation with HDFS (the Hadoop Distributed File System), which it runs over. The distributed file system is what provides the scalability, performance and fault tolerance.
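
Just to illustrate the basic idea behind a distributed file system, here is a toy Python sketch of what HDFS and GFS do conceptually: chop a file into fixed-size blocks and store each block on several machines, so losing one node doesn't lose the data. The block size, node names and round-robin placement below are made up for the demo; real HDFS uses much larger blocks (around 64 MB) and smarter, rack-aware placement decided by the namenode.

    # Toy sketch of block splitting and replication (illustration only).
    import itertools

    BLOCK_SIZE = 4    # bytes per block; tiny here, HDFS uses ~64 MB blocks
    REPLICATION = 3   # copies of each block, like HDFS's default of 3
    NODES = ["node1", "node2", "node3", "node4", "node5"]

    def split_into_blocks(data, block_size):
        """Chop the file into fixed-size blocks."""
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_blocks(blocks, nodes, replication):
        """Assign each block to `replication` nodes (simple round-robin here)."""
        placement = {}
        node_cycle = itertools.cycle(nodes)
        for idx in range(len(blocks)):
            placement[idx] = [next(node_cycle) for _ in range(replication)]
        return placement

    data = b"hello distributed file systems"
    blocks = split_into_blocks(data, BLOCK_SIZE)
    print(place_blocks(blocks, NODES, REPLICATION))

Because every block lives on several machines, a reader can fetch blocks from many nodes in parallel (performance), more machines simply mean more blocks can be stored (scalability), and a dead node can be rebuilt from the surviving copies (fault tolerance).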

If you want to give Hadoop a go, there is an excellent tutorial on how to set up single-node and multi-node Hadoop clusters on Ubuntu.
