Friday, April 19, 2013

What is MapReduce?

Its a framework to write programs that allow to process massive amount of unstructured data in parallel across a distributed cluster of computers. It achieves this using 2 functions: 

1.  Map: This function routes chunk of processing job across various nodes in the cluster. 
2.  Reduce: This function collates the work from various nodes, and resolves the results into a single value. 


This framework is said to be fault tolerant. How it achieves it, is by listening to various nodes to reply from time to time.. If a certain node does not reply in some amount of time, the node is thought to be dead. The work that was assigned to that node, is then reassigned to a different node. This way the detection and repair of any failures on the nodes is done at the application side. 

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. MapReduce is important because it allows ordinary developers to use MapReduce library routines to create parallel programs without having to worry about programming for intra-cluster communication, task monitoring or failure handling. It is useful for tasks such as data mining, log file analysis, financial analysis and scientific simulations.

There are various implementations of MapReduce in market today. Hadoop being one of the leading MapReduce implementation in market. The Hadoop project provides end to end Big Data Services.

No comments:

Post a Comment