Tuesday, February 5, 2013

Using MapReduce to Process Large Data Sets

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
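As a rough sketch of what one worker's Map task looks like, here is a word-count mapper in Python. The function name and signature are illustrative, not from any particular MapReduce library:

```python
def map_words(document_id, text):
    """Map step sketch: for each word in the input split,
    emit an intermediate (word, 1) pair."""
    for word in text.split():
        yield (word.lower(), 1)

# One worker processing one sub-problem (a single document):
pairs = list(map_words("doc1", "Apple apple banana"))
# Each occurrence of a word yields its own (word, 1) pair.
```

The key point is that a single Map invocation may emit many intermediate pairs, which the framework later groups by key before the Reduce step.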

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
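The two steps can be sketched end to end with a tiny single-machine driver. This is only a model of the data flow (all names here are illustrative); a real framework would distribute the map and reduce calls across the cluster:

```python
from collections import defaultdict

def mapper(doc_id, text):
    """Map step: emit a (word, 1) pair per word."""
    for word in text.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce step: combine all partial counts for one key."""
    return (word, sum(counts))

def run_mapreduce(documents, mapper, reducer):
    """Toy driver: run map over every input, group the
    intermediate pairs by key (the 'shuffle'), then run
    reduce once per key."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

result = run_mapreduce({"d1": "a b a", "d2": "b"}, mapper, reducer)
# result maps each word to its total count across all documents
```

The grouping-by-key stage between Map and Reduce (the shuffle) is what lets the master combine sub-answers correctly: every value for a given key ends up at the same reduce call.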

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one input (key, value) pair and emits a list of intermediate (key, value) pairs; the framework groups those pairs by key, and Reduce takes each intermediate key together with the list of all values for that key and emits a list of output values. This behavior differs from the map and reduce combinators of typical functional programming, which operate on a single flat list.
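The contrast can be made concrete in a few lines of Python. The functional-programming combinators fold one flat list to one value, whereas MapReduce-style reduction runs once per key over grouped intermediate pairs (the grouping code below is a stand-in for the framework's shuffle):

```python
from functools import reduce
from itertools import groupby

# Functional programming: map over a flat list, reduce to a single value.
total = reduce(lambda a, b: a + b, map(len, ["ab", "cde"]))

# MapReduce style: intermediate (key, value) pairs are grouped by key,
# and reduction happens independently within each group.
pairs = [("a", 1), ("b", 1), ("a", 1)]
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(pairs), key=lambda kv: kv[0])}
per_key = {k: sum(vs) for k, vs in grouped.items()}
```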
