
MapReduce: Exploiting Data Parallelism 
and Distributed Computation
MapReduce is a data-processing infrastructure¹⁹ developed internally by Google and later popularized in the Hadoop open-source version.²⁰
MapReduce is targeted at large-scale data-parallel workloads in which the input is represented as a set of key-value pairs and computation is expressed as a sequence of two user-provided functions: map and reduce. The map function processes the input to produce a set of intermediate key-value pairs. Intermediate pairs with the same key are aggregated by the system and fed to the reduce function, which produces the final output.
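As a concrete illustration of the model (the classic word-count example, a sketch not drawn from the report itself), the two user-provided functions might look like this in Python:

    def map_fn(doc_id, text):
        """Map: for each word in a document, emit an intermediate (word, 1) pair."""
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        """Reduce: sum every count the system has aggregated under the same word,
        producing one (word, total) pair of final output."""
        yield (word, sum(counts))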
What makes MapReduce particularly compelling is that it frees the programmer from the need to worry about much of the complexity required to run large-scale computations. The programmer needs only to produce the body of the two MapReduce methods, and the system takes care of parallelization, data distribution, fault tolerance, and load balancing.
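A minimal sequential sketch of what the system does on the programmer's behalf (an illustration only; real implementations such as Hadoop distribute this work across many machines and add fault tolerance and load balancing) makes that division of labor clear:

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Apply the user's map function to every input record.
        intermediate = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                # Group intermediate values by key (the "shuffle" the system performs).
                intermediate[ikey].append(ivalue)
        # Apply the user's reduce function once per intermediate key.
        output = []
        for ikey, ivalues in intermediate.items():
            output.extend(reduce_fn(ikey, ivalues))
        return output

    # Word count over two toy documents, using map_fn and reduce_fn from above.
    docs = [("doc1", "the cat sat"), ("doc2", "the cat ran")]
    print(sorted(run_mapreduce(docs, map_fn, reduce_fn)))
    # [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]

Nothing in the user-supplied functions mentions machines, threads, or failures; in a production system the few framework lines above are replaced by machinery for partitioning inputs, scheduling tasks across a cluster, and re-executing failed tasks.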
The class of problems that can be mapped to this relatively simple processing framework is surprisingly broad. MapReduce was conceived to simplify the data processing involved in creating a Web index from a set of crawled documents, but developers have also used it for large-scale graph algorithms (for example, finding citation occurrences among scientific articles), computing query-term popularity (for example, Google Trends²¹), creating language models for statistical machine translation (for example, finding and keeping a count for every unique sequence of five or more terms in a large corpus), and other applications. Within Google itself, MapReduce’s popularity has exploded since its introduction.
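To suggest how little changes between these applications, the language-model use mentioned above fits the same two-function mold; a hypothetical map function (illustrative only, not taken from the cited systems) could emit five-term sequences and reuse the summing reducer from the word-count sketch:

    def map_5grams(doc_id, text):
        """Map: emit (five-term sequence, 1) for every run of five consecutive
        terms in a document; reduce_fn from the earlier sketch then totals them."""
        terms = text.split()
        for i in range(len(terms) - 4):
            yield (" ".join(terms[i:i + 5]), 1)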
MapReduce has also been found to be useful for systems much smaller than the large distributed clusters for which it was designed. Ranger et al. examined the adequacy of the MapReduce framework for multicore and multiprocessor systems²² and found that it was equally compelling as a programming system for this class of machines. In a set of eight applications both coded with a version of MapReduce (the Phoenix runtime) and hand-coded directly, the authors found that MapReduce-coded ver-
19. See Jeffrey Dean and Sanjay Ghemawat, 2008, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51(1): 107-113, and Michael Noth, 2006, Building a computing system for the world’s information, invited talk, University of Iowa, IT Tech Forum Presentations, April 20, 2006.
20. See The Apache Hadoop project, available at http://hadoop.apache.org.
21. See Google Trends, available at http://trends.google.com.
22. Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, 2007, Evaluating MapReduce for multi-core and multiprocessor systems, Proceedings of the IEEE 13th International Symposium on High-Performance Computer Architecture, Phoenix, Ariz., February 10-14, 2007.