MapReduce:Simplified Data Processing On Large Clusters
Posted 涛涌四海
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了MapReduce:Simplified Data Processing On Large Clusters相关的知识,希望对你有一定的参考价值。
MapReduce:Simplified Data Processing On Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com , sanjay@google.com
google,Inc.
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function the processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the rogram's execution across a set of machines, handling machine failures, and managing the reuired inter-machine communication. This allows programmers without any experience with parallel and dis tributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce comutation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
1 Introduction
Over the past five years, the authors and many other at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as arawled documents, web request logs, etc., to compute various kinds of derived data,such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straight forward. However, the input data is usually large and the commutations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to paralleize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of comlex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other funtional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with users-pecified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
The mejor contibutions of this work are a simple and powerful interface that enables automatic parallelization and dis tribution of large-scale computations, combined with an implementation fo this interface that achieves high performance on large clusters of commodity PCs.
Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has preformance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.
2 Programming Model
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.This allows us to handle lists of values that are too large to _t in memory.
2.1 Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
map(String key, String value):
// key: document name
// value: document contents
for each word w in value:
EmitIntermediate(w, "1");
reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
The map function emits each word plus an associated count of occurrences (just `1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to _ll in a mapreduce speci_cation object with the names of the input and output _les, and optional tuning parameters. The user then invokes the MapReduce function, passing it the speci_-cation object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
2.2 Types
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:
map (k1,v1) -> list(k2,v2)
reduce (k2,list(v2)) -> list(v2)
I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
2.3 More Examples
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a < URL, total count> pair.
ReverseWeb-Link Graph: The map function outputs <target, source> pairs for each link to a target
URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>
Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.
Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record, and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
3 Implementation
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.
This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:
(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
3.1 Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.
Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):
1. The MapReduce library in the user program first splits the input files into M pieces of typically 16
megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special - the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses
key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls
to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. After successful completion, the output of the mapreduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output
files into one file - they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
译文:
摘 要
Mapreduce 是一个能够处理和产生大量数据的级联系统和编程模型。用户编制的map函数处理一个key/value对,并产生一系列的中间key/value对,reduce函数汇聚所有具有相同key的中间value值。很多现实的任务就是用这种方式解决的,这也是本文所要展示的。
以这种模型编写的程序,于普通机器构成的大型集群系统中协调执行。分时系统负责输入数据的分割,安排这些程序平衡执行于这些设备,协调处理失误,和负责必需的中间设备的通信。这就允许程序员在没有任何并行和分布式系统经验的情况下轻松地利用大型的分布式系统。
我们MapReduce成果运行于由普通机器构建的大系统中,并且正快速扩展:一个典型的MapReduce计算系统在数以千计的设备系统上处理数T的数据。程序员发现这系统很容易处理:百计的MapReduce程序系统已经实现,并且每天就有有一千个以上MapReduce任务在Google的分布式系统上执行。
1 概述
在过去的五年,作者和很多其他的Google公司同事完成了百计的处理大量数据,具有特殊目的的计算,如爬行文本,web的访问日志,等等,计算统计各种结论数据,例如反向索引,web文档的各种图形结构表现,每个主机网络爬虫文档数量汇总,每天被请求数量最多查询的集合,等等。通常这些计算不复杂,然而输入数据量巨大,要想在能够接受的合理时间内完成,只有将计算分布运行于成百上千的设备上。如何进行并行计算,如何分发数据,如何处理错误?因需要大量代码解决这些问题使得原本简单的计算变得复杂。
为了解决上述复杂的问题,我们设计一个新的抽象模型,使用这个抽象模型,我们只要表述我们想要执行的简单运算即可,而不必关心并行计算、容错、数据分布、负载均衡等复杂的细节,这些问题都被封装在了一个库里面。设计这个抽象模型的灵感来自Lisp和许多其他函数式语言的Map和Reduce的原语。我们意识到大多数的运算都包含这样的操作:在输入数据的“逻辑”记录上应用Map操作得出一个中间key/value 对集合,然后在所有具有相同key值的value值上应用Reduce操作,从而达到合并中间的数据,得到一个想要的结果的目的。使用MapReduce模型,再结合用户实现的Map和Reduce函数,我们就可以非常容易的实现大规模并行化计算;通过MapReduce模型自带的“再次执行”功能,也提供了初级的容灾实现方案。
这个工作(实现一个MapReduce框架模型)的主要贡献是通过简单的接口来实现自动的并行化和大规模的分布式计算,通过使用MapReduce模型接口实现在大量普通的PC机上高性能计算。
第二部分描述基本的编程模型和一些使用案例。第三部分描述了一个经过裁剪的、适合我们的基于集群的计算环境的MapReduce实现。第四部分描述我们认为在MapReduce编程模型中一些实用的技巧。第五部分对于各种不同的任务,测量我们MapReduce实现的性能。第六部分揭示了在Google内部如何使用MapReduce作为基础重写我们的索引系统产品,包括其它一些使用MapReduce的经验。第七部分讨论相关的和未来的工作。
2 编程模型
以上是关于MapReduce:Simplified Data Processing On Large Clusters的主要内容,如果未能解决你的问题,请参考以下文章
MapReduce:Simplified Data Processing On Large Clusters
paper学习MapReduce:SimplIFied Data Processing on Large Clusters