Map Reduce

Posted by nativestack


In this post, I will explain the following important MapReduce (MR) concepts.

1) Job: how a job is initialized and executed.

2) MR components: how they work together to process data.

3) Data flow: a word-count example to illustrate point 2.

4) Shuffle: the whole picture.

Job

[figure: job submission and initialization]

The job submitter computes the input splits and submits the job.

The job scheduler initializes the job object, creating map tasks, reduce tasks, a job-setup task, and a job-cleanup task.

For each input split, one map task is created. By default a split is one HDFS block, but a split can be configured to span multiple blocks.

The number of reduce tasks is configured in the MR program.

The JobTracker assigns these tasks to TaskTrackers based on data-locality optimization and on how much load each TaskTracker is under. Each TaskTracker sends a heartbeat (HB) to the JobTracker; the heartbeat includes the resources available (CPU, memory, etc.). It also periodically reports its progress to the JobTracker, which forwards it to the application so the job's progress is known.

The TaskTracker runs each task in a separate JVM, so that a faulty application cannot bring down the TaskTracker itself.

The job-setup task runs first to create the output folder; then the map tasks run, then the reduce tasks, and finally the job-cleanup task. The job succeeds only once the job-cleanup task finishes without problems. Until then, all output is stored in a temporary folder; the cleanup task copies it to the working folder and marks the output directory with a _SUCCESS file. Reduce tasks learn the locations of the map outputs from the JobTracker, which assigned the map tasks. When the reduce tasks start can be configured based on what percentage of the map tasks have completed.
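The commit sequence above (write everything to a temporary folder, publish it, then drop a success marker) can be sketched in a few lines. This is a hedged simulation, not Hadoop's actual output committer; the directory and file names (`part-00000`, `_SUCCESS`) follow Hadoop's conventions, but `run_job` itself is hypothetical:

```python
import os
import shutil
import tempfile

def run_job(records, out_dir):
    """Sketch of the setup -> tasks -> cleanup sequence described above."""
    tmp_dir = tempfile.mkdtemp()               # job-setup task: temporary output folder
    with open(os.path.join(tmp_dir, "part-00000"), "w") as f:
        for rec in records:                    # map/reduce tasks write here
            f.write(rec + "\n")
    # job-cleanup task: publish the temp output to the working folder,
    # then mark the job as successful with a _SUCCESS file.
    shutil.copytree(tmp_dir, out_dir)
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()
    shutil.rmtree(tmp_dir)

out = os.path.join(tempfile.mkdtemp(), "wordcount-output")
run_job(["hello\t2", "world\t1"], out)
print(os.path.exists(os.path.join(out, "_SUCCESS")))  # True
```

The point of the temporary folder is atomicity: a consumer that checks for `_SUCCESS` never sees a half-written result.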

Map output is written to the local file system; reduce output is written to HDFS (because we want it to be reliable).

[figure: job execution overview]

Components

The figure below clearly describes the data flow among the components. For some jobs, such as a map-side join, there is no need to run reduce tasks at all, so not all final output is generated by reduce tasks.

[figure: data flow among MR components]

Data Flow

Some things to note:

1) The client/application/submitter specifies the number of reduce tasks, which determines the number of partitions.

2) Values (records) with the same key end up in the same partition and, later, go to the same reducer.

3) The records within each partition are sorted in memory (Hadoop's default in-memory sort is quicksort; the spill files are later combined with a merge sort).
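Points 1 and 2 follow directly from how the default partitioner works. Below is a minimal Python sketch; Hadoop's actual `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks` in Java, and the hash function here is just a deterministic stand-in:

```python
def partition(key: str, num_reducers: int) -> int:
    """Same key -> same hash -> same partition -> same reducer.
    Stand-in for Hadoop's HashPartitioner; uses a Java-style
    string hash so the result is deterministic across runs."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h % num_reducers

# Every occurrence of the same word lands in the same partition.
print(partition("hello", 3) == partition("hello", 3))  # True
```

Because the number of reducers fixes the modulus, choosing it at submit time is what fixes the number of partitions.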

 

[figure: word-count data flow through map, partition, sort, and reduce]

Shuffle

  • Map output is not written directly to disk but to a memory buffer (100 MB by default); writing to disk for each record would be very slow, and sorting would not be possible. When the buffer is 80% full, a background thread partitions the data, sorts it in memory, and spills that part of the data to disk as an immutable spill file, while the map task keeps running unless the buffer fills up completely.
  • Before the partitioned and sorted data is written to disk, you can optionally set a combiner (in this example it has the same logic as the reducer) to optimize the job by storing and transferring less data. See the change in the third step of the figure: the values (each 1) are summed per key, so the number of records spilled to disk drops to 4 (from the original 6).
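The combiner step can be sketched as a local reduce over one spill's sorted records. The words below are hypothetical stand-ins for the figure's data, chosen to reproduce the 6-records-in, 4-records-out example:

```python
from collections import OrderedDict

def combine(records):
    """Sum values per key within one spill, like a map-side reducer.
    Assumes `records` are already sorted by key, as in the spill buffer."""
    out = OrderedDict()
    for key, value in records:
        out[key] = out.get(key, 0) + value
    return list(out.items())

# 6 map-output records ...
spill = [("a", 1), ("a", 1), ("b", 1), ("b", 1), ("c", 1), ("d", 1)]
print(combine(spill))
# ... become 4 combined records: [('a', 2), ('b', 2), ('c', 1), ('d', 1)]
```

A combiner is only safe when the reduce function is associative and commutative (summation is), because Hadoop may run it zero, one, or many times.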

[figure: combiner summing values per key before the spill]

  • As we know, an HDFS file does not support modification (it is immutable), and the same holds for these spill files. As the map task keeps emitting key-value pairs into the memory buffer, more and more spill files accumulate on disk. A background thread merges them (10 spill files at a time by default) into larger files that are still partitioned, sorted, and combined. When the map task finishes, all its data has been merged into a single partitioned file on the local file system (ext3/4), from which each reduce task fetches its own partition. You can compress this file to save I/O and network transmission to the reduce tasks. These intermediate files are removed once the job completes.
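Merging many already-sorted spill files into one sorted file is a standard k-way merge. A sketch using Python's `heapq.merge` (Hadoop's merge factor, 10 files at a time, corresponds to passing at most 10 iterables per pass; the in-memory lists here stand in for spill files):

```python
import heapq

def merge_spills(spills):
    """K-way merge of already-sorted spill files (modeled as lists).
    The result stays sorted by key, so a combiner or the reducer
    can stream over it without re-sorting."""
    return list(heapq.merge(*spills, key=lambda kv: kv[0]))

spill1 = [("a", 2), ("c", 1)]
spill2 = [("b", 1), ("c", 3)]
print(merge_spills([spill1, spill2]))
# [('a', 2), ('b', 1), ('c', 1), ('c', 3)]
```

Because each input is sorted, the merge only ever needs one record per spill in memory, which is why huge map outputs can be merged with a small buffer.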

 

[figure: merging spill files on the map side]

 

  • Reducers do not wait for all map tasks to complete; copying starts as soon as some map output is available. Five threads by default copy map output into a memory buffer on the reduce side. When the buffer is 80% full, just as on the map side, spill files start to be written to disk. Before a spill is written, it is merged, sorted, and combined in memory (because the data comes from multiple map outputs for the same partition, there may be records with the same key). Multiple spill files are also merged, sorted, and combined concurrently by a background thread in the reduce task. Finally, the merged output is consumed by the reduce function. Note that sorting and combining happen on both the map side and the reduce side.
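Putting the pieces together, the whole shuffle can be simulated end to end for word count. This is a hedged, in-memory sketch (no Hadoop, no spills or copy threads; the function names and the ordinal-sum partition function are illustrative):

```python
from collections import defaultdict

def map_phase(line):
    # map task: emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def shuffle(map_outputs, num_reducers):
    # partition by key, then sort each partition by key,
    # mirroring the map-side spill/merge described above
    partitions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[sum(map(ord, key)) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

def reduce_phase(partition):
    # sum values per key; input is sorted, so same-key records are adjacent
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

lines = ["hello world", "hello map reduce"]
maps = [map_phase(line) for line in lines]
result = {}
for part in shuffle(maps, num_reducers=2):
    result.update(reduce_phase(part))
print(result)  # word counts: hello=2, world=1, map=1, reduce=1
```

Each stage maps onto the figures above: `map_phase` is the map task, `shuffle` is the partition/sort/copy machinery, and `reduce_phase` is the reduce function consuming its merged partition.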

 

[figure: shuffle on the reduce side]
