Hadoop MapReduce Java API

Posted 2021-02-16 gonens

Mapper

　　Input:

　　　　The input is the set of InputSplits produced by the InputFormat.

　　　　The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

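　　　　The mapper is set through the Job API and must provide a map function.

　　　　Mapper implementations are passed to the Job via Job.setMapperClass(Class); the framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit.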

　　　　( hint: Applications can then override the cleanup(Context) method to perform any required cleanup. )

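　　　　Job.setGroupingComparatorClass(Class) sets the comparator that decides how keys are grouped (used to group the intermediate output after map).

　　　　Job.setCombinerClass(Class) sets the Combiner, which aggregates map output locally before it is sent to the reducers.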
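　　The call sequence above (one task per split, one map call per record, then local combining) can be mimicked in plain Java. This is only a sketch with no Hadoop dependency: `MapTaskSketch`, `runTask`, and `combine` are made-up names, and a word count is assumed as the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for a map task; none of these names are Hadoop classes.
class MapTaskSketch {
    // Stand-in for map(key, value, context): emits (word, 1) for every word in a line.
    static void map(long offset, String line, List<Map.Entry<String, Integer>> out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
    }

    // Stand-in for a combiner: sums the counts per word inside one map task.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> emitted) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> e : emitted) {
            local.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return local;
    }

    // One map task: call map once per record of its split, then combine locally.
    static Map<String, Integer> runTask(List<String> split) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        long offset = 0;
        for (String line : split) {
            map(offset, line, emitted);
            offset += line.length() + 1;
        }
        // A real Mapper's cleanup(Context) would run here, after the last record.
        return combine(emitted);
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("hello hadoop", "hello map"),
                List.of("hello reduce"));
        for (List<String> split : splits) { // one task per split
            System.out.println(runTask(split));
        }
    }
}
```

　　The combiner step only shrinks what each task hands to the shuffle; the reducers still have to merge the per-task results.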

　　The number of maps is usually driven by the total number of blocks of the input files.

　　Running roughly 10-100 map tasks per node is a reasonable level of parallelism.
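　　As a rough illustration of the block-count rule, the estimate can be computed by hand. The sketch below assumes a 128 MB block size and three invented file sizes; `MapCountEstimate` is a hypothetical helper, and the real number of splits also depends on the InputFormat and its split settings.

```java
// Back-of-the-envelope estimate; the class and method names are made up for this note.
class MapCountEstimate {
    // Blocks needed for one file: fileBytes / blockBytes, rounded up.
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Sums the per-file block counts to estimate the number of map tasks.
    static long estimateMaps(long[] fileSizes, long blockBytes) {
        long total = 0;
        for (long size : fileSizes) {
            total += blocksFor(size, blockBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockBytes = 128 * mb; // assumed block size
        long[] files = {300 * mb, 128 * mb, 1 * mb};
        System.out.println(estimateMaps(files, blockBytes)); // 3 + 1 + 1 = 5
    }
}
```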

Reducer

　　Job.setNumReduceTasks(int)

　　　　Sets the number of reduce tasks.

　　Job.setReducerClass(Class)

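　　　　Plays the same role as its map counterpart: it passes the Reducer implementation to the Job.

　　Shuffle & sort: the map outputs are sorted by key and partitioned across the reducers.

　　　　Job.setSortComparatorClass(Class)

　　　　　　Sets the comparator that orders keys, and so decides which keys end up next to each other, when the outputs of multiple maps are merged.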

　　　　The output of the Reducer is not sorted.
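　　The difference between ordering keys and grouping them is easier to see on a toy data set. The following plain-Java sketch does not use Hadoop; `Reading`, `SORT`, and `GROUP` are hypothetical names standing in for a composite key, a sort comparator, and a grouping comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration; Reading and the two comparators are hypothetical names.
class SortVsGroupSketch {
    static final class Reading {
        final String station;
        final int temp;
        Reading(String station, int temp) { this.station = station; this.temp = temp; }
        @Override public String toString() { return station + ":" + temp; }
    }

    // Plays the role of the sort comparator: orders by station, then by temperature.
    static final Comparator<Reading> SORT =
            Comparator.comparing((Reading r) -> r.station).thenComparingInt(r -> r.temp);

    // Plays the role of the grouping comparator: only the station decides the group.
    static final Comparator<Reading> GROUP = Comparator.comparing((Reading r) -> r.station);

    // Sort all keys, then start a new group whenever GROUP sees a different key.
    static List<List<Reading>> shuffle(List<Reading> mapOutput) {
        List<Reading> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<List<Reading>> groups = new ArrayList<>();
        Reading previous = null;
        for (Reading r : sorted) {
            if (previous == null || GROUP.compare(previous, r) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(r);
            previous = r;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Reading> mapOutput = List.of(
                new Reading("b", 7), new Reading("a", 9),
                new Reading("a", 3), new Reading("b", 1));
        System.out.println(shuffle(mapOutput)); // [[a:3, a:9], [b:1, b:7]]
    }
}
```

　　Each inner list corresponds to what one reduce call would see: the values arrive already ordered because the sort comparator looked at the whole key, while the grouping comparator looked only at the station.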

　　The right number of reduces: 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

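　　( 0.95: all of the reduces can launch immediately and start transferring map outputs as the maps finish.

　　 1.75: the faster nodes finish a first round of reduces and launch a second wave, which balances load much better. More reduces add framework overhead, but they improve load balancing and lower the cost of failures.

　　　*The factors are slightly below whole numbers to leave a few slots free for failed tasks. )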
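　　The rule of thumb is simple arithmetic, sketched below for an assumed cluster of 10 nodes with 8 containers each. `ReduceCountSketch` is a hypothetical helper; passing the factor as a percentage and rounding down are choices made here, not something the rule prescribes. The result is the kind of value one would hand to Job.setNumReduceTasks(int).

```java
// Rule-of-thumb calculator; names are hypothetical, not part of the Hadoop API.
class ReduceCountSketch {
    // factorPercent is 95 or 175, i.e. the factors 0.95 and 1.75 times 100.
    static int suggestedReduces(int nodes, int maxContainersPerNode, int factorPercent) {
        int slots = nodes * maxContainersPerNode;
        return slots * factorPercent / 100; // integer division rounds down
    }

    public static void main(String[] args) {
        System.out.println(suggestedReduces(10, 8, 95));  // 76
        System.out.println(suggestedReduces(10, 8, 175)); // 140
    }
}
```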

If the number of reduce tasks is set to 0, the map output is written directly to the given location in the file system.

The Partitioner decides how the intermediate results are partitioned before reduce; the default is HashPartitioner.
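To my understanding, HashPartitioner picks the partition from the key's hashCode modulo the number of reduce tasks, after clearing the sign bit. The sketch below reproduces that arithmetic in plain Java; `HashPartitionSketch` and `partitionFor` are made-up names and the code does not depend on Hadoop.

```java
// Mirrors the arithmetic of hash partitioning; the class name is made up.
class HashPartitionSketch {
    static int partitionFor(Object key, int numReduceTasks) {
        // Clear the sign bit so a negative hashCode still gives a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(7, 4));   // 3
        System.out.println(partitionFor(-1, 4));  // 3
        System.out.println(partitionFor("a", 4)); // 1
    }
}
```

Equal keys always land in the same partition, which is what lets one reducer see every value for a given key.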

A Counter can be used to report statistics from map and reduce tasks.
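The idea behind counters can be shown without Hadoop: a task bumps a named counter for each event it wants reported. In this plain-Java sketch, `CounterSketch` and the `RecordCounter` enum are hypothetical, and "a record is valid if it parses as an integer" is an assumption chosen for the example. In a real job the increment would go through the task's Context, for instance context.getCounter(RecordCounter.MALFORMED).increment(1).

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for job counters; RecordCounter and the class name are hypothetical.
class CounterSketch {
    enum RecordCounter { VALID, MALFORMED }

    // Counts how many input lines parse as integers and how many do not.
    static Map<RecordCounter, Long> countRecords(List<String> lines) {
        Map<RecordCounter, Long> counters = new EnumMap<>(RecordCounter.class);
        for (RecordCounter c : RecordCounter.values()) {
            counters.put(c, 0L);
        }
        for (String line : lines) {
            try {
                Integer.parseInt(line.trim());
                counters.merge(RecordCounter.VALID, 1L, Long::sum);
            } catch (NumberFormatException e) {
                counters.merge(RecordCounter.MALFORMED, 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(countRecords(List.of("1", "x", " 42 ", ""))); // {VALID=2, MALFORMED=2}
    }
}
```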
