MapReduce - retain input order

Posted: 2018-04-10 02:37:32

Question:

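A file contains a list of pipe-delimited numbers that may include duplicates. I need to write a MapReduce program that lists the distinct numbers in their original input order. I am able to remove the duplicates, but the input order is not preserved.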

Comments:

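Why not use Hive, Pig, or Spark for this? Any of them can do it in fewer than 10 lines of code.

OK, then force the data to a single reducer and sort.

It is unclear whether you were given input data that is already sorted, or whether you are asked to keep the exact order of the input data, just without duplicates.

Yes, the input order needs to be preserved. Basically, the input data is not in any particular order.

So far my output looks like (key=number, val=order_position). I think I need to make it key=order_position, val=number so that it retains the input order. [In the mapper, I assign an order_position sequence number to each number.]

Answer 1: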

This is simple. Suppose your text is:

Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
Line 5 -> But his face you could not see,
Line 6 -> The Quangle Wangle sat,

Lines 5 and 6 duplicate lines 3 and 2, respectively.

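The mapper should resemble the one in the WordCount program. With TextInputFormat, its input is a sequence of (byte offset, line) key-value pairs: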

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
(113, But his face you could not see,)
(146, The Quangle Wangle sat,)

Mapper output

(NullWritable, 0_On the top of the Crumpetty Tree)
(NullWritable, 33_The Quangle Wangle sat,)
(NullWritable, 57_But his face you could not see,)
(NullWritable, 89_On account of his Beaver Hat.)
(NullWritable, 113_But his face you could not see,)
(NullWritable, 146_The Quangle Wangle sat,)
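The mapper step above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class and method names (OffsetTagger, tagWithOffsets) are hypothetical, the text is assumed to be UTF-8 with "\n" line endings, and in a real job the offset would arrive as the mapper's input key rather than being computed by hand.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: mimics the (offset, line) pairs a mapper receives and
// builds the "<offset>_<line>" values it would emit under a NullWritable key.
class OffsetTagger {
    static List<String> tagWithOffsets(String fileContents) {
        List<String> tagged = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            // Mapper output value: byte offset, "_" separator, original line.
            tagged.add(offset + "_" + line);
            // Advance past the line's bytes plus the one-byte newline (assumed "\n").
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return tagged;
    }

    public static void main(String[] args) {
        String text = "On the top of the Crumpetty Tree\nThe Quangle Wangle sat,\n";
        for (String value : tagWithOffsets(text)) {
            System.out.println(value);
        }
    }
}
```

The first line is 32 bytes long plus a newline, which is why the second line starts at offset 33.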

Now make sure you have only one reducer, so that this single reducer receives all the mapper output:

Reducer input

Key: NullWritable
Iterable<value>: [(0_On the top of the Crumpetty Tree), 
(33_The Quangle Wangle sat,), 
(57_But his face you could not see,), 
(89_On account of his Beaver Hat.), 
(113_But his face you could not see,), 
(146_The Quangle Wangle sat,)]

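Note that this relies on the reducer seeing the values in ascending offset order, which matches the original order because the line offsets produced by TextInputFormat always ascend within a file. Hadoop only guarantees that reducer input is sorted by key, not that the values under one key are ordered, so for a strict guarantee emit the offset as the map output key (for example a LongWritable) rather than as a prefix of the value.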
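If the offset is carried as a numeric key instead of a text prefix, ordering by it is unambiguous; compared as text, "113_..." would sort before "33_...". The following plain-Java sketch simulates a sort by numeric offset for a single reducer, using a TreeMap as a stand-in for the shuffle. The class and method names are hypothetical, and it assumes every value has the "<offset>_<line>" shape shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical simulation: records arrive in arbitrary order and are handed
// back sorted by their numeric offset, as a sort on a numeric key would do.
class OffsetKeyedShuffle {
    static List<String> inOffsetOrder(List<String> taggedValues) {
        TreeMap<Long, String> byOffset = new TreeMap<>();
        for (String tagged : taggedValues) {
            // Limit 2 splits only at the first "_", so "_" inside the line survives.
            String[] parts = tagged.split("_", 2);
            byOffset.put(Long.parseLong(parts[0]), parts[1]);
        }
        // TreeMap iterates in ascending key order.
        return new ArrayList<>(byOffset.values());
    }

    public static void main(String[] args) {
        List<String> shuffled = Arrays.asList(
            "113_But his face you could not see,",
            "33_The Quangle Wangle sat,",
            "0_On the top of the Crumpetty Tree");
        for (String line : inOffsetOrder(shuffled)) {
            System.out.println(line);
        }
    }
}
```

Offsets are unique only within one file, so this sketch assumes a single input file.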

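In the reducer, simply iterate over the list, skip the duplicates, and write each remaining line after stripping the leading offset and its "_" separator. The reducer emits:

Reducer key-value pair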

NullWritable, value.split("_", 2)[1]

Reducer output

Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
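The reducer logic described above can be sketched as follows. This is a plain-Java illustration rather than a Hadoop Reducer: the class and method names are hypothetical, the values are assumed to arrive already in offset order, and every value is assumed to carry an "<offset>_" prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reducer body: keeps the first occurrence of each line and
// drops later repeats, preserving the order in which values are iterated.
class OrderPreservingDedup {
    static List<String> dedupe(Iterable<String> taggedValues) {
        Set<String> seen = new HashSet<>();
        List<String> output = new ArrayList<>();
        for (String tagged : taggedValues) {
            // Strip the "<offset>_" prefix; limit 2 keeps any "_" in the line itself.
            String line = tagged.split("_", 2)[1];
            // Set.add returns false when the line was already seen.
            if (seen.add(line)) {
                output.add(line);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> reducerInput = Arrays.asList(
            "0_On the top of the Crumpetty Tree",
            "33_The Quangle Wangle sat,",
            "57_But his face you could not see,",
            "89_On account of his Beaver Hat.",
            "113_But his face you could not see,",
            "146_The Quangle Wangle sat,");
        for (String line : dedupe(reducerInput)) {
            System.out.println(line);
        }
    }
}
```

The set of seen lines lives in the memory of the single reducer, so this approach suits inputs whose distinct lines fit in memory.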
