Sorting with MRJob in Python

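I have an assignment that asks me to run a MapReduce over customer data using a mapper/reducer in Python. I have a CSV file containing CustomerID, ProductID, and the amount spent. The first task is to determine the total amount spent by each customer, which I completed easily. The next part asks me to take that list and sort it by total amount spent in descending order. This is where I am struggling. It was suggested that I chain a second MapReduce on top of the first. Here is my code: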

PART 1:

from mrjob.job import MRJob

class TotalAmountCust(MRJob):

    def mapper(self, _, line):
        (customerid, idno, amount) = line.split(',')
        yield customerid, float(amount)

    def reducer(self, customerid, amount):
        yield customerid, sum(amount)

if __name__ == '__main__':
    TotalAmountCust.run()

PART 2:

from mrjob.job import MRJob
from mrjob.step import MRStep

class TotalAmountCustSort(MRJob):
    def steps(self):
        return [ MRStep(mapper = self.map_by, reducer = self.red_by),
                MRStep(mapper = self.map_sort, reducer = self.red_sort) ]
    def map_by(self, _, line):
        (customerid, idno, amount) = line.split(',')
        yield customerid.zfill(2), float(amount)
    def red_by(self, customerid, amount):
        yield customerid, '%04.02f' % sum(amount)
    def map_sort(self, customerid, total):
        yield float(total), customerid
    def red_sort(self, total, customerid):
        yield total, customerid
if __name__ == '__main__':
    TotalAmountCustSort.run()

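Part 2 is the problem: it does not give me any result at all. Any suggestions would be appreciated. I tried looking into MRJob.SORT_VALUES = True, but that did not give me the result I was hoping for.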
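
A likely cause of the missing output in Part 2, assuming mrjob's default JSON protocols are in use: red_sort yields customerid as it is, but a reducer receives its values as an iterator, and an iterator cannot be JSON-encoded. The plain-Python sketch below needs no mrjob; values_for_key and is_json_serializable are hypothetical helpers used only for illustration, and red_sort is rewritten as a standalone function that yields each value separately.

```python
import json


def values_for_key():
    # Stand-in for the iterator of values a reducer receives for one key.
    yield "44"
    yield "35"


def is_json_serializable(value):
    # True if the standard json module can encode the value.
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


def red_sort(total, customer_ids):
    # Yield one pair per customer instead of yielding the iterator itself.
    for customer_id in customer_ids:
        yield total, customer_id


generator_ok = is_json_serializable(values_for_key())   # the iterator itself
list_ok = is_json_serializable(list(values_for_key()))  # its materialized items
pairs = list(red_sort(99.5, values_for_key()))
print(generator_ok, list_ok, pairs)
```

Even with that change, float keys are compared as text between steps, so the order may still not be numeric.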

Answer

I solved it; the output is now sorted:

from mrjob.job import MRJob
from mrjob.step import MRStep

class SpendByCustomerSorted(MRJob):

    # Ask mrjob to sort the values within each key, not only the keys.
    SORT_VALUES = True

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_orders,
                   reducer=self.reducer_totals_by_customer),
            MRStep(mapper=self.mapper_make_amounts_key,
                   reducer=self.reducer_output_results)
        ]
    def mapper_get_orders(self, _, line):
        (customerID, itemID, orderAmount) = line.split(',')
        yield customerID, float(orderAmount)

    def reducer_totals_by_customer(self, customerID, orders):
        yield customerID, sum(orders)

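    def mapper_make_amounts_key(self, customerID, orderTotal):
        # Use a single key (None) so one reducer sees every total, and
        # zero-pad the amount so that string order matches numeric order.
        yield None, ("%07.02f"%float(orderTotal), customerID)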

    def reducer_output_results(self, n, orderTotalCustomerIDs):
        for c in orderTotalCustomerIDs:
            yield c[1], c[0]

if __name__ == '__main__':
    SpendByCustomerSorted.run()
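
The zero-padding in "%07.02f" is what makes the sort work: values are compared as strings, and strings of equal width sort in numeric order. As far as I can tell the result is ascending, while the question asks for descending. Below is a minimal plain-Python sketch of both points, using made-up totals; reducer_descending is a hypothetical variant of the final reducer that collects the pairs and reverses them, which assumes the full list fits in memory.

```python
# Hypothetical per-customer totals standing in for the first step's output.
totals = {"44": 4756.89, "35": 2155.83, "2": 99.5, "79": 803.0}

# Unpadded strings sort character by character, not numerically.
unpadded = sorted(str(amount) for amount in totals.values())

# Same width for every value (valid up to 9999.99 with this format).
padded = sorted("%07.02f" % amount for amount in totals.values())


def reducer_descending(_, padded_total_customer_pairs):
    # Sort the (padded_total, customer_id) pairs from largest to smallest.
    for padded_total, customer_id in sorted(padded_total_customer_pairs,
                                            reverse=True):
        yield customer_id, padded_total


pairs = [["%07.02f" % amount, customer] for customer, amount in totals.items()]
descending = list(reducer_descending(None, pairs))

print(unpadded)    # ['2155.83', '4756.89', '803.0', '99.5']
print(padded)      # ['0099.50', '0803.00', '2155.83', '4756.89']
print(descending)
```

In the job above, the same loop body could replace the one in reducer_output_results. Totals of 10000.00 or more would need a wider format than "%07.02f".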
