Trying a MapReduce program in Python and need some help


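To get more hands-on practice, I want to try a word-count project. Here is my sample data.

The United Nations (UN) is an intergovernmental organization established on 24 October 1945 to promote international cooperation. A replacement for the ineffective League of Nations, the organization was created following the Second World War to prevent another such conflict.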

[...]

I use the following Python code to get my result:

from mrjob.job import MRJob
from mrjob.step import MRStep


class MovieRatings(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
        ]

    def mapper_get_ratings(self, _, line):
        (word) = line.split(' ')
        yield word, 1

    def reducer_count_ratings(self, key, values):
        yield Key, sum(values)


if __name__ == '__main__':
    MovieRatings.run()

With Python 2 I get the following error:

[root@localhost Desktop]# python RatingsBreakdown.py UN.txt
Traceback (most recent call last):
  File "RatingsBreakdown.py", line 1, in <module>
    from mrjob.job import MRJob
  File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 1106
    for k, v in unfiltered_jobconf.items() if v is not None
      ^
SyntaxError: invalid syntax

And with Python 3:

[root@localhost Desktop]# python3 RatingsBreakdown.py UN.txt
No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 2...
Creating temp directory /tmp/RatingsBreakdown.training.20171128.083536.602598
Error while reading from /tmp/RatingsBreakdown.training.20171128.083536.602598/step/000/mapper/00000/input:
Traceback (most recent call last):
  File "RatingsBreakdown.py", line 25, in <module>
    RatingsBreakdown.run()
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 424, in run
    mr_job.execute()
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 445, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 185, in execute
    self.run_job()
  File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 233, in run_job
    runner.run()
  File "/usr/lib/python3.4/site-packages/mrjob/runner.py", line 511, in run
    self._run()
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 144, in _run
    self._run_mappers_and_combiners(step_num, map_splits)
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 185, in _run_mappers_and_combiners
    for task_num, map_split in enumerate(map_splits)
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 120, in _run_multiple
    func()
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 662, in _run_mapper_and_combiner
    run_mapper()
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 685, in _run_task
    stdin, stdout, stderr, wd, env)
  File "/usr/lib/python3.4/site-packages/mrjob/inline.py", line 92, in invoke_task
    task.execute()
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 433, in execute
    self.run_mapper(self.options.step_num)
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 517, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "RatingsBreakdown.py", line 13, in mapper_get_ratings
    (userID, movieID, rating, timestamp) = line.split('	')
ValueError: need more than 1 value to unpack

And with my MovieRatings script:

[root@localhost Desktop]# python3 MovieRatings.py UN.txt
No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/MovieRatings.training.20171128.083635.368889
Error while reading from /tmp/MovieRatings.training.20171128.083635.368889/step/000/reducer/00000/input:
Traceback (most recent call last):
  File "MovieRatings.py", line 20, in <module>
    MovieRatings.run()
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 424, in run
    mr_job.execute()
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 445, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 185, in execute
    self.run_job()
  File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 233, in run_job
    runner.run()
  File "/usr/lib/python3.4/site-packages/mrjob/runner.py", line 511, in run
    self._run()
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 150, in _run
    self._run_reducers(step_num, num_reducer_tasks)
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 246, in _run_reducers
    for task_num in range(num_reducer_tasks)
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 120, in _run_multiple
    func()
  File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 685, in _run_task
    stdin, stdout, stderr, wd, env)
  File "/usr/lib/python3.4/site-packages/mrjob/inline.py", line 92, in invoke_task
    task.execute()
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 439, in execute
    self.run_reducer(self.options.step_num)
  File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 560, in run_reducer
    for out_key, out_value in reducer(key, values) or ():
  File "MovieRatings.py", line 17, in reducer_count_ratings
    yield Key, sum(values)
NameError: name 'Key' is not defined

I would like to fix these errors and understand what I am doing wrong.

Answer

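It looks like the library only works with Python 3 here; the Python 2 traceback comes from a Python 2.6 installation.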

  File "RatingsBreakdown.py", line 13, in mapper_get_ratings
    (userID, movieID, rating, timestamp) = line.split('	')
ValueError: need more than 1 value to unpack

First, you ran RatingsBreakdown.py... Also, the input you showed contains no tabs, yet you are trying to extract 4 columns. It is unclear what you expected here.
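The ValueError can be reproduced without mrjob. The following is a minimal sketch: parse_rating is a hypothetical helper (it is not in the question's code), and the tab-separated line is made-up data that only has the four-column shape the RatingsBreakdown.py mapper expects.

```python
def parse_rating(line):
    """Return the 4 tab-separated fields, or None if the line has another shape."""
    fields = line.split('\t')
    if len(fields) != 4:
        return None
    return tuple(fields)


# A line of prose like the UN sample contains no tab characters.
text_line = "The United Nations (UN) is an intergovernmental organization"
# Made-up line in the userID, movieID, rating, timestamp layout.
rating_line = "1\t10\t5\t1000"

print(len(text_line.split('\t')))   # the whole line comes back as one field
print(parse_rating(text_line))
print(parse_rating(rating_line))

# Unpacking one field into four names is what raises the error.
try:
    userID, movieID, rating, timestamp = text_line.split('\t')
except ValueError as exc:
    print("unpack failed:", exc)
```

Checking the field count first, as the helper does, lets a mapper skip malformed lines instead of aborting the whole job; whether skipping is the right policy depends on the data.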

  File "MovieRatings.py", line 17, in reducer_count_ratings
    yield Key, sum(values)
NameError: name 'Key' is not defined

Self-explanatory... your variable is lowercase: key
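Renaming Key to key removes the NameError, but the mapper in the question would probably still not count words: parentheses around a single name do not unpack anything, so word is bound to the whole list returned by split and the list becomes the key. A small sketch in plain Python, using a made-up input line:

```python
line = "to be or not to be"

# Parentheses around a single name are just grouping:
# `word` is the entire list of tokens, not one token.
(word) = line.split(' ')
print(type(word).__name__, word)

# A word count needs one (token, 1) pair per token instead.
pairs = [(token, 1) for token in line.split()]
print(pairs)
```

Inside a mapper, the same idea becomes a loop over line.split() with one yield per token.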

Another answer

Are you trying the example from this course (link)?

According to this issue, mrjob has dropped support for Python 2.6.
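This matches the Python 2 traceback in the question: the caret points at a for that appears to sit inside a dict comprehension, a construct that Python 2.6 cannot parse (it was added in Python 2.7). A minimal sketch of the difference, using a made-up dictionary in place of unfiltered_jobconf, whose real contents are not shown in the question:

```python
import sys

# Hypothetical stand-in for the dictionary named in the traceback.
unfiltered_jobconf = {"name": "wordcount", "reducers": None}

# Dict comprehension: valid on Python 2.7+ and 3.x, a SyntaxError on 2.6.
filtered = {k: v for k, v in unfiltered_jobconf.items() if v is not None}

# Equivalent spelling that Python 2.6 can also parse.
filtered_26 = dict((k, v) for k, v in unfiltered_jobconf.items() if v is not None)

print(sys.version_info[:2], filtered == filtered_26)
```

Because the failure happens while the interpreter parses mrjob's own source, nothing in the job script can work around it; a newer interpreter is needed.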

I fixed it by installing Python 2.7 on the CentOS system that the VM uses (reference), then setting up pip (reference) and installing mrjob again.

Now everything runs with this:

python2.7 RatingsBreakdown.py u.data

Another answer

I did the same job, but without using the steps function. It works.

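from mrjob.job import MRJob


class wordcount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated word on the line.
        for word in line.split():
            yield word, 1

    def reducer(self, x, count):
        # Sum the 1s emitted for each distinct word.
        yield x, sum(count)


if __name__ == '__main__':
    wordcount.run()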
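To check what a word-count mapper and reducer should produce without installing mrjob, the map, group and reduce phases can be simulated in plain Python. This is only a sketch: run_local is a hypothetical in-memory driver, not part of mrjob and not how mrjob executes a job, and the sample lines are made up.

```python
from collections import defaultdict


def mapper(_, line):
    # Emit one (word, 1) pair per whitespace-separated token.
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    # Sum every 1 emitted for the same word.
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical driver: map, group by key, then reduce, all in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


sample = ["the UN is an organization", "the organization was created"]
print(run_local(sample))
```

Comparing this dictionary with the job's output on a small input file is a quick way to spot a mapper that emits whole lines instead of single words.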
