"Counters from step 1: (no counters found)" with Hadoop and mrjob

I have a Python file that uses mrjob on Hadoop (version 2.6.0) to count bigrams, but I'm not getting the output I was hoping for, and I'm having trouble deciphering the output in my terminal to see where I went wrong.

My code:

import re

import mrjob.protocol
from mrjob.job import MRJob

regex_for_words = re.compile(r"[\w']+")

class BiCo(MRJob):
  OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol

  def mapper(self, _, line):
    words = regex_for_words.findall(line)
    wordsinline = list()
    for word in words:
        wordsinline.append(word.lower()) 
    wordscounter = 0
    totalwords = len(wordsinline)
    for word in wordsinline:
        if wordscounter < (totalwords - 1):
            nextword_pos = wordscounter+1
            nextword = wordsinline[nextword_pos]
            bigram = word, nextword
            wordscounter +=1
            yield (bigram, 1)

  def combiner(self, bigram, counts):
    yield (bigram, sum(counts))

  def reducer(self, bigram, counts):
    yield (bigram, str(sum(counts)))

if __name__ == '__main__':
  BiCo.run()

I tested the code in my mapper function on my local machine (basically, everything through the `yield` line) to make sure it grabs the bigrams as intended, so I figured it should work fine... but clearly something is going wrong.
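
For reference, the local check of the mapper logic looked roughly like this plain-Python sketch (the function name is mine, for illustration):

```python
import re

regex_for_words = re.compile(r"[\w']+")

def extract_bigrams(line):
    # Mirrors the mapper body: lowercase every word, then pair each
    # word with its successor.
    words = [w.lower() for w in regex_for_words.findall(line)]
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

print(extract_bigrams("The boy ran home"))
# [('the', 'boy'), ('boy', 'ran'), ('ran', 'home')]
```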

When I run the code on the Hadoop server, I get the following output (apologies if it's more than necessary; the screen dumps a huge amount of information and I'm not yet sure what will help in homing in on the problem area):

HADOOP: 2015-10-25 17:00:46,992 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Running job: job_1438612881113_6410
HADOOP: 2015-10-25 17:00:52,110 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1376)) - Job job_1438612881113_6410 running in uber mode : false
HADOOP: 2015-10-25 17:00:52,111 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 0% reduce 0%
HADOOP: 2015-10-25 17:00:58,171 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 33% reduce 0%
HADOOP: 2015-10-25 17:01:00,184 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 100% reduce 0%
HADOOP: 2015-10-25 17:01:07,222 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 100% reduce 100%
HADOOP: 2015-10-25 17:01:08,239 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1394)) - Job job_1438612881113_6410 completed successfully
HADOOP: 2015-10-25 17:01:08,321 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1401)) - Counters: 51
HADOOP:         File System Counters
HADOOP:                 FILE: Number of bytes read=2007840
HADOOP:                 FILE: Number of bytes written=4485245
HADOOP:                 FILE: Number of read operations=0
HADOOP:                 FILE: Number of large read operations=0
HADOOP:                 FILE: Number of write operations=0
HADOOP:                 HDFS: Number of bytes read=1013129
HADOOP:                 HDFS: Number of bytes written=0
HADOOP:                 HDFS: Number of read operations=12
HADOOP:                 HDFS: Number of large read operations=0
HADOOP:                 HDFS: Number of write operations=2
HADOOP:         Job Counters
HADOOP:                 Killed map tasks=1
HADOOP:                 Launched map tasks=4
HADOOP:                 Launched reduce tasks=1
HADOOP:                 Rack-local map tasks=4
HADOOP:                 Total time spent by all maps in occupied slots (ms)=33282
HADOOP:                 Total time spent by all reduces in occupied slots (ms)=12358
HADOOP:                 Total time spent by all map tasks (ms)=16641
HADOOP:                 Total time spent by all reduce tasks (ms)=6179
HADOOP:                 Total vcore-seconds taken by all map tasks=16641
HADOOP:                 Total vcore-seconds taken by all reduce tasks=6179
HADOOP:                 Total megabyte-seconds taken by all map tasks=51121152
HADOOP:                 Total megabyte-seconds taken by all reduce tasks=18981888
HADOOP:         Map-Reduce Framework
HADOOP:                 Map input records=28214
HADOOP:                 Map output records=133627
HADOOP:                 Map output bytes=2613219
HADOOP:                 Map output materialized bytes=2007852
HADOOP:                 Input split bytes=304
HADOOP:                 Combine input records=133627
HADOOP:                 Combine output records=90382
HADOOP:                 Reduce input groups=79518
HADOOP:                 Reduce shuffle bytes=2007852
HADOOP:                 Reduce input records=90382
HADOOP:                 Reduce output records=0
HADOOP:                 Spilled Records=180764
HADOOP:                 Shuffled Maps =3
HADOOP:                 Failed Shuffles=0
HADOOP:                 Merged Map outputs=3
HADOOP:                 GC time elapsed (ms)=93
HADOOP:                 CPU time spent (ms)=7940
HADOOP:                 Physical memory (bytes) snapshot=1343377408
HADOOP:                 Virtual memory (bytes) snapshot=14458105856
HADOOP:                 Total committed heap usage (bytes)=4045406208
HADOOP:         Shuffle Errors
HADOOP:                 BAD_ID=0
HADOOP:                 CONNECTION=0
HADOOP:                 IO_ERROR=0
HADOOP:                 WRONG_LENGTH=0
HADOOP:                 WRONG_MAP=0
HADOOP:                 WRONG_REDUCE=0
HADOOP:         Unencodable output
HADOOP:                 TypeError=79518
HADOOP:         File Input Format Counters
HADOOP:                 Bytes Read=1012825
HADOOP:         File Output Format Counters
HADOOP:                 Bytes Written=0
HADOOP: 2015-10-25 17:01:08,321 INFO  [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1022)) - Output directory: hdfs:///user/andersaa/si601f15lab5_output
Counters from step 1:
  (no counters found)

I'm confused about why no counters are found in step 1 (I'm assuming step 1 is the mapper portion of my code, which may be a bad assumption). If I'm reading the Hadoop output correctly, it looks like the job at least made it to the reduce phase (since there are Reduce input groups) and there were no shuffle errors. I suspect the answer may lie in "Unencodable output: TypeError=79518", but no amount of googling has helped me understand what that error is.

Any help or insight would be greatly appreciated.

Answer

One problem is the encoding of the mapper's bigram. As written above, the bigram is of Python type "tuple":

>>> word = 'the'
>>> word2 = 'boy'
>>> bigram = word, word2
>>> type(bigram)
<type 'tuple'>
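
That tuple is what trips the "Unencodable output: TypeError" counter: RawProtocol writes the key and value out essentially as raw text joined by a tab. A rough illustration of the failure (not mrjob's exact internals) is that the join raises TypeError when the key is a tuple rather than a string:

```python
key = ('the', 'boy')   # tuple key, as the original mapper yields it
value = '1'
try:
    line = '\t'.join((key, value))
except TypeError:
    print('unencodable output: TypeError')
```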

Normally, a plain string is used as the key. So, create the bigram as a string. One way you could do that is:

bigram = '-'.join((word, nextword))
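
As a standalone sketch (outside the job), here is what the joined keys look like for a short word list, using `zip` to pair each word with its successor:

```python
words = ['the', 'boy', 'ran']
# Pair each word with the next one, then join each pair with '-'.
bigram_keys = ['-'.join(pair) for pair in zip(words, words[1:])]
print(bigram_keys)  # ['the-boy', 'boy-ran']
```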

When I make that change in the program, I see output like this:

automatic-translation   1
automatic-vs    1
automatically-focus 1
automatically-learn 1
automatically-learning  1
automatically-translate 1
available-including 1
available-without   1

Another tip: try using -q on the command line to silence all the intermediate Hadoop noise. Sometimes it just gets in the way.

HTH。

Another answer

This is a caching error. I mostly ran into it with the Hortonworks sandbox. The simple fix is to log out of the sandbox and ssh in again.
