Mrjob in hadoop mode: Error launching job, bad input path: File does not exist
[Posted]: 2016-12-24 15:04:34
[Question]:
I am trying to run the Mrjob example from the book Hadoop with Python on my laptop, in pseudo-distributed mode.
(The file salaries.csv can be found here.)
I can start the namenode and datanode with:
start-dfs.sh
This returns:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-namenode-me-Notebook-PC.out
localhost: starting datanode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-datanode-me-Notebook-PC.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-secondarynamenode-me-Notebook-PC.out
I also have no problem creating the input directory structure and copying salaries.csv to HDFS:
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/me/
hdfs dfs -mkdir /user/me/input/
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/
hdfs dfs -ls /user/me/input/
This returns:
Found 1 items
-rw-r--r-- 3 me supergroup 1771685 2016-12-24 15:57 /user/me/input/salaries.csv
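The upload steps above can also be scripted. The sketch below is a minimal Python wrapper around the same hdfs dfs commands. The helper names build_upload_commands and upload are hypothetical, and it uses -mkdir -p to create the parent directories in one call instead of three separate -mkdir invocations.

```python
import subprocess


def build_upload_commands(local_path, hdfs_dir):
    """Return the hdfs CLI invocations that mirror the manual steps."""
    return [
        # -p creates /user, /user/me and /user/me/input in a single call
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
        ["hdfs", "dfs", "-ls", hdfs_dir],
    ]


def upload(local_path, hdfs_dir):
    """Run each command in order, stopping at the first failure."""
    for argv in build_upload_commands(local_path, hdfs_dir):
        subprocess.check_call(argv)


# Example call (requires a running HDFS and the hdfs binary on PATH):
# upload("/home/me/Desktop/work/cv/hadoop/salaries.csv", "/user/me/input/")
```

Keeping the command list separate from the code that runs it makes it easy to print the commands for review before touching the cluster.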
I also made top_salaries.py executable:
sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py
Launching top_salaries.py in local mode also works:
python2 top_salaries.py -r local salaries.csv > answer.csv
This returns:
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_salaries.me.20161224.195052.762894
Running step 1 of 1...
Counters: 1
warn
missing gross=3223
Counters: 1
warn
missing gross=3223
Streaming final output from /tmp/top_salaries.me.20161224.195052.762894/output...
Removing temp directory /tmp/top_salaries.me.20161224.195052.762894...
However, running this job on Hadoop (putting it all together):
python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv
returns:
No configs found; falling back on auto-configuration
Looking for hadoop binary in $PATH...
Found hadoop binary: /home/me/hadoop-2.7.3/bin/hadoop
Using Hadoop version 2.7.3
Looking for Hadoop streaming jar in /home/me/hadoop-2.7.3...
Found Hadoop streaming jar: /home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
Creating temp directory /tmp/top_salaries.me.20161224.195201.967990
Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/...
Running step 1 of 1...
session.id is deprecated. Instead, use dfs.metrics.session-id
Initializing JVM Metrics with processName=JobTracker, sessionId=
Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001
Error launching job , bad input path : File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip
Streaming Command Failed!
Attempting to fetch counters from logs...
Can't fetch history log; missing job ID
No counters found
Scanning logs for probable cause of failure...
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Step 1 of 1 failed: Command '['/home/me/hadoop-2.7.3/bin/hadoop', 'jar', '/home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/top_salaries.py#top_salaries.py', '-input', 'hdfs:///user/me/input/salaries.csv', '-output', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/output', '-mapper', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --mapper', '-combiner', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --reducer']' returned non-zero exit status 512
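One reading of the log above, offered as an assumption and not a confirmed diagnosis: the job ID job_local553683497_0001 and the file:/tmp/hadoop-me/... staging area look like output from Hadoop's local job runner, which would mean the job was never submitted to the cluster. The sketch below flags those two signs in a captured log. The function name and both regular expressions are illustrative.

```python
import re

# Both patterns are assumptions based on the log shown above.
LOCAL_JOB_ID = re.compile(r"\bjob_local\d+_\d+\b")
LOCAL_STAGING = re.compile(r"staging area file:/")


def local_runner_signs(log_text):
    """Return the log lines that suggest the job ran with the local job runner."""
    return [
        line
        for line in log_text.splitlines()
        if LOCAL_JOB_ID.search(line) or LOCAL_STAGING.search(line)
    ]


sample_log = """\
Running step 1 of 1...
Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001
Streaming Command Failed!
"""

for line in local_runner_signs(sample_log):
    print(line)
```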
Edit:
This is my core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
This is my hdfs-site.xml:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/me/Desktop/work/cv/hadoop/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/me/Desktop/work/cv/hadoop/datanode</value>
</property>
</configuration>
(I have not edited or changed the other XML configuration files.)
Edit 2:
Here is the Python script (the same as in the GitHub link above):
from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

class salarymax(MRJob):

    def mapper(self, _, line):
        # Convert each line into a dictionary
        row = dict(zip(cols, [ a.strip() for a in csv.reader([line]).next()]))

        # Yield the salary
        yield 'salary', (float(row['AnnualSalary'][1:]), line)

        # Yield the gross pay
        try:
            yield 'gross', (float(row['GrossPay'][1:]), line)
        except ValueError:
            self.increment_counter('warn', 'missing gross', 1)

    def reducer(self, key, values):
        topten = []

        # For 'salary' and 'gross' compute the top 10
        for p in values:
            topten.append(p)
            topten.sort()
            topten = topten[-10:]

        for p in topten:
            yield key, p

    combiner = reducer

if __name__ == '__main__':
    salarymax.run()
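To make the reducer's logic easier to follow outside Hadoop, here is a minimal standalone sketch of the same top-ten selection in plain Python. It swaps the sort-and-truncate loop for heapq.nlargest, which is a substitution and not what the script itself does, and the sample pairs are invented for illustration.

```python
import heapq


def top_n(values, n=10):
    """Return the n largest items in ascending order, like the reducer's final list."""
    return sorted(heapq.nlargest(n, values))


# Invented (amount, line) pairs standing in for the mapper output.
pairs = [(float(i * 1000), "row %d" % i) for i in range(1, 26)]

for amount, line in top_n(pairs):
    print(amount, line)
```

Because the top ten of a whole dataset can be computed from the top tens of its parts, the same function is safe to reuse as a combiner, which is consistent with the script setting combiner = reducer.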
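[Comments]:
File not found: /tmp/hadoop-me/mapred/staging/me118248587/.staging/job_local118248587_0001/files/mrjob.zip#mrjob.zip. Check your file copy.
The XML files are fine. I see paths starting with /tmp/hadoop-me, hdfs:///user/me and hdfs:///user/hduser, which is a bit messy. The job cannot find mrjob.zip#mrjob.zip; check how you set up the input files for Hadoop.
Ha! Good catch. But what should I do to fix this? I can see now that it is messy, but where do I set these directories so that things are tidier?
Use the same user for Hadoop, so that all the usernames are the same. For example, hdfs dfs -mkdir /user/me/ instead of hdfs dfs -mkdir /user/hduser/. Then check the new error log.
Add your Python script code.
[Answer 1]:
OK. You need to edit the file core-site.xml to be: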
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
The file hdfs-site.xml should be:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/edureka/hadoop-2.7.3/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/edureka/hadoop-2.7.3/datanode</value>
</property>
</configuration>
You need to edit hdfs-site.xml to be:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/edureka/hadoop-2.7.3/datanode</value>
</property>
</configuration>
You need to create a mapred-site.xml file with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
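Since this setting decides whether jobs are submitted to YARN, it can help to check which value Hadoop will actually read. The sketch below parses a *-site.xml document with the standard library and returns its properties as a dict. read_properties is a hypothetical helper, and the fallback value local is what I believe the Hadoop 2.x default for mapreduce.framework.name to be, so treat it as an assumption.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <name> to its <value> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


mapred_site = """\
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
"""

props = read_properties(mapred_site)
# Assumed default when the property is absent: "local".
print(props.get("mapreduce.framework.name", "local"))
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring.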
You need to edit yarn-site.xml to contain:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Then run:
start-dfs.sh
start-yarn.sh
Then run:
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/me/
hdfs dfs -mkdir /user/me/input/
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/
Now run:
sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py
python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv > answer.csv
It works.
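If answer.csv needs further processing, note that despite the extension it is not comma-separated. Assuming mrjob's default output protocol, each line holds a JSON-encoded key and a JSON-encoded value separated by a tab. A minimal sketch of reading it back follows; the helper name and the sample line are made up.

```python
import json


def parse_output_line(line):
    """Split one output line into (key, value), assuming JSON key<TAB>JSON value."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# Made-up line in the shape the job would emit: key, then [amount, original CSV line].
sample = '"salary"\t[238772.0, "Doe,Jane,EXAMPLE TITLE,A00000,Example Agency,01/01/2000,$238772.00,$1.00"]\n'

key, (amount, csv_line) = parse_output_line(sample)
print(key, amount)
```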
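[Discussion]:
Could you explain the cause of this error and how these configurations fix it? Thank you.
Sorry, this was many years ago and I don't remember. Hadoop has probably changed a lot in the meantime (I don't know; I haven't used it in years), and I'm not even sure the answer to your question still applies today.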