Pig DUMP gets stuck in GROUP
Posted: 2012-06-01 00:50:11

I'm a Pig beginner (using Pig 0.10.0), and I have some simple JSON like the following:
test.json:
"from": "1234567890",
.....
"profile":
"email": "me@domain.com"
.....
I do some grouping/counting in Pig:

>pig -x local

using the following Pig script:
REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;
users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);
domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */
grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
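For context, oink.EmailDomainFilter is a third-party UDF; judging from the first DUMP's output it maps me@domain.com to domain.com, and the GROUP then collects whole rows into a bag per domain. Here is a rough Python sketch of what the second DUMP should produce when the job isn't stuck; the extraction logic inside email_domain is an assumption, not the UDF's actual code:

```python
from collections import defaultdict

def email_domain(email):
    # Hypothetical stand-in for the oink.EmailDomainFilter UDF:
    # return the part after '@', or None for malformed input.
    if not email or "@" not in email:
        return None
    return email.rsplit("@", 1)[1]

# Rows as loaded from test.json: (profile.email, from)
users = [("me@domain.com", "1234567890")]

# FOREACH users GENERATE ...: (email domain, user_id)
domain_user = [(email_domain(e), uid) for e, uid in users]
print(domain_user)  # [('domain.com', '1234567890')]

# GROUP domain_user BY email: each key maps to a bag of the full tuples
grouped = defaultdict(list)
for row in domain_user:
    grouped[row[0]].append(row)
print(dict(grouped))  # {'domain.com': [('domain.com', '1234567890')]}
```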
Basically, when I try to dump grouped_domain_user, Pig gets stuck, apparently waiting for a map output to complete:
2012-05-31 17:45:22,111 [Thread-15] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:31,122 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:37,123 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:43,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:46,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:52,126 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:58,127 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:46:01,128 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
.... repeats ....
Suggestions on why this might be happening would be welcome.

Thanks!
Update

Chris solved this for me. I had set fs.default.name etc. to the correct values in pig.properties, but I also had the HADOOP_CONF_DIR environment variable pointing to my local Hadoop install, whose config files set those same values with <final>true</final>.

Great find, much appreciated.
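A quick way to check for this kind of silent override is to look for `<final>true</final>` markers in whatever directory HADOOP_CONF_DIR points at. The sketch below creates a throwaway conf dir purely to illustrate the check; in practice, point the grep at your real $HADOOP_CONF_DIR:

```shell
# Demo: detect <final>true</final> markers in a Hadoop conf dir.
# CONF is a throwaway directory for illustration; substitute your
# real $HADOOP_CONF_DIR when diagnosing an actual install.
CONF=$(mktemp -d)
cat > "$CONF/mapred-site.xml" <<'EOF'
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://localhost:9000</value>
  <final>true</final>
</property>
EOF

# Any files listed here contain properties that silently win
# over values set in pig.properties or on the command line.
grep -l '<final>true</final>' "$CONF"/*-site.xml
```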
Comments:

Question - as in a recent post, do you by any chance have the fs.default.name and mapred.job.tracker configuration properties marked as final? - ***.com/questions/10720132/…

I actually just set them in the pig.properties file. I'll check to make sure I don't have any stray Hadoop versions lurking on the path.

As an additional FYI, the same script runs fine on a live cluster.

Great call, Chris! That was the problem. I'm guessing my homebrew Hadoop install's conf files were being read (with those parameters set to final).
Answer:

Marking this question as answered, for anyone who runs into this problem in the future:

When running in local mode (whether for Pig via pig -x local, or when submitting a MapReduce job to the local job runner), if the reduce phase appears to "hang", especially when you see log entries like:
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask -
attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
then your job, although started in local mode, has probably switched over to "cluster" mode, because the mapred.job.tracker property is marked "final" in $HADOOP/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9000</value>
<final>true</final>
</property>
You should also check the fs.default.name property in core-site.xml and make sure it is not marked final. A final property cannot be overridden at runtime, and you may even see warning messages like:
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.