Pig DUMP gets stuck in GROUP

Posted: 2012-06-01 00:50:11

Question:

I'm a Pig beginner (using Pig 0.10.0), and I have some simple JSON like the following:

test.json:


  "from": "1234567890",
  .....
  "profile": 
      "email": "me@domain.com"
      .....
  

I do some grouping/counting in Pig:

>pig -x local

using the following Pig script:

REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;

users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);

domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */

grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
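
For reference, the count step I was working toward would look roughly like this (the domain_counts name is illustrative, not part of my actual script):

counted = FOREACH grouped_domain_user GENERATE group AS email, COUNT(domain_user) AS cnt;
DUMP counted; /* never reached, since the GROUP job hangs */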

Basically, when I try to dump grouped_domain_user, Pig gets stuck, apparently waiting for the map output to complete:

2012-05-31 17:45:22,111 [Thread-15] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:31,122 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:37,123 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:43,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:46,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:52,126 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:58,127 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:46:01,128 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
.... repeats ....

Suggestions as to why this is happening are welcome.

Thanks!

Update

Chris solved this for me. I had set fs.default.name and friends to the correct values in pig.properties, but I also had the HADOOP_CONF_DIR environment variable set to point at my local Hadoop install, where those same values were set with <final>true</final>.
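
For reference, the local-mode overrides in pig.properties were along these lines (the values shown are the stock local-mode ones, not copied from my actual file):

fs.default.name=file:///
mapred.job.tracker=local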

Great find, thanks very much.

Comments:

- Question - as in a recent post, have you marked the fs.default.name and mapred.job.tracker configuration properties as final? - ***.com/questions/10720132/…
- I actually just set them in the pig.properties file. I'll check to make sure I don't have any Hadoop versions lurking on the path.
- As an additional FYI, the same script runs fine on a live cluster.
- Great catch, Chris! That was the problem. I'm guessing the conf files from my homebrew Hadoop install were being read (those parameters were set to final).

Answer 1:

Marking this question as answered, and for anyone who runs into this in the future:

When running in local mode (whether running Pig via pig -x local, or submitting a map-reduce job to the local job runner), if you see the reduce phase "hang", especially with log entries similar to:

2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - 
      attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress

then your job, although it started in local mode, has probably switched over to "cluster" mode because the mapred.job.tracker property is marked as final in $HADOOP/conf/mapred-site.xml:

<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
</property>

You should also check the fs.default.name property in core-site.xml and make sure it is not marked as final.
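
A quick way to see which properties the Hadoop config on your path pins as final (assuming HADOOP_CONF_DIR points at that install) is a grep along these lines:

grep -B 2 '<final>true</final>' $HADOOP_CONF_DIR/*-site.xml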

A final property cannot be overridden at runtime, and you may even see warnings like the following:

12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name;  Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker;  Ignoring.
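
Alternatively, per the asker's update, you can leave the cluster config untouched and simply make sure a local-mode run never picks it up; for example (script name is illustrative):

unset HADOOP_CONF_DIR
pig -x local script.pig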

