Error processing complex Twitter JSON objects with Pig JsonLoader() from the elephant-bird jars

Posted: 2015-08-25 05:58:28

Question:

I want to process Twitter JSON objects with Pig using the elephant-bird jars. I wrote the following Pig script for this.

REGISTER '/usr/lib/pig/lib/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/usr/lib/pig/lib/elephant-bird-pig-4.1.jar';

A = LOAD '/user/flume/tweets/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'created_at' AS createdAT;
DUMP B;

Running it gives me the following error:

2015-08-25 11:06:34,295 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1439883208520_0177
2015-08-25 11:06:34,295 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A,B
2015-08-25 11:06:34,295 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[3,4],B[4,4] C:  R:
2015-08-25 11:06:34,303 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-08-25 11:06:34,303 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1439883208520_0177]
2015-08-25 11:07:06,449 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2015-08-25 11:07:06,449 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1439883208520_0177]
2015-08-25 11:07:09,458 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-08-25 11:07:09,458 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1439883208520_0177 has failed! Stop running all dependent jobs
2015-08-25 11:07:09,459 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-08-25 11:07:09,667 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://trinityhadoopmaster.com:8188/ws/v1/timeline/
2015-08-25 11:07:09,668 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at trinityhadoopmaster.com/192.168.1.135:8032
2015-08-25 11:07:09,678 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
2015-08-25 11:07:09,779 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: java.lang.ClassNotFoundException: org.json.simple.parser.ParseException
2015-08-25 11:07:09,779 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2015-08-25 11:07:09,780 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.6.0   0.14.0  hdfs    2015-08-25 11:06:33     2015-08-25 11:07:09     UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1439883208520_0177  A,B     MAP_ONLY        Message: Job failed!    hdfs://trinityhadoopmaster.com:9000/tmp/temp1554332510/tmp835744559,

Input(s):
Failed to read data from "hdfs://trinityhadoopmaster.com:9000/user/flume/tweets/data.json"

Output(s):
Failed to produce result in "hdfs://trinityhadoopmaster.com:9000/tmp/temp1554332510/tmp835744559"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1439883208520_0177


2015-08-25 11:07:09,780 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2015-08-25 11:07:09,787 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B. Backend error : java.lang.ClassNotFoundException: org.json.simple.parser.ParseException
Details at logfile: /tmp/pig-err.log
grunt>

I don't know how to handle this. Can anyone help me solve this problem?

Comments:

For anyone who found this post while searching for ERROR 1066: Unable to open iterator for alias, here is a generic solution.

Answer 1:

REGISTER '/tmp/elephant-bird-core-4.1.jar';

REGISTER '/tmp/elephant-bird-pig-4.1.jar';

REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';

REGISTER '/tmp/google-collections-1.0.jar';

REGISTER '/tmp/json-simple-1.1.jar';

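Registering these jars works. The class the job could not find, org.json.simple.parser.ParseException, belongs to the json-simple library, which the original script did not register.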
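
Putting the answer's REGISTER statements together with the LOAD and FOREACH statements from the question, the full script might look like the sketch below. The /tmp jar locations are taken from the answer and are only an assumption about where the jars sit; adjust them to the actual paths on your cluster (the question used /usr/lib/pig/lib).

```pig
-- Sketch only: the jar paths are assumptions; point them at wherever the jars live.
REGISTER '/tmp/elephant-bird-core-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/google-collections-1.0.jar';
-- json-simple contains org.json.simple.parser.ParseException,
-- the class named in the ClassNotFoundException of the failed job.
REGISTER '/tmp/json-simple-1.1.jar';

-- Same statements as in the question: load each tweet as a nested map.
A = LOAD '/user/flume/tweets/data.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
-- Pull two top-level fields out of the map.
B = FOREACH A GENERATE myMap#'id' AS ID, myMap#'created_at' AS createdAT;
DUMP B;
```

Registered jars are shipped to the backend tasks along with the job, which is presumably why the class was missing only once the map task started rather than when the script was parsed.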
