Apache Pig error while dumping JSON data

Asked: 2016-04-17 08:27:05

I have a JSON file that I want to load with Apache Pig.

I am loading the JSON data with the built-in JsonLoader; a sample of the data is shown below.

cat jsondata1.json
{"response": {"id": 10123, "thread": "Sloths", "comments": ["Sloths are adorable So chill"]}, "response_time": 0.425}
{"response": {"id": 13828, "thread": "Bigfoot", "comments": ["hello world"]}, "response_time": 0.517}
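As a quick sanity check outside Pig, each line of the file should parse on its own as a complete JSON object, since the built-in JsonLoader reads one record per line. A minimal Python sketch (the literal strings below assume the sample records above are well-formed, brace-delimited objects):

```python
import json

# One record per line, as Pig's built-in JsonLoader expects.
lines = [
    '{"response": {"id": 10123, "thread": "Sloths", "comments": ["Sloths are adorable So chill"]}, "response_time": 0.425}',
    '{"response": {"id": 13828, "thread": "Bigfoot", "comments": ["hello world"]}, "response_time": 0.517}',
]

for line in lines:
    record = json.loads(line)
    # "comments" is an array of bare strings, not an array of objects --
    # this detail is what the schema must match.
    assert all(isinstance(c, str) for c in record["response"]["comments"])
    print(record["response"]["id"], record["response_time"])
```

If a line fails `json.loads`, the problem is the data itself rather than the Pig schema.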

Here I load the data with the built-in JsonLoader. The load itself raises no error, but dumping the data produces the error below.

grunt> a = load '/home/cloudera/jsondata1.json' using JsonLoader('response:tuple(id:int, thread:chararray, comments:bag{tuple(comment:chararray)}), response_time:double');

grunt> dump a;

2016-04-17 01:11:13,286 [pool-4-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/home/cloudera/jsondata1.json:0+229
2016-04-17 01:11:13,287 [pool-4-thread-1] WARN  org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address
2016-04-17 01:11:13,311 [pool-4-thread-1] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-04-17 01:11:13,321 [pool-4-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: a[5,4] C:  R: 
2016-04-17 01:11:13,349 [Thread-16] INFO  org.apache.hadoop.mapred.LocalJobRunner - Map task executor complete.
2016-04-17 01:11:13,351 [Thread-16] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local801054416_0004
java.lang.Exception: org.codehaus.jackson.JsonParseException: Current token (FIELD_NAME) not numeric, can not use numeric value accessors
 at [Source: java.io.ByteArrayInputStream@2484de3c; line: 1, column: 120]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: org.codehaus.jackson.JsonParseException: Current token (FIELD_NAME) not numeric, can not use numeric value accessors
 at [Source: java.io.ByteArrayInputStream@2484de3c; line: 1, column: 120]
    at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
    at org.codehaus.jackson.impl.JsonNumericParserBase._parseNumericValue(JsonNumericParserBase.java:399)
    at org.codehaus.jackson.impl.JsonNumericParserBase.getDoubleValue(JsonNumericParserBase.java:311)
    at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:203)
    at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2016-04-17 01:11:13,548 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local801054416_0004
2016-04-17 01:11:13,548 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a
2016-04-17 01:11:13,548 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[5,4] C:  R: 
2016-04-17 01:11:18,059 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-04-17 01:11:18,059 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local801054416_0004 has failed! Stop running all dependent jobs
2016-04-17 01:11:18,059 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-04-17 01:11:18,059 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2016-04-17 01:11:18,060 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2016-04-17 01:11:18,060 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.0.0-cdh4.7.0  0.11.0-cdh4.7.0 cloudera    2016-04-17 01:11:12 2016-04-17 01:11:18 UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local801054416_0004 a   MAP_ONLY    Message: Job failed!    file:/tmp/temp-1766116741/tmp1151698221,

Input(s):
Failed to read data from "/home/cloudera/jsondata1.json"

Output(s):
Failed to produce result in "file:/tmp/temp-1766116741/tmp1151698221"

Job DAG:
job_local801054416_0004


2016-04-17 01:11:18,060 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-04-17 01:11:18,061 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias a
Details at logfile: /home/cloudera/pig_1460877001124.log

I can't figure out what the problem is. How should I define the correct schema for the JSON data above?

Answer 1:

Try this:

comments:{(chararray)}

because this version:

comments:bag{tuple(comment:chararray)}

matches this JSON shape:

"comments": [{"comment": "hello world"}]

whereas you have plain string values, not nested documents:

"comments": ["hello world"]
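Putting that fix into the full load statement gives something like the following sketch (assuming the sample file above; the bag element is left as an unnamed chararray because the JSON array holds bare strings):

```pig
grunt> a = load '/home/cloudera/jsondata1.json'
           using JsonLoader('response:tuple(id:int, thread:chararray, comments:{(chararray)}), response_time:double');
grunt> dump a;
```

The schema only changes for the comments field; id, thread, and response_time are unaffected.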
