Pig script fails with java.io.EOFException: Unexpected end of input stream

Posted: 2014-09-03 10:20:11

[Question]

I have a Pig script that uses a regular expression to pick out a set of fields and stores the data into a Hive table.

--Load data

cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);

--There are two types of data, filter type1 - The field dst_country seems unique there

cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');

--Parse each line and pick up the required fields

cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
 FLATTEN(
 REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
 ) AS (
 rmonth:chararray, rdate:chararray, rtime:chararray, ip:chararray, date:chararray, time:chararray,
 devname:chararray, deviceid:chararray, logid:chararray, type:chararray, subtype:chararray,
 pri:chararray, vd:chararray, src:chararray, srcport:chararray, srcint:chararray, dst:chararray,
 dstport:chararray, dstint:chararray, sn:chararray, status:chararray, policyid:chararray,
 dstcountry:chararray, srccountry:chararray, rest:chararray );

--Store to hive table 

STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();

The script runs fine on small files, but on a larger file (750 MB) it fails with the exception below. Any idea how to debug this and find the root cause? (A stand-alone read-back sketch follows the stack trace.)

2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
        at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
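
One way to narrow this down is to read the archive back outside of Pig through the same Hadoop codec and see whether the compressed stream itself ends early, which would point at a truncated or corrupt .gz rather than at the script. Below is a minimal read-back sketch; the class name and argument handling are made up for illustration, and it assumes the Hadoop client jars are on the classpath:

import java.io.BufferedReader;
import java.io.EOFException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // args[0] is the HDFS path, e.g. the .gz file from the LOAD statement above
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);
        // A .gz suffix resolves to GzipCodec, the same codec the failing map task used
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(path))))) {
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("Stream intact, read " + lines + " lines");
        } catch (EOFException e) {
            // Same failure mode as the Pig job: the gzip stream ended mid-record,
            // i.e. the file was truncated during upload or is corrupt at the source
            System.out.println("Unexpected EOF after " + lines + " lines: " + e.getMessage());
        }
    }
}

If this probe fails with the same EOFException, the 750 MB archive itself is the likely culprit, and re-copying or regenerating it is worth trying before changing the script.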

[Answer 1]

Check the size of the text you are loading into line:chararray. If it is larger than the HDFS block size (64 MB), you will get this error.
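
To check that, one can measure the longest decompressed line and compare it against the block size. Here is a minimal sketch in the same vein as the probe above (hypothetical class name, Hadoop client jars assumed on the classpath; note that a truncated archive will make this read fail with the same EOFException before it finishes):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class MaxLineProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // HDFS path of the .gz file
        FileSystem fs = path.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        long maxLen = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(path))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                maxLen = Math.max(maxLen, line.length());
            }
        }
        // Compare against the 64 MB HDFS block size mentioned above
        System.out.println("Longest line: " + maxLen + " chars");
    }
}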
