Pig script fails with java.io.EOFException: Unexpected end of input stream
Posted: 2014-09-03 10:20:11

I have a Pig script that uses regular expressions to extract a set of fields and store the data into a Hive table.
--Load data
cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);
--There are two types of data, filter type1 - The field dst_country seems unique there
cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');
--Parse each line and pick up the required fields
cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
) AS (
rmonth:charArray, rdate:charArray, rtime:charArray, ip:charArray, date:charArray, time:charArray,
devname:charArray, deviceid:charArray, logid:charArray, type:charArray, subtype:charArray,
pri:charArray, vd:charArray, src:charArray, srcport:charArray, srcint:charArray, dst:charArray,
dstport:charArray, dstint:charArray, sn:charArray, status:charArray, policyid:charArray,
dstcountry:charArray, srccountry:charArray, rest:charArray );
--Store to hive table
STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();
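As a sanity check, the extraction pattern can be exercised outside the cluster. Below is a minimal Python sketch against an invented sample line (every field value is hypothetical, and it assumes the source fields are named src=, src_port=, src_int= and src_country=):

```python
import re

# Same 25-group pattern as in the Pig script (src field names assumed).
PATTERN = (
    r'(.*?)\s(.*?)\s(.*?)\s(.*?)\sdate=(.*?)\s+time=(.*?)\sdevname=(.*?)'
    r'\sdevice_id=(.*?)\slog_id=(.*?)\stype=(.*?)\ssubtype=(.*?)\spri=(.*?)'
    r'\svd=(.*?)\ssrc=(.*?)\ssrc_port=(.*?)\ssrc_int=(.*?)\sdst=(.*?)'
    r'\sdst_port=(.*?)\sdst_int=(.*?)\sSN=(.*?)\sstatus=(.*?)\spolicyid=(.*?)'
    r'\sdst_country=(.*?)\ssrc_country=(.*?)\s(.*?\s.*)+'
)

# Hypothetical FortiGate-style log line, made up for testing only.
sample = (
    'Jun 25 10:20:11 10.1.2.3 date=2014-06-25 time=10:20:11 devname=fw01 '
    'device_id=FG100 log_id=0001 type=traffic subtype=allowed pri=notice '
    'vd=root src=10.0.0.5 src_port=51000 src_int=port1 dst=8.8.8.8 '
    'dst_port=53 dst_int=port2 SN=42 status=accept policyid=7 '
    'dst_country=US src_country=DE proto=17 service=dns'
)

m = re.match(PATTERN, sample)
if m:
    print('matched %d fields' % len(m.groups()))
```

If a real line from the big file fails to match here, REGEX_EXTRACT_ALL would return null for that record, which is a separate problem from the EOFException itself.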
The script runs fine on small files, but on a larger file (750 MB) it fails with the exception below. Any idea how to debug this and find the root cause?
2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
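The trace ends inside DecompressorStream, which is where a truncated or corrupt .gz typically surfaces. One way to debug is to copy the file out of HDFS (e.g. with hadoop fs -get) and stream-decompress it locally; a minimal Python sketch, with a hypothetical local path:

```python
import gzip

def check_gzip(path, chunk_size=1 << 20):
    """Stream-decompress a .gz file; return None if it reads cleanly,
    or the exception explaining why decompression failed."""
    try:
        with gzip.open(path, 'rb') as f:
            while f.read(chunk_size):
                pass
        return None
    except (EOFError, OSError) as exc:  # EOFError => premature end of stream
        return exc

# Hypothetical local copy of the HDFS file:
# print(check_gzip('ec-pix-log.20140625.gz'))
```

A truncated archive comes back as an EOFError here ("Compressed file ended before the end-of-stream marker was reached"), which would be consistent with the "Unexpected end of input stream" above; regenerating or re-uploading the archive would then be the fix.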
【Answer 1】: Check the size of the text you are loading into line:chararray. If the size is greater than the HDFS block size (64 MB), you will get an error.
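Following that suggestion, record sizes can be measured by scanning the decompressed file for its longest line; a minimal Python sketch (the path is hypothetical):

```python
import gzip

def longest_line_bytes(path):
    # Walk the decompressed .gz line by line and report the longest
    # record, to see whether any single line is unexpectedly large.
    max_len = 0
    with gzip.open(path, 'rb') as f:
        for line in f:
            max_len = max(max_len, len(line))
    return max_len

# Hypothetical local copy:
# print(longest_line_bytes('ec-pix-log.20140625.gz'))
```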