[Original] Troubleshooting Notes (17): Spark occasionally throws NullPointerException when querying ORC data
Posted by barneywill
When Spark queries data stored in ORC format, it sometimes fails with this error:
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more
Tracing into the code:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
  static enum SplitStrategyKind {
    HYBRID,
    BI,
    ETL
  }
  ...
    Context(Configuration conf) {
      this.conf = conf;
      minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
      maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
      String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
      if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
        splitStrategyKind = SplitStrategyKind.HYBRID;
      } else {
        LOG.info("Enforcing " + ss + " ORC split strategy");
        splitStrategyKind = SplitStrategyKind.valueOf(ss);
      }
  ...
        switch(context.splitStrategyKind) {
          case BI:
            // BI strategy requested through config
            splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          case ETL:
            // ETL strategy requested through config
            splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          default:
            // HYBRID strategy
            if (avgFileSize > context.maxSize) {
              splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                  deltas, covered);
            } else {
              splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                  deltas, covered);
            }
            break;
        }
org.apache.hadoop.hive.conf.HiveConf.ConfVars
    HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID",
        new StringSet("HYBRID", "BI", "ETL"),
        "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
        " as opposed to query execution (split generation does not read or cache file footers)." +
        " ETL strategy is used when spending little more time in split generation is acceptable" +
        " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
        " based on heuristics."),
As the code shows, hive.exec.orc.split.strategy defaults to HYBRID. Under HYBRID, when the condition
            if (avgFileSize > context.maxSize) {
is not met, the code takes the else branch:
              splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
                  covered);
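BISplitStrategy is the class that throws, as the stack trace shows. I have not yet looked closely at why it fails, but the problem can be avoided by changing the setting: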
set hive.exec.orc.split.strategy=ETL
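To see why this setting sidesteps the failing class, here is a minimal standalone sketch of the selection logic quoted above. It is not the Hive implementation: the class name SplitStrategySketch, the methods resolveKind and choose, and the sizes used in main are all hypothetical and exist only for illustration. It condenses the Context constructor and the switch statement into two small functions.

```java
public class SplitStrategySketch {

    // Same three values as OrcInputFormat.SplitStrategyKind in the excerpt above.
    enum SplitStrategyKind { HYBRID, BI, ETL }

    // Hypothetical helper mirroring the Context constructor:
    // an unset value or "HYBRID" yields HYBRID, anything else is parsed by name.
    static SplitStrategyKind resolveKind(String configured) {
        if (configured == null || configured.equals(SplitStrategyKind.HYBRID.name())) {
            return SplitStrategyKind.HYBRID;
        }
        // valueOf throws IllegalArgumentException for an unknown name.
        return SplitStrategyKind.valueOf(configured);
    }

    // Hypothetical helper mirroring the switch: returns which concrete
    // strategy (BI or ETL) would be instantiated for a directory.
    static SplitStrategyKind choose(SplitStrategyKind kind, long avgFileSize, long maxSize) {
        switch (kind) {
            case BI:
                return SplitStrategyKind.BI;
            case ETL:
                return SplitStrategyKind.ETL;
            default:
                // HYBRID: large files go to ETL, everything else to BI.
                return avgFileSize > maxSize ? SplitStrategyKind.ETL : SplitStrategyKind.BI;
        }
    }

    public static void main(String[] args) {
        // Illustrative sizes only; the real maxSize comes from the job configuration.
        long maxSize = 256L * 1024 * 1024;
        long smallAvg = 10L * 1024 * 1024;
        long largeAvg = 512L * 1024 * 1024;

        // Default (unset) configuration: the outcome depends on the average file size.
        System.out.println(choose(resolveKind(null), smallAvg, maxSize));  // BI
        System.out.println(choose(resolveKind(null), largeAvg, maxSize));  // ETL

        // With the strategy forced to ETL, the BI path is never taken.
        System.out.println(choose(resolveKind("ETL"), smallAvg, maxSize)); // ETL
    }
}
```

Under the default, only directories whose average file size does not exceed maxSize reach BISplitStrategy, which may be why the error shows up only occasionally. Forcing ETL removes the size-dependent branch altogether.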
This works around the problem for now; to be continued.