[Original] Troubleshooting notes (17): Spark occasionally throws NullPointerException when querying ORC-format data

Posted barneywill


When Spark queries ORC-format data, it sometimes fails with this error:

Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more

Tracing into the code:

org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

  static enum SplitStrategyKind {
    HYBRID,
    BI,
    ETL
  }
...

    Context(Configuration conf) {
      this.conf = conf;
      minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
      maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
      String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
      if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
        splitStrategyKind = SplitStrategyKind.HYBRID;
      } else {
        LOG.info("Enforcing " + ss + " ORC split strategy");
        splitStrategyKind = SplitStrategyKind.valueOf(ss);
      }

...
        switch(context.splitStrategyKind) {
          case BI:
            // BI strategy requested through config
            splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          case ETL:
            // ETL strategy requested through config
            splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          default:
            // HYBRID strategy
            if (avgFileSize > context.maxSize) {
              splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
                  covered);
            } else {
              splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
                  covered);
            }
            break;
        }
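The switch above can be modelled in isolation to see which strategy actually runs for a given configuration. The following is a minimal sketch, not Hive code: the class SplitStrategyChoice, its Kind enum and the choose method are hypothetical names, and the sizes in main are illustrative values rather than Hive defaults.

```java
// Hypothetical stand-alone model of the selection logic shown above; not Hive code.
final class SplitStrategyChoice {
    enum Kind { HYBRID, BI, ETL }

    /** Returns the strategy that would actually run: BI or ETL, never HYBRID. */
    static Kind choose(Kind configured, long avgFileSize, long maxSize) {
        switch (configured) {
            case BI:
                // BI requested through config
                return Kind.BI;
            case ETL:
                // ETL requested through config
                return Kind.ETL;
            default:
                // HYBRID: ETL only when the average file size exceeds maxSize
                return avgFileSize > maxSize ? Kind.ETL : Kind.BI;
        }
    }

    public static void main(String[] args) {
        long maxSize = 256L * 1024 * 1024;   // illustrative value only
        long smallAvg = 10L * 1024 * 1024;   // illustrative value only
        System.out.println(choose(Kind.HYBRID, smallAvg, maxSize));
        System.out.println(choose(Kind.ETL, smallAvg, maxSize));
    }
}
```

Under these assumptions, HYBRID with small files resolves to BI, while an explicit ETL setting never reaches the BI branch regardless of file size.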

 

org.apache.hadoop.hive.conf.HiveConf.ConfVars

    HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
        "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
        " as opposed to query execution (split generation does not read or cache file footers)." +
        " ETL strategy is used when spending little more time in split generation is acceptable" +
        " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
        " based on heuristics."),

 

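So hive.exec.orc.split.strategy defaults to HYBRID, and under HYBRID, when the condition

if (avgFileSize > context.maxSize) {

is not met (that is, when the average file size does not exceed maxSize), the else branch runs:

splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);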

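The class that throws is BISplitStrategy, as the stack trace shows. I have not yet looked closely at why this class fails, but the problem can be avoided by changing the setting: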

set hive.exec.orc.split.strategy=ETL
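Judging only by the Context constructor excerpt above, the configured value is passed straight to SplitStrategyKind.valueOf, so the spelling of the value appears to matter. The sketch below reproduces just that parsing step in isolation; SplitStrategyParse, Kind and parse are hypothetical names, and other Hive versions may normalise the value differently.

```java
// Hypothetical stand-alone model of the parsing in the Context constructor; not Hive code.
final class SplitStrategyParse {
    enum Kind { HYBRID, BI, ETL }

    static Kind parse(String ss) {
        // Unset or "HYBRID" keeps the default strategy
        if (ss == null || ss.equals(Kind.HYBRID.name())) {
            return Kind.HYBRID;
        }
        // Enum.valueOf is case-sensitive and throws IllegalArgumentException
        // when no constant has exactly this name
        return Kind.valueOf(ss);
    }

    public static void main(String[] args) {
        System.out.println(parse(null));
        System.out.println(parse("ETL"));
        try {
            parse("etl");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: etl");
        }
    }
}
```

In this model an unset value falls back to HYBRID, "ETL" selects ETL, and a lowercase "etl" is rejected, which is one reason to write the value in uppercase as in the set command above.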

The problem is solved for now; to be continued.

 




