Building Hudi against Hadoop 3.2.4

Posted 2022-11-23 YoungerChina

This article explains how to build Hudi against Hadoop 3.x, using Hudi 0.12.1 and Hadoop 3.2.4.

1. Default configuration

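Hudi's build configuration file is <HUDI_HOME>/pom.xml (where <HUDI_HOME> is the Hudi root directory), and the Hadoop version configured there by default is 2.10.1. The relevant setting is the hadoop.version property (shown in a figure in the original post).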

Build parameter -Dhadoop: selects Hadoop 2.10.1.

2. Build methods

There are two ways to build Hudi against Hadoop 3.x. (This example uses the second.)

I. Specify the version on the build command line

mvn clean install -DskipTests -Drat.skip=true -Dhadoop.version=3.2.4

II. Modify the configuration file

In <HUDI_HOME>/pom.xml, change hadoop.version to 3.2.4.

Then build against Hadoop with the following command:

mvn clean install -DskipTests -Drat.skip=true -Dhadoop

A successful build prints the following:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11:25 min
[INFO] Finished at: 2022-10-17T11:01:35+08:00
[INFO] ------------------------------------------------------------------------

Make sure the "BUILD SUCCESS" message appears.

Then run hudi-cli.sh:

# ./hudi-cli/hudi-cli.sh 
Client jar location not set, please set it in conf/hudi-env.sh
……
Welcome to Apache Hudi CLI. Please type help if you are looking for help. 
hudi->

3. Problems

HoodieParquetDataBlock: no suitable FSDataOutputStream constructor that accepts a ByteArrayOutputStream

I. Problem description

When building Hudi 0.12.0 against Hadoop 3.2.4, compilation fails: no suitable constructor found for FSDataOutputStream(java.io.ByteArrayOutputStream). The error log is as follows:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project hudi-common: Compilation failure
[ERROR] /D:/Workspace/Apache/apache-hudi/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java:[112,44] no suitable constructor found for FSDataOutputStream(java.io.ByteArrayOutputStream)
[ERROR] org.apache.hadoop.fs.FSDataOutputStream.FSDataOutputStream(java.io.OutputStream,org.apache.hadoop.fs.FileSystem.Statistics)
[ERROR] org.apache.hadoop.fs.FSDataOutputStream.FSDataOutputStream(java.io.OutputStream,org.apache.hadoop.fs.FileSystem.Statistics,long)
[ERROR]

II. Problem analysis

The offending code is:

    try (FSDataOutputStream outputStream = new FSDataOutputStream(baos)) {
      try (HoodieParquetStreamWriter<IndexedRecord> parquetWriter = new HoodieParquetStreamWriter<>(outputStream, avroParquetConfig)) {
        for (IndexedRecord record : records) {
          String recordKey = getRecordKey(record).orElse(null);
          parquetWriter.writeAvro(recordKey, record);
        }
        outputStream.flush();
      }
    }

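Looking at the FSDataOutputStream source for Hadoop 3.2.4, only two constructors remain, and neither accepts a bare OutputStream. Both require a Statistics stats argument, and one additionally takes long startPosition; these are used to collect I/O statistics for the output stream.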

  public FSDataOutputStream(OutputStream out, Statistics stats) {
    this(out, stats, 0L);
  }

  public FSDataOutputStream(OutputStream out, Statistics stats, long startPosition) {
    super(new FSDataOutputStream.PositionCache(out, stats, startPosition));
    this.wrappedStream = out;
  }

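To enable statistics, a Statistics object must be passed in. Statistics has two constructors: one takes the file system's scheme, the other takes another Statistics object. The second is not an option here, so what about the first? Can HoodieParquetDataBlock get hold of the file system's scheme?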

    public Statistics(String scheme) {
      this.scheme = scheme;
      this.rootData = new FileSystem.Statistics.StatisticsData();
      this.threadData = new ThreadLocal();
      this.allData = new HashSet();
    }

    public Statistics(FileSystem.Statistics other) {
      this.scheme = other.scheme;
      this.rootData = new FileSystem.Statistics.StatisticsData();
      other.visitAll(new FileSystem.Statistics.StatisticsAggregator<Void>() {
        public void accept(FileSystem.Statistics.StatisticsData data) {
          Statistics.this.rootData.add(data);
        }

        public Void aggregate() {
          return null;
        }
      });
      this.threadData = new ThreadLocal();
      this.allData = new HashSet();
    }

It cannot, so the only option is to pass null for the statistics. Modify the source file ./hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java accordingly.
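The post does not show the edited line. Given the two-argument constructor listed above, it is presumably new FSDataOutputStream(baos, null) in place of new FSDataOutputStream(baos). The sketch below illustrates why a null statistics argument is harmless for an in-memory buffer. It uses only the JDK: ByteStats, CountingStream and NullStatsDemo are hypothetical stand-ins, not Hadoop classes, and the null check reflects an assumption about how the position-tracking wrapper treats a missing Statistics object.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class NullStatsDemo {

  // Hypothetical stand-in for FileSystem.Statistics: only counts bytes written.
  static class ByteStats {
    long bytesWritten;
  }

  // Hypothetical stand-in for the position-tracking wrapper used by FSDataOutputStream.
  // Assumption: statistics are updated only when a stats object was supplied.
  static class CountingStream extends FilterOutputStream {
    private final ByteStats stats; // may be null
    private long position;

    CountingStream(OutputStream out, ByteStats stats) {
      super(out);
      this.stats = stats;
    }

    @Override
    public void write(int b) throws IOException {
      out.write(b);
      position++;
      if (stats != null) {
        stats.bytesWritten++;
      }
    }

    long getPos() {
      return position;
    }
  }

  // Writes the text through the wrapper into the sink and returns the final position.
  static long writeAll(String text, ByteArrayOutputStream sink, ByteStats stats) {
    try (CountingStream stream = new CountingStream(sink, stats)) {
      stream.write(text.getBytes(StandardCharsets.UTF_8));
      stream.flush();
      return stream.getPos();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    long pos = writeAll("hudi", baos, null); // null stats, mirroring the fix
    System.out.println("position=" + pos + ", buffered bytes=" + baos.size());
  }
}
```

With null, the bytes still reach the buffer and the position still advances; only the statistics bookkeeping is skipped, which is acceptable here because the block is serialized in memory rather than written to a file system.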

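Other Hudi code passes null in the same way, so this is common practice. After the change, the build completes successfully and the result can be used with confidence.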
