Data Lake Architecture with Hudi: Compiling Hudi 0.12 from Source, Integrating Hudi with Spark, and Using IDEA and Spark for CRUD on Hudi Tables


2. Quick Start with the Hudi Data Lake

2.1 Compiling the Hudi Source Code

Component versions used in this walkthrough:

Hadoop 3.1.3
Hive 3.1.2
Flink 1.13.6, Scala 2.12
Spark 3.2.2, Scala 2.12

2.1.1 Environment Preparation

[root@centos04 bin]# mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apps/apache-maven-3.6.3
Java version: 1.8.0_141, vendor: Oracle Corporation, runtime: /opt/apps/jdk1.8.0_141/jre
Default locale: en_US, platform encoding: UTF-8


[root@centos04 bin]# java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)

2.1.2 Download the Source Package

wget http://archive.apache.org/dist/hudi/0.12.0/hudi-0.12.0.src.tgz


tar -zxvf ./hudi-0.12.0.src.tgz


[root@centos04 apps]# ll
total 4
drwxr-xr-x.  6 root root   126 Feb 28 18:12 apache-maven-3.6.3
drwxr-xr-x. 22  501 games 4096 Aug 16  2022 hudi-0.12.0
drwxr-xr-x.  8   10   143  255 Jul 12  2017 jdk1.8.0_141

2.1.3 Add a Repository to the pom File to Speed Up Dependency Downloads

# Edit the pom file
vim /opt/apps/hudi-0.12.0/pom.xml


# Add the following repository inside the <repositories> element to speed up dependency downloads
<repository>
        <id>nexus-aliyun</id>
        <name>nexus-aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
</repository>

Also modify the versions of the dependent components in the pom file:

<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>

2.1.4 Modify the Source for Hadoop 3 Compatibility and Add the Kafka Dependencies

Hudi depends on Hadoop 2 by default. To be compatible with Hadoop 3, besides changing the version property, the following source file also needs to be modified:

vim /opt/apps/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java

Otherwise compilation fails because of the incompatibility between Hadoop 2.x and 3.x (no suitable FSDataOutputStream constructor can be found).
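
The change itself is small: HoodieParquetDataBlock wraps a ByteArrayOutputStream in an FSDataOutputStream using the single-argument constructor, which Hadoop 3.x no longer provides. A sketch of the usual fix follows; the exact line (around 110 in the 0.12.0 source) and the variable name baos may differ slightly in your copy:

// Before: only compiles against Hadoop 2.x, which still had FSDataOutputStream(OutputStream)
FSDataOutputStream outputStream = new FSDataOutputStream(baos);

// After: pass an explicit null Statistics argument so the Hadoop 3.x two-argument constructor is used
FSDataOutputStream outputStream = new FSDataOutputStream(baos, null);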

  • A few Kafka-related dependencies must be installed manually, otherwise the build fails. Download them from http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip (a sketch of extracting and locating the jars follows the list below).
 
# After unzipping, find the following jars and upload them to the build server
common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar
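
A sketch of pulling the archive and locating those jars on the build server; the directory layout inside the zip is an assumption, so find is used rather than hard-coded paths:

# download and unpack the Confluent 5.3.4 archive
wget http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip
unzip confluent-5.3.4-2.12.zip

# locate the four required jars under the extracted directory
find . -name "*-5.3.4.jar" | grep -E "common-config|common-utils|kafka-avro-serializer|kafka-schema-registry-client"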

Install them into the local Maven repository:

mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar
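
To confirm the artifacts landed, check the local repository (the default location ~/.m2/repository is assumed here):

ls ~/.m2/repository/io/confluent/common-config/5.3.4/
ls ~/.m2/repository/io/confluent/kafka-schema-registry-client/5.3.4/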

2.1.5 Resolve the Spark Module Dependency Conflicts

After changing the Hive version to 3.1.2, the Jetty it pulls in is 9.3.x, while Hudi itself uses 9.4.x, so there is a dependency conflict.

2.1.5.1 Modify the hudi-spark-bundle pom File

Goal: exclude the lower-version Jetty and add the Jetty version pinned by Hudi.

pom file location: vim /opt/apps/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml (around line 382)

    <!-- Hive -->
    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service-rpc</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-common</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Add the Jetty version pinned by Hudi -->
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-server</artifactId>
      <version>${jetty.version}</version>
    </dependency>

    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-util</artifactId>
      <version>${jetty.version}</version>
    </dependency>

    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-webapp</artifactId>
      <version>${jetty.version}</version>
    </dependency>

    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>${jetty.version}</version>
    </dependency>

Otherwise, when Spark writes data into a Hudi table, the following error is thrown:

java.lang.NoSuchMethodError: org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)

2.1.5.2 Modify the hudi-utilities-bundle pom File

Goal: exclude the lower-version Jetty and add the Jetty version pinned by Hudi.

Location: vim /opt/apps/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml (around line 405)

     <!-- Hoodie -->
    <dependency>
      <groupId>org.apache.hudi</groupId>
      <artifactId>hudi-common</artifactId>
      <version>${project.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.hudi</groupId>
      <artifactId>hudi-client-common</artifactId>
      <version>${project.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Hive -->
    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
       <exclusion>
          <artifactId>servlet-api</artifactId>
          <groupId>javax.servlet</groupId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service-rpc</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-common</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

As in the hudi-spark-bundle pom, finish by adding the Jetty dependencies pinned by Hudi (jetty-server, jetty-util, jetty-webapp, and jetty-http, each at ${jetty.version}) after the Hive dependencies.
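
With the pom adjustments in place, build from the source root. A typical invocation for the versions listed above is sketched below; the profile and property names (-Dspark3.2, -Dflink1.13, -Dscala-2.12, -Pflink-bundle-shade-hive3) assume the standard Hudi 0.12 build profiles, so adjust them if your checkout differs:

cd /opt/apps/hudi-0.12.0
mvn clean package -DskipTests -Dspark3.2 -Dflink1.13 -Dscala-2.12 -Dhadoop.version=3.1.3 -Pflink-bundle-shade-hive3

If the build succeeds, the bundle jars (for example the Spark bundle) should appear under the packaging/*/target/ directories.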
