Data Lake Architecture with Hudi: Compiling Hudi 0.12 from Source, Integrating Hudi with Spark, and CRUD on Hudi Tables from IDEA with Spark
Part 2: Hudi Quick Start
2.1 Compiling the Hudi Source Code
Component versions used in this guide:
| Component | Version |
| --- | --- |
| Hadoop | 3.1.3 |
| Hive | 3.1.2 |
| Flink | 1.13.6 (Scala 2.12) |
| Spark | 3.2.2 (Scala 2.12) |
2.1.1 Environment Preparation
Verify that Maven and the JDK are available on the build machine:
[root@centos04 bin]# mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apps/apache-maven-3.6.3
Java version: 1.8.0_141, vendor: Oracle Corporation, runtime: /opt/apps/jdk1.8.0_141/jre
Default locale: en_US, platform encoding: UTF-8
[root@centos04 bin]# java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
2.1.2 Downloading the Source Package
wget http://archive.apache.org/dist/hudi/0.12.0/hudi-0.12.0.src.tgz
tar -zxvf ./hudi-0.12.0.src.tgz
[root@centos04 apps]# ll
total 4
drwxr-xr-x. 6 root root 126 Feb 28 18:12 apache-maven-3.6.3
drwxr-xr-x. 22 501 games 4096 Aug 16 2022 hudi-0.12.0
drwxr-xr-x. 8 10 143 255 Jul 12 2017 jdk1.8.0_141
2.1.3 Adding a Repository to pom.xml to Speed Up Dependency Downloads
# Edit the pom file
vim /opt/apps/hudi-0.12.0/pom.xml
# Add the following repository (inside the existing <repositories> section) to speed up dependency downloads
<repository>
    <id>nexus-aliyun</id>
    <name>nexus-aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>
Also update the versions of the dependent components in the same pom.xml:
<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>
2.1.4 Modifying the Source for Hadoop 3 Compatibility and Adding Kafka Dependencies
Hudi depends on Hadoop 2 by default. To build against Hadoop 3, besides changing the version property, the following file must also be modified:
vim /opt/apps/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java
Without this change the build fails due to a Hadoop 2.x/3.x incompatibility (no suitable `FSDataOutputStream` constructor is found); the usual fix in this file is to replace the single-argument call `new FSDataOutputStream(baos)` with the two-argument form `new FSDataOutputStream(baos, null)`, since the single-argument constructor is not available in Hadoop 3.
- Several Kafka-related dependencies must be installed into the local repository manually, or compilation fails.
Download them from: http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip
# After unzipping, locate the following jars and upload them to the build server
common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar
Install them into the local Maven repository:
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar
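Since the four `install:install-file` invocations above differ only in the artifact name, they can equivalently be run as a loop. This is a convenience sketch; it assumes the jars extracted from confluent-5.3.4-2.12.zip sit in the current directory:

```shell
# Install the four Confluent artifacts into the local Maven repository.
# Assumes the jars from confluent-5.3.4-2.12.zip are in the current directory.
for artifact in common-config common-utils kafka-avro-serializer kafka-schema-registry-client; do
    mvn install:install-file -DgroupId=io.confluent -DartifactId="$artifact" \
        -Dversion=5.3.4 -Dpackaging=jar -Dfile="./$artifact-5.3.4.jar"
done
```

Either form leaves the artifacts under `~/.m2/repository/io/confluent/`, where the Hudi build can resolve them.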
2.1.5 Resolving Dependency Conflicts in the Spark Modules
After switching Hive to 3.1.2, it pulls in Jetty 9.3.x, while Hudi itself uses Jetty 9.4.x, so there is a dependency conflict.
2.1.5.1 Modifying the hudi-spark-bundle pom.xml
Goal: exclude the lower-version Jetty and add the Jetty version that Hudi expects.
pom.xml location:
vim /opt/apps/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml
(around line 382; replace the Hive dependency section with the following)
<!-- Hive -->
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-service</artifactId>
    <version>${hive.version}</version>
    <scope>${spark.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <artifactId>guava</artifactId>
            <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.pentaho</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-service-rpc</artifactId>
    <version>${hive.version}</version>
    <scope>${spark.bundle.hive.scope}</scope>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
    <scope>${spark.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet.jsp</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>${hive.version}</version>
    <scope>${spark.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.datanucleus</groupId>
            <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet.jsp</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <artifactId>guava</artifactId>
            <groupId>com.google.guava</groupId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-common</artifactId>
    <version>${hive.version}</version>
    <scope>${spark.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <groupId>org.eclipse.jetty.orbit</groupId>
            <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Add the Jetty version that Hudi is configured with -->
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-server</artifactId>
    <version>${jetty.version}</version>
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-util</artifactId>
    <version>${jetty.version}</version>
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-webapp</artifactId>
    <version>${jetty.version}</version>
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-http</artifactId>
    <version>${jetty.version}</version>
</dependency>
Without these changes, inserting data into a Hudi table with Spark fails with:
java.lang.NoSuchMethodError: org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)
(The relocated `org.apache.hudi.org.apache.jetty` package shows the error comes from the Jetty classes shaded into the bundle.)
2.1.5.2 Modifying the hudi-utilities-bundle pom.xml
Goal: exclude the lower-version Jetty and add the Jetty version that Hudi expects.
Location:
vim /opt/apps/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml
(around line 405; replace the Hoodie and Hive dependency sections with the following)
<!-- Hoodie -->
<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-common</artifactId>
    <version>${project.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-client-common</artifactId>
    <version>${project.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Hive -->
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-service</artifactId>
    <version>${hive.version}</version>
    <scope>${utilities.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <artifactId>servlet-api</artifactId>
            <groupId>javax.servlet</groupId>
        </exclusion>
        <exclusion>
            <artifactId>guava</artifactId>
            <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.pentaho</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-service-rpc</artifactId>
    <version>${hive.version}</version>
    <scope>${utilities.bundle.hive.scope}</scope>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
    <scope>${utilities.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet.jsp</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>${hive.version}</version>
    <scope>${utilities.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.datanucleus</groupId>
            <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet.jsp</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <artifactId>guava</artifactId>
            <groupId>com.google.guava</groupId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>${hive.groupid}</groupId>
    <artifactId>hive-common</artifactId>
    <version>${hive.version}</version>
    <scope>${utilities.bundle.hive.scope}</scope>
    <exclusions>
        <exclusion>
            <groupId>org.eclipse.jetty.orbit</groupId>
            <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
As in 2.1.5.1, the four Jetty dependencies (jetty-server, jetty-util, jetty-webapp, jetty-http, each at ${jetty.version}) are then added after these Hive dependencies.
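With the pom changes in place, the build can be run. The exact flags below are a sketch based on the Hudi 0.12 build conventions for Spark 3.2 / Scala 2.12 and the versions in the table in 2.1; verify the profile and property names against your pom (for example with `mvn help:all-profiles`) before relying on them:

```shell
# Build Hudi 0.12.0 for Spark 3.2 / Scala 2.12 against Hadoop 3.1.3.
# Flag names assumed from the Hudi 0.12 build conventions; verify against the pom.
cd /opt/apps/hudi-0.12.0
mvn clean package -DskipTests \
    -Dspark3.2 -Dscala-2.12 \
    -Dhadoop.version=3.1.3 \
    -Pflink-bundle-shade-hive3
```

After a successful build, the Spark bundle jar should appear under packaging/hudi-spark-bundle/target/, ready to be placed on the Spark classpath.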