Integrating Spark with Hive


Command-Line Integration with Hive

Copy Hive's hive-site.xml configuration file into Spark's conf directory; only the following entries are needed:

<configuration>
  <property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://ip:port/hive?serverTimezone=Asia/Shanghai</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>root</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>xxx</value>
  </property>
</configuration>

Then copy the MySQL driver jar from Hive's lib directory into Spark's jars directory.
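Something like the following (a minimal sketch; HIVE_HOME and SPARK_HOME are assumed to point at your actual installations):

# Assumed paths -- adjust to your environment
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
cp $HIVE_HOME/lib/mysql-connector-java-*.jar $SPARK_HOME/jars/

Then start the shell: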

bin/spark-sql

After that you can work in spark-sql much as you would in Hive.

insert into tb_spark(name,age) values('lisi',23); -- Hive syntax
insert into tb_spark values('lisi',23);           -- Spark SQL syntax

Note that the insert fails when column names are specified. The exact cause is unclear; most likely it is a version limitation of Spark SQL.
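If you need an explicit column mapping anyway, one possible workaround (an untested sketch) is to route the values through a SELECT, supplying them positionally for every column of the table:

-- Sketch: provide a value for every column, in table order,
-- padding any column you would have omitted with NULL
insert into tb_spark select 'lisi', 23;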

Integrating Hive in Code

Add the Spark Hive integration and the MySQL driver as Maven dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.29</version>
</dependency>

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
  * Reading Hive tables through Spark SQL.
  */
object SparkSQLReadHive {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")

    val sparkSession = SparkSession.builder()
      .appName("SparkSQLReadHive")
      .config(conf)
      .config("spark.sql.warehouse.dir", "hdfs://bigdata01:9000/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()

    sparkSession.sql("select * from student").show()

    sparkSession.stop()
  }
}
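The same session can also write back to Hive. A minimal sketch, reusing the sparkSession from the example above (the table tb_spark and its (name, age) schema are assumptions carried over from the command-line section):

// Sketch: build a small DataFrame and append it to a Hive table
import sparkSession.implicits._
val df = Seq(("lisi", 23)).toDF("name", "age")
df.write.mode("append").saveAsTable("tb_spark")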

Running this locally on Windows fails:

Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
	at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:587)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:562)
	at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
	at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)

Solution

  1. Download and unpack Hadoop locally.
  2. Download winutils.exe and place it in Hadoop's bin directory.
  3. Set the HADOOP_HOME environment variable, or set it in code (see the sketch after this list):
    System.setProperty("hadoop.home.dir","C:\\D-myfiles\\software\\hadoop-3.2.0\\hadoop-3.2.0")
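A note on step 3: if you set the property in code, it has to happen before the SparkSession is built, since Hive session initialization reads it during startup. A minimal sketch (the path is the author's local one and is machine-specific):

object SparkSQLReadHive {
  def main(args: Array[String]): Unit = {
    // Set before SparkSession.builder() so Hadoop's Shell utilities
    // can locate winutils.exe (path is machine-specific)
    System.setProperty("hadoop.home.dir", "C:\\D-myfiles\\software\\hadoop-3.2.0\\hadoop-3.2.0")
    // ... build the SparkSession exactly as shown above ...
  }
}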
    

Then another error appears:

Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)

According to solutions found online, you need to run:

winutils.exe chmod 777 C:\tmp\hive

but that in turn fails with:

The code execution cannot proceed because MSVCR100.dll was not found.

Too much hassle; setting it aside for now. (MSVCR100.dll is part of the Microsoft Visual C++ 2010 Redistributable, so installing that package should presumably let winutils.exe run.)

References

Fixing "The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ------" on Windows
A summary of issues when connecting to Hive from local Spark

Integrating Spark with Hive by Configuring hive-site.xml

Before Configuration

[root@node1 ~]# cd /export/server/spark-2.4.5-bin-hadoop2.7/
[root@node1 spark-2.4.5-bin-hadoop2.7]# ll
[root@node1 spark-2.4.5-bin-hadoop2.7]# cd bin/
[root@node1 bin]# ll
total 112
-rwxr-xr-x 1 user1 user1 1089 Feb  3  2020 beeline
-rw-r--r-- 1 user1 user1 1064 Feb  3  2020 beeline.cmd
-rwxr-xr-x 1 user1 user1 5440 Feb  3  2020 docker-image-tool.sh
-rwxr-xr-x 1 user1 user1 1933 Feb  3  2020 find-spark-home
-rw-r--r-- 1 user1 user1 2681 Feb  3  2020 find-spark-home.cmd
-rw-r--r-- 1 user1 user1 1892 Feb  3  2020 load-spark-env.cmd
-rw-r--r-- 1 user1 user1 2025 Feb  3  2020 load-spark-env.sh
-rwxr-xr-x 1 user1 user1 2987 Feb  3  2020 pyspark
-rw-r--r-- 1 user1 user1 1540 Feb  3  2020 pyspark2.cmd
-rw-r--r-- 1 user1 user1 1170 Feb  3  2020 pyspark.cmd
-rwxr-xr-x 1 user1 user1 1030 Feb  3  2020 run-example
-rw-r--r-- 1 user1 user1 1223 Feb  3  2020 run-example.cmd
-rwxr-xr-x 1 user1 user1 3196 Feb  3  2020 spark-class
-rw-r--r-- 1 user1 user1 2817 Feb  3  2020 spark-class2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb  3  2020 spark-class.cmd
-rwxr-xr-x 1 user1 user1 1039 Feb  3  2020 sparkR
-rw-r--r-- 1 user1 user1 1097 Feb  3  2020 sparkR2.cmd
-rw-r--r-- 1 user1 user1 1168 Feb  3  2020 sparkR.cmd
-rwxr-xr-x 1 user1 user1 3122 Feb  3  2020 spark-shell
-rw-r--r-- 1 user1 user1 1818 Feb  3  2020 spark-shell2.cmd
-rw-r--r-- 1 user1 user1 1178 Feb  3  2020 spark-shell.cmd
-rwxr-xr-x 1 user1 user1 1065 Feb  3  2020 spark-sql
-rw-r--r-- 1 user1 user1 1118 Feb  3  2020 spark-sql2.cmd
-rw-r--r-- 1 user1 user1 1173 Feb  3  2020 spark-sql.cmd
-rwxr-xr-x 1 user1 user1 1040 Feb  3  2020 spark-submit
-rw-r--r-- 1 user1 user1 1155 Feb  3  2020 spark-submit2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb  3  2020 spark-submit.cmd
[root@node1 bin]# spark-sql 
21/09/02 16:32:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/09/02 16:32:26 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
21/09/02 16:32:26 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
Spark master: local[*], Application Id: local-1630571548369
spark-sql> show databases;
default
Time taken: 2.785 seconds, Fetched 1 row(s)
spark-sql> use d
data           date           date(          date_add(      date_format(   date_sub(      datediff(      datetime       
day(           dayofmonth(    decimal(       decode(        defined        degrees(       delimited      dense_rank(    
desc           describe       directory      distinct       distribute     div(           double         double(        
drop           
spark-sql> use default
         > ;
21/09/02 16:33:01 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Time taken: 0.034 seconds
spark-sql> exit;
[root@node1 bin]# 

Integration Approach

The simplest approach is used: just drop the hive-site.xml configuration file into Spark's conf directory, and Spark will connect to Hive's Thrift metastore service according to that file (the Hive metastore server must, of course, already be running for Spark to connect successfully).
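A minimal sketch of the two prerequisites (the Spark path below is this cluster's; `hive --service metastore` is the stock way to start the metastore, listening on port 9083 by default):

# On node3: start the Hive metastore Thrift service
nohup hive --service metastore &

# On node1: put the config where Spark will pick it up
cp hive-site.xml /export/server/spark-2.4.5-bin-hadoop2.7/conf/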

hive-site.xml Configuration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>node3</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node3:9083</value>
  </property>

</configuration>

Integration

[root@node1 spark-2.4.5-bin-hadoop2.7]# cd /export/server/spark-2.4.5-bin-hadoop2.7/conf
[root@node1 conf]# ll
total 36
-rw-r--r-- 1 user1 user1  996 Feb  3  2020 docker.properties.template
-rw-r--r-- 1 user1 user1 1105 Feb  3  2020 fairscheduler.xml.template
-rw-r--r-- 1 user1 user1 2059 Aug 19 16:49 log4j.properties
-rw-r--r-- 1 user1 user1 7801 Feb  3  2020 metrics.properties.template
-rw-r--r-- 1 user1 user1  885 Aug 19 16:36 slaves
-rw-r--r-- 1 user1 user1 1502 Aug 19 17:41 spark-defaults.conf
-rwxr-xr-x 1 user1 user1 4705 Aug 19 16:42 spark-env.sh
[root@node1 conf]# rz
rz waiting to receive.
Starting zmodem transfer.  Press Ctrl+C to cancel.
Transferring hive-site.xml...
  100%       1 KB       1 KB/sec    00:00:01       0 Errors  

[root@node1 conf]# ll
total 40
-rw-r--r-- 1 user1 user1  996 Feb  3  2020 docker.properties.template
-rw-r--r-- 1 user1 user1 1105 Feb  3  2020 fairscheduler.xml.template
-rw-r--r-- 1 root  root  1844 May  2 19:52 hive-site.xml
-rw-r--r-- 1 user1 user1 2059 Aug 19 16:49 log4j.properties
-rw-r--r-- 1 user1 user1 7801 Feb  3  2020 metrics.properties.template
-rw-r--r-- 1 user1 user1  885 Aug 19 16:36 slaves
-rw-r--r-- 1 user1 user1 1502 Aug 19 17:41 spark-defaults.conf
-rwxr-xr-x 1 user1 user1 4705 Aug 19 16:42 spark-env.sh
[root@node1 conf]# cd ..
[root@node1 spark-2.4.5-bin-hadoop2.7]# ll
total 104
drwxr-xr-x 3 user1 user1  4096 Sep  2 16:32 bin
drwxr-xr-x 2 user1 user1   215 Sep  2 16:37 conf
drwxr-xr-x 5 user1 user1    50 Feb  3  2020 data
drwxr-xr-x 4 user1 user1    29 Feb  3  2020 examples
drwxr-xr-x 2 user1 user1 12288 Feb  3  2020 jars
drwxr-xr-x 4 user1 user1    38 Feb  3  2020 kubernetes
-rw-r--r-- 1 user1 user1 21371 Feb  3  2020 LICENSE
drwxr-xr-x 2 user1 user1  4096 Feb  3  2020 licenses
-rw-r--r-- 1 user1 user1 42919 Feb  3  2020 NOTICE
drwxr-xr-x 9 user1 user1   311 Feb  3  2020 python
drwxr-xr-x 3 user1 user1    17 Feb  3  2020 R
-rw-r--r-- 1 user1 user1  3756 Feb  3  2020 README.md
-rw-r--r-- 1 user1 user1   187 Feb  3  2020 RELEASE
drwxr-xr-x 2 user1 user1  4096 Feb  3  2020 sbin
drwxr-xr-x 2 user1 user1    42 Feb  3  2020 yarn
[root@node1 spark-2.4.5-bin-hadoop2.7]# cd bin/
[root@node1 bin]# ll
total 116
-rwxr-xr-x 1 user1 user1 1089 Feb  3  2020 beeline
-rw-r--r-- 1 user1 user1 1064 Feb  3  2020 beeline.cmd
-rw-r--r-- 1 root  root   724 Sep  2 16:32 derby.log
-rwxr-xr-x 1 user1 user1 5440 Feb  3  2020 docker-image-tool.sh
-rwxr-xr-x 1 user1 user1 1933 Feb  3  2020 find-spark-home
-rw-r--r-- 1 user1 user1 2681 Feb  3  2020 find-spark-home.cmd
-rw-r--r-- 1 user1 user1 1892 Feb  3  2020 load-spark-env.cmd
-rw-r--r-- 1 user1 user1 2025 Feb  3  2020 load-spark-env.sh
drwxr-xr-x 5 root  root   133 Sep  2 16:32 metastore_db
-rwxr-xr-x 1 user1 user1 2987 Feb  3  2020 pyspark
-rw-r--r-- 1 user1 user1 1540 Feb  3  2020 pyspark2.cmd
-rw-r--r-- 1 user1 user1 1170 Feb  3  2020 pyspark.cmd
-rwxr-xr-x 1 user1 user1 1030 Feb  3  2020 run-example
-rw-r--r-- 1 user1 user1 1223 Feb  3  2020 run-example.cmd
-rwxr-xr-x 1 user1 user1 3196 Feb  3  2020 spark-class
-rw-r--r-- 1 user1 user1 2817 Feb  3  2020 spark-class2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb  3  2020 spark-class.cmd
-rwxr-xr-x 1 user1 user1 1039 Feb  3  2020 sparkR
-rw-r--r-- 1 user1 user1 1097 Feb  3  2020 sparkR2.cmd
-rw-r--r-- 1 user1 user1 1168 Feb  3  2020 sparkR.cmd
-rwxr-xr-x 1 user1 user1 3122 Feb  3  2020 spark-shell
-rw-r--r-- 1 user1 user1 1818 Feb  3  2020 spark-shell2.cmd
-rw-r--r-- 1 user1 user1 1178 Feb  3  2020 spark-shell.cmd
-rwxr-xr-x 1 user1 user1 1065 Feb  3  2020 spark-sql
-rw-r--r-- 1 user1 user1 1118 Feb  3  2020 spark-sql2.cmd
-rw-r--r-- 1 user1 user1 1173 Feb  3  2020 spark-sql.cmd
-rwxr-xr-x 1 user1 user1 1040 Feb  3  2020 spark-submit
-rw-r--r-- 1 user1 user1 1155 Feb  3  2020 spark-submit2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb  3  2020 spark-submit.cmd
[root@node1 bin]# sp
spark-class   sparkR        spark-shell   spark-sql     spark-submit  splain        split         sprof         
[root@node1 bin]# spark-sql 
21/09/02 16:37:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1630571880126
spark-sql> show databases;
aaa
default
Time taken: 1.886 seconds, Fetched 2 row(s)
spark-sql> select * from aaa.test1;
1       A1
2       A2
3       A3
4       A4
5       A5
6       A6
Time taken: 1.253 seconds, Fetched 6 row(s)
spark-sql> exit;
[root@node1 bin]# 

Once integrated, this works much like Hive's Beeline.
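For an actual Beeline-style client/server setup, Spark also ships its own Thrift JDBC/ODBC server. A minimal sketch (the hostname is this cluster's node1; port 10000 is the server's default):

# Start Spark's Thrift JDBC/ODBC server (reads the same hive-site.xml)
sbin/start-thriftserver.sh

# Connect with Beeline
bin/beeline -u jdbc:hive2://node1:10000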
