Spark Integration with Hive
Posted by strongmore
Command-Line Integration with Hive
Copy Hive's hive-site.xml configuration file into Spark's conf directory; only the following content is needed:
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://ip:port/hive?serverTimezone=Asia/Shanghai</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>xxx</value>
</property>
</configuration>
Then copy the MySQL driver jar from Hive's lib directory into Spark's jars directory.
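A minimal sketch of both copy steps, assuming Hive is installed under /opt/hive and Spark under /opt/spark (both paths are hypothetical; adjust to your installation):

# copy the trimmed Hive config into Spark's conf directory
cp /opt/hive/conf/hive-site.xml /opt/spark/conf/
# copy the MySQL JDBC driver from Hive's lib into Spark's jars
cp /opt/hive/lib/mysql-connector-java-*.jar /opt/spark/jars/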
bin/spark-sql
After that you can work with spark-sql just as you would with Hive.
insert into tb_spark(name,age) values('lisi',23); -- Hive style
insert into tb_spark values('lisi',23);           -- Spark SQL style
Column names cannot be specified when inserting data through spark-sql; the exact reason is unclear, most likely a limitation of this Spark SQL version rather than a configuration problem.
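For context, a hypothetical definition of the tb_spark table targeted above (the DDL is not shown in this post); any Hive table with these two columns would behave the same way:

create table tb_spark(name string, age int);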
Integrating Hive from Code
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.29</version>
</dependency>
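If the project were built with sbt instead of Maven, the equivalent dependencies would look roughly like this (assuming scalaVersion is set to a 2.11.x release so that %% resolves to spark-hive_2.11):

libraryDependencies ++= Seq(
  // Spark SQL with Hive support; matches the Maven coordinates above
  "org.apache.spark" %% "spark-hive" % "2.4.3",
  // MySQL JDBC driver used for the Hive metastore connection
  "mysql" % "mysql-connector-java" % "8.0.29"
)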
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Spark SQL reading from Hive
 */
object SparkSQLReadHive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
    val sparkSession = SparkSession.builder()
      .appName("SparkSQLReadHive")
      .config(conf)
      .config("spark.sql.warehouse.dir", "hdfs://bigdata01:9000/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()
    sparkSession.sql("select * from student").show()
    sparkSession.stop()
  }
}
Running this locally on Windows fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:587)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:562)
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
Solution
- Download Hadoop locally and unpack it.
- Download winutils.exe and place it in Hadoop's bin directory.
- Set the HADOOP_HOME environment variable, or set it in code:
System.setProperty("hadoop.home.dir", "C:\\D-myfiles\\software\\hadoop-3.2.0\\hadoop-3.2.0")
Then another error appears:
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
According to solutions found online, you need to run:
winutils.exe chmod 777 C:\tmp\hive
but that command itself fails:
The code cannot continue because MSVCR100.dll was not found
MSVCR100.dll is part of the Visual C++ 2010 runtime, so installing that redistributable would likely fix it, but it is too much hassle; I am leaving this alone for now.
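For completeness, here is a minimal sketch of writing to Hive from code with the same kind of session, for an environment where the setup works (for example, on the cluster itself); the table name is hypothetical:

import org.apache.spark.sql.SparkSession

object SparkSQLWriteHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSQLWriteHive")
      .master("local")
      .enableHiveSupport()
      .getOrCreate()

    // Build a small DataFrame and save it as a managed table (overwrites if it exists)
    import spark.implicits._
    val df = Seq(("lisi", 23), ("wangwu", 30)).toDF("name", "age")
    df.write.mode("overwrite").saveAsTable("tb_spark_from_code")

    // Plain SQL inserts also work through the same session
    spark.sql("insert into tb_spark_from_code values('zhaoliu', 27)")

    spark.stop()
  }
}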
References
解决windows上The root scratch dir: /tmp/hive on HDFS should be writable.Current permissions are: ------
本地spark连接hive相关问题总结
Integrating Hive with Spark by Configuring hive-site.xml
Before configuration
[root@node1 ~]# cd /export/server/spark-2.4.5-bin-hadoop2.7/
[root@node1 spark-2.4.5-bin-hadoop2.7]# ll
[root@node1 spark-2.4.5-bin-hadoop2.7]# cd bin/
[root@node1 bin]# ll
total 112
-rwxr-xr-x 1 user1 user1 1089 Feb 3 2020 beeline
-rw-r--r-- 1 user1 user1 1064 Feb 3 2020 beeline.cmd
-rwxr-xr-x 1 user1 user1 5440 Feb 3 2020 docker-image-tool.sh
-rwxr-xr-x 1 user1 user1 1933 Feb 3 2020 find-spark-home
-rw-r--r-- 1 user1 user1 2681 Feb 3 2020 find-spark-home.cmd
-rw-r--r-- 1 user1 user1 1892 Feb 3 2020 load-spark-env.cmd
-rw-r--r-- 1 user1 user1 2025 Feb 3 2020 load-spark-env.sh
-rwxr-xr-x 1 user1 user1 2987 Feb 3 2020 pyspark
-rw-r--r-- 1 user1 user1 1540 Feb 3 2020 pyspark2.cmd
-rw-r--r-- 1 user1 user1 1170 Feb 3 2020 pyspark.cmd
-rwxr-xr-x 1 user1 user1 1030 Feb 3 2020 run-example
-rw-r--r-- 1 user1 user1 1223 Feb 3 2020 run-example.cmd
-rwxr-xr-x 1 user1 user1 3196 Feb 3 2020 spark-class
-rw-r--r-- 1 user1 user1 2817 Feb 3 2020 spark-class2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb 3 2020 spark-class.cmd
-rwxr-xr-x 1 user1 user1 1039 Feb 3 2020 sparkR
-rw-r--r-- 1 user1 user1 1097 Feb 3 2020 sparkR2.cmd
-rw-r--r-- 1 user1 user1 1168 Feb 3 2020 sparkR.cmd
-rwxr-xr-x 1 user1 user1 3122 Feb 3 2020 spark-shell
-rw-r--r-- 1 user1 user1 1818 Feb 3 2020 spark-shell2.cmd
-rw-r--r-- 1 user1 user1 1178 Feb 3 2020 spark-shell.cmd
-rwxr-xr-x 1 user1 user1 1065 Feb 3 2020 spark-sql
-rw-r--r-- 1 user1 user1 1118 Feb 3 2020 spark-sql2.cmd
-rw-r--r-- 1 user1 user1 1173 Feb 3 2020 spark-sql.cmd
-rwxr-xr-x 1 user1 user1 1040 Feb 3 2020 spark-submit
-rw-r--r-- 1 user1 user1 1155 Feb 3 2020 spark-submit2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb 3 2020 spark-submit.cmd
[root@node1 bin]# spark-sql
21/09/02 16:32:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/09/02 16:32:26 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
21/09/02 16:32:26 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
Spark master: local[*], Application Id: local-1630571548369
spark-sql> show databases;
default
Time taken: 2.785 seconds, Fetched 1 row(s)
spark-sql> use d
data date date( date_add( date_format( date_sub( datediff( datetime
day( dayofmonth( decimal( decode( defined degrees( delimited dense_rank(
desc describe directory distinct distribute div( double double(
drop
spark-sql> use default
> ;
21/09/02 16:33:01 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Time taken: 0.034 seconds
spark-sql> exit;
[root@node1 bin]#
Integration approach
We use the simplest approach: just drop the hive-site.xml configuration file into Spark's conf directory. Spark then reads hive.metastore.uris from that file and connects to the Hive metastore's Thrift service automatically (of course, the Hive metastore service must be started beforehand for the connection to succeed).
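A minimal sketch of starting that service on the metastore host (node3 in this setup), assuming Hive's bin directory is on the PATH:

# start the Hive metastore Thrift service; it listens on port 9083 by default
nohup hive --service metastore > /tmp/metastore.log 2>&1 &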
hive-site.xml configuration
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>node3</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://node3:9083</value>
</property>
</configuration>
Integration
[root@node1 spark-2.4.5-bin-hadoop2.7]# cd /export/server/spark-2.4.5-bin-hadoop2.7/conf
[root@node1 conf]# ll
total 36
-rw-r--r-- 1 user1 user1 996 Feb 3 2020 docker.properties.template
-rw-r--r-- 1 user1 user1 1105 Feb 3 2020 fairscheduler.xml.template
-rw-r--r-- 1 user1 user1 2059 Aug 19 16:49 log4j.properties
-rw-r--r-- 1 user1 user1 7801 Feb 3 2020 metrics.properties.template
-rw-r--r-- 1 user1 user1 885 Aug 19 16:36 slaves
-rw-r--r-- 1 user1 user1 1502 Aug 19 17:41 spark-defaults.conf
-rwxr-xr-x 1 user1 user1 4705 Aug 19 16:42 spark-env.sh
[root@node1 conf]# rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring hive-site.xml...
100% 1 KB 1 KB/sec 00:00:01 0 Errors
[root@node1 conf]# ll
total 40
-rw-r--r-- 1 user1 user1 996 Feb 3 2020 docker.properties.template
-rw-r--r-- 1 user1 user1 1105 Feb 3 2020 fairscheduler.xml.template
-rw-r--r-- 1 root root 1844 May 2 19:52 hive-site.xml
-rw-r--r-- 1 user1 user1 2059 Aug 19 16:49 log4j.properties
-rw-r--r-- 1 user1 user1 7801 Feb 3 2020 metrics.properties.template
-rw-r--r-- 1 user1 user1 885 Aug 19 16:36 slaves
-rw-r--r-- 1 user1 user1 1502 Aug 19 17:41 spark-defaults.conf
-rwxr-xr-x 1 user1 user1 4705 Aug 19 16:42 spark-env.sh
[root@node1 conf]# cd ..
[root@node1 spark-2.4.5-bin-hadoop2.7]# ll
total 104
drwxr-xr-x 3 user1 user1 4096 Sep 2 16:32 bin
drwxr-xr-x 2 user1 user1 215 Sep 2 16:37 conf
drwxr-xr-x 5 user1 user1 50 Feb 3 2020 data
drwxr-xr-x 4 user1 user1 29 Feb 3 2020 examples
drwxr-xr-x 2 user1 user1 12288 Feb 3 2020 jars
drwxr-xr-x 4 user1 user1 38 Feb 3 2020 kubernetes
-rw-r--r-- 1 user1 user1 21371 Feb 3 2020 LICENSE
drwxr-xr-x 2 user1 user1 4096 Feb 3 2020 licenses
-rw-r--r-- 1 user1 user1 42919 Feb 3 2020 NOTICE
drwxr-xr-x 9 user1 user1 311 Feb 3 2020 python
drwxr-xr-x 3 user1 user1 17 Feb 3 2020 R
-rw-r--r-- 1 user1 user1 3756 Feb 3 2020 README.md
-rw-r--r-- 1 user1 user1 187 Feb 3 2020 RELEASE
drwxr-xr-x 2 user1 user1 4096 Feb 3 2020 sbin
drwxr-xr-x 2 user1 user1 42 Feb 3 2020 yarn
[root@node1 spark-2.4.5-bin-hadoop2.7]# cd bin/
[root@node1 bin]# ll
total 116
-rwxr-xr-x 1 user1 user1 1089 Feb 3 2020 beeline
-rw-r--r-- 1 user1 user1 1064 Feb 3 2020 beeline.cmd
-rw-r--r-- 1 root root 724 Sep 2 16:32 derby.log
-rwxr-xr-x 1 user1 user1 5440 Feb 3 2020 docker-image-tool.sh
-rwxr-xr-x 1 user1 user1 1933 Feb 3 2020 find-spark-home
-rw-r--r-- 1 user1 user1 2681 Feb 3 2020 find-spark-home.cmd
-rw-r--r-- 1 user1 user1 1892 Feb 3 2020 load-spark-env.cmd
-rw-r--r-- 1 user1 user1 2025 Feb 3 2020 load-spark-env.sh
drwxr-xr-x 5 root root 133 Sep 2 16:32 metastore_db
-rwxr-xr-x 1 user1 user1 2987 Feb 3 2020 pyspark
-rw-r--r-- 1 user1 user1 1540 Feb 3 2020 pyspark2.cmd
-rw-r--r-- 1 user1 user1 1170 Feb 3 2020 pyspark.cmd
-rwxr-xr-x 1 user1 user1 1030 Feb 3 2020 run-example
-rw-r--r-- 1 user1 user1 1223 Feb 3 2020 run-example.cmd
-rwxr-xr-x 1 user1 user1 3196 Feb 3 2020 spark-class
-rw-r--r-- 1 user1 user1 2817 Feb 3 2020 spark-class2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb 3 2020 spark-class.cmd
-rwxr-xr-x 1 user1 user1 1039 Feb 3 2020 sparkR
-rw-r--r-- 1 user1 user1 1097 Feb 3 2020 sparkR2.cmd
-rw-r--r-- 1 user1 user1 1168 Feb 3 2020 sparkR.cmd
-rwxr-xr-x 1 user1 user1 3122 Feb 3 2020 spark-shell
-rw-r--r-- 1 user1 user1 1818 Feb 3 2020 spark-shell2.cmd
-rw-r--r-- 1 user1 user1 1178 Feb 3 2020 spark-shell.cmd
-rwxr-xr-x 1 user1 user1 1065 Feb 3 2020 spark-sql
-rw-r--r-- 1 user1 user1 1118 Feb 3 2020 spark-sql2.cmd
-rw-r--r-- 1 user1 user1 1173 Feb 3 2020 spark-sql.cmd
-rwxr-xr-x 1 user1 user1 1040 Feb 3 2020 spark-submit
-rw-r--r-- 1 user1 user1 1155 Feb 3 2020 spark-submit2.cmd
-rw-r--r-- 1 user1 user1 1180 Feb 3 2020 spark-submit.cmd
[root@node1 bin]# sp
spark-class sparkR spark-shell spark-sql spark-submit splain split sprof
[root@node1 bin]# spark-sql
21/09/02 16:37:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1630571880126
spark-sql> show databases;
aaa
default
Time taken: 1.886 seconds, Fetched 2 row(s)
spark-sql> select * from aaa.test1;
1 A1
2 A2
3 A3
4 A4
5 A5
6 A6
Time taken: 1.253 seconds, Fetched 6 row(s)
spark-sql> exit;
[root@node1 bin]#
Once integrated, spark-sql works much like Hive's Beeline client.
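The same tables can also be queried from spark-shell on this node, since the session it creates picks up the same hive-site.xml; a minimal sketch:

// inside spark-shell, `spark` is the pre-built SparkSession with Hive support
spark.sql("show databases").show()
spark.sql("select * from aaa.test1").show()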