Spark SQL 使用beeline访问hive仓库

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Spark SQL 使用beeline访问hive仓库相关的知识,希望对你有一定的参考价值。

一、添加hive-site.xml

在$SPARK_HOME/conf下添加hive-site.xml的配置文件,目的是能正常访问hive的元数据

vim hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://192.168.1.201:3306/hiveDB?createDatabaseIfNotExist=true</value>
        </property>

    <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
            <value>123456</value>
        </property>
        <!-- hive查询时输出列名 -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <!-- 显示当前数据库名 -->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>

注意:在节点上不需要部署hive,只要是你可以连接到hive的元数据就可以!

二、启动thriftserver服务

[[email protected] spark]$ ./sbin/start-thriftserver.sh --jars ~/softwares/mysql-connector-java-5.1.47.jar 
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, 
logging to /home/hadoop/app/spark/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop003.out

检查日志,确认thriftserver服务正常启动

[[email protected] spark]$ tail -50f /home/hadoop/app/spark/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop003.out

19/05/21 09:39:14 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
19/05/21 09:39:15 INFO metastore.ObjectStore: Initialized ObjectStore
19/05/21 09:39:15 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
19/05/21 09:39:15 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
19/05/21 09:39:15 INFO metastore.HiveMetaStore: Added admin role in metastore
19/05/21 09:39:15 INFO metastore.HiveMetaStore: Added public role in metastore
19/05/21 09:39:15 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
19/05/21 09:39:15 INFO metastore.HiveMetaStore: 0: get_all_databases
19/05/21 09:39:15 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_all_databases   
19/05/21 09:39:15 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
19/05/21 09:39:15 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:16 INFO session.SessionState: Created local directory: /tmp/73df82dd-1fd3-4dd5-97f1-680d53bd44bc_resources
19/05/21 09:39:16 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc
19/05/21 09:39:16 INFO session.SessionState: Created local directory: /tmp/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc
19/05/21 09:39:16 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc/_tmp_space.db
19/05/21 09:39:16 INFO client.HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is file:/home/hadoop/app/spark-2.4.2-bin-hadoop-2.6.0-cdh5.7.0/spark-warehouse
19/05/21 09:39:16 INFO session.SessionManager: Operation log root directory is created: /tmp/hadoop/operation_logs
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread pool size: 100
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread wait queue size: 100
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread keepalive time: 10 seconds
19/05/21 09:39:16 INFO service.AbstractService: Service:OperationManager is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:SessionManager is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service: CLIService is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service: HiveServer2 is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:OperationManager is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:SessionManager is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:CLIService is started.
19/05/21 09:39:16 INFO metastore.ObjectStore: ObjectStore, initialize called
19/05/21 09:39:16 INFO DataNucleus.Query: Reading in results for query "[email protected]" since the connection used is closing
19/05/21 09:39:16 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
19/05/21 09:39:16 INFO metastore.ObjectStore: Initialized ObjectStore
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: get_databases: default
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_databases: default  
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: Shutting down the object store...
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=Shutting down the object store...   
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: Metastore shutdown complete.
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=Metastore shutdown complete.    
19/05/21 09:39:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:HiveServer2 is started.
19/05/21 09:39:16 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver/json,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver/session,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started [email protected]{/sqlserver/session/json,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO thrift.ThriftCLIService: 
Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads#标志启动成功

三、启动beeline

[[email protected] spark]$ ./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop
Connecting to jdbc:hive2://localhost:10000
19/05/21 09:46:19 INFO jdbc.Utils: Supplied authorities: localhost:10000
19/05/21 09:46:19 INFO jdbc.Utils: Resolved authority: localhost:10000
19/05/21 09:46:19 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.4.2)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:10000> select * from student.student limit 5;
+---------+-----------+-----------------+--------------------------------------------+--+
| stu_id  | stu_name  |  stu_phone_num  |                 stu_email                  |
+---------+-----------+-----------------+--------------------------------------------+--+
| 1       | Burke     | 1-300-746-8446  | [email protected]  |
| 2       | Kamal     | 1-668-571-5046  | [email protected]          |
| 3       | Olga      | 1-956-311-1686  | [email protected]     |
| 4       | Belle     | 1-246-894-6340  | [email protected]              |
| 5       | Trevor    | 1-300-527-4967  | [email protected]             |
+---------+-----------+-----------------+--------------------------------------------+--+
5 rows selected (3.275 seconds)
0: jdbc:hive2://localhost:10000> 

启动成功

四、注意

1、最好在spark/bin目录下启动beeline
因为如果你启动sparkbeeline的机器还部署了hive,恰巧你的hive环境变量正好在spark环境变量之前,那么很可能启动的是hive的beeline
比如:

[[email protected] spark]$ beeline
ls: cannot access /home/hadoop/app/spark/lib/spark-assembly-*.jar: No such file or directory
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/spark/bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0//bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0//sbin:/home/hadoop/app/zookeeper/bin:/usr/java/jdk1.8.0_131/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
Beeline version 1.1.0-cdh5.7.0 by Apache Hive  # 这不就是hive么
beeline> 

此时你查看下环境变量

[[email protected] spark]$ cat ~/.bash_profile 
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

export PATH
#####JAVA_HOME#####
export JAVA_HOME=/usr/java/jdk1.8.0_131

####ZOOKEEPER_HOME####
export ZOOKEEPER_HOME=/home/hadoop/app/zookeeper

#####HADOOP_HOME######
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/

export SPARK_HOME=/home/hadoop/app/spark

#####HIVE_HOME#####
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$JAVA_HOME/bin:$PATH

果然如果不指定beeline路径就会优先使用hive的beeline

以上是关于Spark SQL 使用beeline访问hive仓库的主要内容,如果未能解决你的问题,请参考以下文章

sparksql怎么批量删除分区

hive beeline详解

原创问题定位分享(18)beeline连接spark thrift有时会卡住

beeline使用小节

SparkThriftServer 源码分析

spark、hive、impala、hdfs的常用命令