Hive tables not found in Spark SQL - spark.sql.AnalysisException in Cloudera VM

Posted: 2017-04-25 19:15:54

I am trying to access Hive tables from a Java program, but my program does not see any tables in the default database. However, I can see the same tables and query them through spark-shell. I have already copied hive-site.xml into the Spark conf directory. The only difference: spark-shell runs Spark 1.6.0, while my Java program runs Spark 2.1.0.

package spark_210_test;

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkTest {

    private static SparkConf sparkConf;
    private static SparkSession sparkSession;

    public static void main(String[] args) {
        String warehouseLocation = "hdfs://quickstart.cloudera/user/hive/warehouse/";
        sparkConf = new SparkConf().setAppName("Hive Test").setMaster("local[*]")
                .set("spark.sql.warehouse.dir", warehouseLocation);

        sparkSession = SparkSession
                .builder()
                .config(sparkConf)
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> df0 = sparkSession.sql("show tables");
        List<Row> currentTablesList = df0.collectAsList();
        if (currentTablesList.size() > 0) {
            for (int i = 0; i < currentTablesList.size(); i++) {
                String table = currentTablesList.get(i).getAs("name");
                System.out.printf("%s, ", table);
            }
        } else {
            System.out.printf("No Table found for %s.\n", warehouseLocation);
        }

        Dataset<Row> dfCount = sparkSession.sql("select count(*) from sample_07");
        System.out.println(dfCount.collect().toString());
    }
}
The output does not appear to read anything from the Hive warehouse: Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: sample_07; line 1 pos 21

The full output is given below:

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/cloudera/workspace/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/cloudera/workspace/PortalHandlerTest.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/cloudera/workspace/SparkTest.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/cloudera/workspace/JARs/slf4j-log4j12-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/25 12:01:51 INFO SparkContext: Running Spark version 2.1.0
17/04/25 12:01:51 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/04/25 12:01:51 WARN SparkContext: Support for Scala 2.10 is deprecated as of Spark 2.1.0
17/04/25 12:01:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/25 12:01:52 INFO SecurityManager: Changing view acls to: cloudera
17/04/25 12:01:52 INFO SecurityManager: Changing modify acls to: cloudera
17/04/25 12:01:52 INFO SecurityManager: Changing view acls groups to: 
17/04/25 12:01:52 INFO SecurityManager: Changing modify acls groups to: 
17/04/25 12:01:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(cloudera); groups with view permissions: Set(); users  with modify permissions: Set(cloudera); groups with modify permissions: Set()
17/04/25 12:01:53 INFO Utils: Successfully started service 'sparkDriver' on port 50644.
17/04/25 12:01:53 INFO SparkEnv: Registering MapOutputTracker
17/04/25 12:01:53 INFO SparkEnv: Registering BlockManagerMaster
17/04/25 12:01:53 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/04/25 12:01:53 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/04/25 12:01:53 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-f44e093c-d9a9-42ad-8f5f-9e21b99f0e45
17/04/25 12:01:53 INFO MemoryStore: MemoryStore started with capacity 375.7 MB
17/04/25 12:01:53 INFO SparkEnv: Registering OutputCommitCoordinator
17/04/25 12:01:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/04/25 12:01:54 INFO Utils: Successfully started service 'SparkUI' on port 4041.
17/04/25 12:01:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4041
17/04/25 12:01:54 INFO Executor: Starting executor ID driver on host localhost
17/04/25 12:01:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43409.
17/04/25 12:01:54 INFO NettyBlockTransferService: Server created on 10.0.2.15:43409
17/04/25 12:01:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/04/25 12:01:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 43409, None)
17/04/25 12:01:54 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:43409 with 375.7 MB RAM, BlockManagerId(driver, 10.0.2.15, 43409, None)
17/04/25 12:01:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 43409, None)
17/04/25 12:01:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 43409, None)
17/04/25 12:01:54 INFO SharedState: Warehouse path is 'hdfs://quickstart.cloudera/user/hive/warehouse/'.
17/04/25 12:01:54 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/04/25 12:01:55 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17/04/25 12:01:55 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/04/25 12:01:55 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/04/25 12:01:55 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
17/04/25 12:01:55 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
17/04/25 12:01:55 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
17/04/25 12:01:55 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/04/25 12:01:55 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
17/04/25 12:01:57 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/04/25 12:01:57 INFO ObjectStore: ObjectStore, initialize called
17/04/25 12:01:57 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/04/25 12:01:57 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/04/25 12:02:01 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/04/25 12:02:04 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/04/25 12:02:04 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/04/25 12:02:04 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/04/25 12:02:04 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/04/25 12:02:05 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/04/25 12:02:05 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/04/25 12:02:05 INFO ObjectStore: Initialized ObjectStore
17/04/25 12:02:05 INFO HiveMetaStore: Added admin role in metastore
17/04/25 12:02:05 INFO HiveMetaStore: Added public role in metastore
17/04/25 12:02:05 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/04/25 12:02:06 INFO HiveMetaStore: 0: get_all_databases
17/04/25 12:02:06 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_all_databases   
17/04/25 12:02:06 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/04/25 12:02:06 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
17/04/25 12:02:06 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/04/25 12:02:07 INFO SessionState: Created local directory: /tmp/135d2e8d-2300-4f62-b445-ec6e8b0461a7_resources
17/04/25 12:02:07 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/135d2e8d-2300-4f62-b445-ec6e8b0461a7
17/04/25 12:02:07 INFO SessionState: Created local directory: /tmp/cloudera/135d2e8d-2300-4f62-b445-ec6e8b0461a7
17/04/25 12:02:07 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/135d2e8d-2300-4f62-b445-ec6e8b0461a7/_tmp_space.db
17/04/25 12:02:07 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs://quickstart.cloudera/user/hive/warehouse/
17/04/25 12:02:07 INFO HiveMetaStore: 0: get_database: default
17/04/25 12:02:07 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_database: default   
17/04/25 12:02:07 INFO HiveMetaStore: 0: get_database: global_temp
17/04/25 12:02:07 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_database: global_temp   
17/04/25 12:02:07 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
17/04/25 12:02:08 INFO SparkSqlParser: Parsing command: show tables
17/04/25 12:02:12 INFO HiveMetaStore: 0: get_database: default
17/04/25 12:02:12 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_database: default   
17/04/25 12:02:12 INFO HiveMetaStore: 0: get_database: default
17/04/25 12:02:12 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_database: default   
17/04/25 12:02:12 INFO HiveMetaStore: 0: get_tables: db=default pat=*
17/04/25 12:02:12 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_tables: db=default pat=*    
No Table found for hdfs://quickstart.cloudera/user/hive/warehouse/.
17/04/25 12:02:13 INFO SparkSqlParser: Parsing command: select count(*) from sample_07
17/04/25 12:02:13 INFO HiveMetaStore: 0: get_table : db=default tbl=sample_07
17/04/25 12:02:13 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_table : db=default tbl=sample_07    
17/04/25 12:02:13 INFO HiveMetaStore: 0: get_table : db=default tbl=sample_07
17/04/25 12:02:13 INFO audit: ugi=cloudera  ip=unknown-ip-addr  cmd=get_table : db=default tbl=sample_07    
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: sample_07; line 1 pos 21
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:459)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:478)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:463)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
    at spark_210_test.SparkTest.main(SparkTest.java:35)
17/04/25 12:02:13 INFO SparkContext: Invoking stop() from shutdown hook
17/04/25 12:02:13 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4041
17/04/25 12:02:13 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/04/25 12:02:13 INFO MemoryStore: MemoryStore cleared
17/04/25 12:02:13 INFO BlockManager: BlockManager stopped
17/04/25 12:02:13 INFO BlockManagerMaster: BlockManagerMaster stopped
17/04/25 12:02:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/04/25 12:02:13 INFO SparkContext: Successfully stopped SparkContext
17/04/25 12:02:13 INFO ShutdownHookManager: Shutdown hook called
17/04/25 12:02:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-7c1cfc73-34b9-463d-b12a-5cbcb832b0f8

Just in case, my pom.xml is below:

<project xmlns="http://maven.apache.org/POM/4.0.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>spark_test_210</groupId>
  <artifactId>spark_test_210</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>2.1.0</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src</sourceDirectory>
  </build>
</project>

Any help would be greatly appreciated.

Comments:

Please provide your spark-submit command. Are you deploying in cluster mode?

I am building a jar from this project and running it from the command line. The end goal is to call a similar program from other programs: jar -cp "/usr/lib/hadoop/lib/*:/usr/lib/spark/lib/*" spark_210_test.SparkTest. This is on the Cloudera VM.

Try adding your Spark conf directory to the classpath. You can tell whether it worked by checking the "Environment" tab in the Spark History Server UI.

Are you aware that spark-shell invokes a bunch of other shell scripts in $SPARK_HOME/bin, which process the config files in $SPARK_HOME/conf and then build the appropriate Java command-line arguments (including the appropriate CLASSPATH)? And are you aware that the Hadoop libraries expect to find their config files, with hard-coded names, in a directory on the CLASSPATH?

Thanks Paul and Samson for the replies. I already run Hive queries over JDBC via the Hive URL and through Impala. This is purely for educational purposes, and I also wanted to try out Spark SQL. But as you both pointed out, the problem was indeed the classpath setup. After adding /usr/lib/spark/conf to the classpath, I can see all the Hive tables through Spark SQL.

Answer 1:

A few steps are needed:

1. Use SparkSession.enableHiveSupport() instead of the deprecated SQLContext or HiveContext.
2. Copy hive-site.xml into the Spark conf directory (/usr/lib/spark/conf).
3. Add that same directory to the classpath when executing the jar (thanks to Paul and Samson above); see the sketch below.
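
For illustration, a minimal sketch of step 3, assuming the application jar from the question (/home/cloudera/workspace/SparkTest.jar) and the default Cloudera quickstart library paths; adjust the paths to your environment:

# put the Spark conf directory (which holds hive-site.xml) on the classpath,
# ahead of the Hadoop/Spark libraries and the application jar
java -cp "/usr/lib/spark/conf:/usr/lib/hadoop/lib/*:/usr/lib/spark/lib/*:/home/cloudera/workspace/SparkTest.jar" \
     spark_210_test.SparkTest

With the conf directory on the classpath, the driver picks up hive-site.xml and connects to the existing metastore, which is likely why the log above shows "underlying DB is DERBY": without hive-site.xml visible, Spark falls back to a local embedded metastore that contains no tables.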

Comments:

Answer 2:

Follow the steps above as Joydeep described, and also prefix the table or view name with the database name, e.g. employee.emp:

import java.io.File;

import org.apache.spark.sql.SparkSession;

/**
 * @author dinesh.lomte
 *
 */
public class SparkHiveExample {

    public static void main(String... args) {
        String warehouseLocation = new File("spark-employee").getAbsolutePath();
        SparkSession spark = SparkSession.builder().appName("Java Spark Hive Example").master("local")
                .config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate();

        spark.sql("SELECT * FROM employee.emp").show();
    }
}

Comments:
