Spark cannot query Hive tables it can see?

Posted 2023-04-17

[Asked]: 2014-12-26 18:33:26 [Question]:

I am running the pre-built version of Spark 1.2 for CDH 4 on CentOS. I have copied the hive-site.xml file into Spark's conf directory, so it should be able to see the Hive metastore.

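I have three tables in Hive (facility, newpercentile, percentile), and I can query all of them from the Hive CLI. After I log into Spark and create a Hive context like this: val hiveC = new org.apache.spark.sql.hive.HiveContext(sc), I run into trouble querying these tables.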

If I run the following command: val tableList = hiveC.hql("show tables") and call collect() on tableList, I get this result: res0: Array[org.apache.spark.sql.Row] = Array([facility], [newpercentile], [percentile])

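If I then run this command to get a count of the facility table: val facTable = hiveC.hql("select count(*) from facility"), I get the following output, which I take to mean that it cannot find the facility table to query: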

scala> val facTable = hiveC.hql("select count(*) from facility")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/12/26 10:27:26 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

14/12/26 10:27:26 INFO ParseDriver: Parsing command: select count(*) from facility
14/12/26 10:27:26 INFO ParseDriver: Parse Completed
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(355177) called with curMem=0, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 346.9 KB, free 264.6 MB)
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(50689) called with curMem=355177, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 49.5 KB, free 264.6 MB)
14/12/26 10:27:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:45305 (size: 49.5 KB, free: 264.9 MB)
14/12/26 10:27:26 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/12/26 10:27:26 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:68

facTable: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[2] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==

Aggregate false, [], [Coalesce(SUM(PartialCount#38L),0) AS _c0#5L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#38L]
   HiveTableScan [], (MetastoreRelation default, facility, None), None

Any help would be appreciated. Thanks.

[Question comments]:

[Answer 1]:

scala> val facTable = hiveC.hql("select count(*) from facility")

Great! You have an RDD. Now what do you want to do with it?

scala> facTable.collect()

Keep in mind that an RDD is an abstraction on top of your data and is not materialized until you invoke an action on it, such as collect() or count().
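
The difference between building the query and running it can be sketched in the same spark-shell session. This is a minimal sketch, not a definitive recipe: it assumes `sc` is the SparkContext that spark-shell predefines and that the `facility` table from the question exists. It calls `sql` instead of `hql`, since `hql` is deprecated in this Spark line and is the likely source of the deprecation warning in the question's log. The name `facilityCount` is introduced here for illustration only.

```scala
// Sketch for spark-shell (Spark 1.2); `sc` is predefined by the shell.
import org.apache.spark.sql.hive.HiveContext

val hiveC = new HiveContext(sc)

// Transformation: this only builds a query plan. The shell prints that plan,
// which is the "== Physical Plan ==" output shown in the question.
val facTable = hiveC.sql("select count(*) from facility")

// Action: this runs the job and brings the result rows back to the driver.
val rows = facTable.collect()

// count(*) without GROUP BY yields a single row holding one bigint column.
val facilityCount = rows(0).getLong(0)
println("facility row count: " + facilityCount)
```

Seeing the plan with `HiveTableScan ... (MetastoreRelation default, facility, None)` therefore means the table was resolved; the count itself only appears once an action such as `collect()` runs.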

If you try to use a table name that does not exist, you will get a very obvious error.
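
One way to see this for yourself is to wrap a query against a missing table in `Try`. The sketch below assumes the `hiveC` context created in the question and uses `no_such_table` as a hypothetical name that is not in the metastore. It makes no claim about the exact exception type or about which step raises it, so both the query construction and the action are placed inside the `Try`.

```scala
import scala.util.{Try, Success, Failure}

// `no_such_table` is a hypothetical name assumed not to exist in the metastore.
// Both the query construction and the action sit inside Try, so the failure is
// captured regardless of which step raises it.
val attempt = Try(hiveC.sql("select count(*) from no_such_table").collect())

attempt match {
  case Success(rows) => println("unexpectedly succeeded: " + rows.mkString(", "))
  case Failure(e)    => println("query failed: " + e.getMessage)
}
```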

[Comments]:

That's right! Things look good now. Thanks a lot. The None listed in the last line confused me, to say the least. Thanks for your prompt reply.
