Spark saveAsTable append saves data to Hive but throws an error: org.apache.hadoop.hive.ql.metadata.Hive.alterTable

Posted: 2021-08-08 18:21:13

Question:

I am trying to append data to an existing table in Hive, but when I call

sdf.write.format("parquet").mode("append").saveAsTable("db.tbl", path=hdfs_path)

the data is saved successfully, yet this error is thrown:

Py4JJavaError: An error occurred while calling o152.saveAsTable.
: java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.alterTable(java.lang.String, org.apache.hadoop.hive.ql.metadata.Table, org.apache.hadoop.hive.metastore.api.EnvironmentContext)
    at java.lang.Class.getMethod(Class.java:1786)
    at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:177)
    at org.apache.spark.sql.hive.client.Shim_v2_1.alterTableMethod$lzycompute(HiveShim.scala:1183)
    at org.apache.spark.sql.hive.client.Shim_v2_1.alterTableMethod(HiveShim.scala:1177)
    at org.apache.spark.sql.hive.client.Shim_v2_1.alterTable(HiveShim.scala:1230)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$alterTable$1(HiveClientImpl.scala:572)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
    at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:562)
    at org.apache.spark.sql.hive.client.HiveClient.alterTable(HiveClient.scala:107)
    at org.apache.spark.sql.hive.client.HiveClient.alterTable$(HiveClient.scala:106)
    at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:90)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$alterTableStats$1(HiveExternalCatalog.scala:719)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
    at org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:705)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.alterTableStats(ExternalCatalogWithListener.scala:133)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:420)
    at org.apache.spark.sql.execution.command.CommandUtils$.updateTableStats(CommandUtils.scala:63)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:198)
    at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:538)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:219)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:167)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
    at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:727)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:705)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:603)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I also tried some alternatives:

sdf.write.insertInto("db.tbl", overwrite=False)
sdf.write.mode("append").insertInto("db.tbl")
spark.sql("insert into table value(...)")

But they hit the same problem: any attempt to add data to an existing table succeeds yet still throws that error.

"Overwrite" mode works fine.

I am using Spark 3.0.1 and Hive 3.1.0.

Has anyone encountered this problem before?


Answer 1:

It looks like some of the Hive metastore artifacts referenced by your Spark 3 build are from Hive 2.x rather than the 3.x you are using.
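One way to check (a minimal PySpark sketch; it assumes you can run code against the same session and only reads standard Spark SQL configuration keys):

from pyspark.sql import SparkSession

# Reuse (or create) the Hive-enabled session; in pyspark/spark-shell,
# getOrCreate() picks up the existing session.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Print which Hive metastore client version and jars this session is
# configured to use.
for key in ("spark.sql.hive.metastore.version",
            "spark.sql.hive.metastore.jars"):
    print(key, "=", spark.conf.get(key))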

Comments:

When I print the Spark config I get ('spark.sql.hive.metastore.version', '3.0'), with spark.sql.hive.metastore.jars = Standalone-metastore-1.21.2.3.1.4.41-5-hive3.jar. I also found this jar in use in the spark2-client/jars folder: hive-metastore-1.21.2.3.1.4.41-5.jar

Answer 2:

You definitely have the wrong Hive jars in your environment:

Your Spark expects Hive 3.x, which has the method alterTable(String, Table, EnvironmentContext). But according to your comment you have hive-metastore-1.21.2.3.1.4.41-5.jar, which comes from the Hortonworks distribution; you can download its source code and verify for yourself that it has no such method.
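If that is the cause, one possible fix is to point Spark at a matching Hive 3.x client when the session is created (a sketch, not a verified configuration for your cluster: the jar path below is hypothetical and must point at a real Hive 3.x installation visible to Spark):

from pyspark.sql import SparkSession

# Sketch: ask Spark to load a Hive 3.x metastore client instead of the
# mismatched jars it is picking up. Both keys are standard Spark SQL
# configs; "/opt/hive-3.1.0/lib/*" is a hypothetical path you would
# replace with your actual Hive 3.x lib directory.
spark = (
    SparkSession.builder
    .appName("hive-append")
    .config("spark.sql.hive.metastore.version", "3.1.0")
    # Spark 3.0 accepts "builtin", "maven", or a classpath here.
    .config("spark.sql.hive.metastore.jars", "/opt/hive-3.1.0/lib/*")
    .enableHiveSupport()
    .getOrCreate()
)

Note that these are static settings: they must be set before the first Hive-enabled session is created. Setting spark.sql.hive.metastore.jars to maven instead makes Spark download jars matching the requested version at startup.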

