Union can only be performed on tables with the compatible column types Spark dataframe

Posted 2023-04-15

【Title】Union can only be performed on tables with the compatible column types Spark dataframe 【Posted】2017-11-28 05:03:22 【Question】:

Here is my union code:

val dfToSave=dfMainOutput.union(insertdf.select(dfMainOutput).withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")))

When I perform the union, I get the following error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. string <> boolean at the 11th column of the second table;;
'Union

Here are the schemas of the two dataframes:

insertdf.printSchema()
root
 |-- OrganizationID: long (nullable = true)
 |-- SourceID: integer (nullable = true)
 |-- AuditorID: integer (nullable = true)
 |-- AuditorOpinionCode: string (nullable = true)
 |-- AuditorOpinionOnInternalControlCode: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernCode: string (nullable = true)
 |-- IsPlayingAuditorRole: boolean (nullable = true)
 |-- IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |-- AuditorEnumerationId: integer (nullable = true)
 |-- AuditorOpinionId: integer (nullable = true)
 |-- AuditorOpinionOnInternalControlsId: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernId: string (nullable = true)
 |-- IsPlayingCSRAuditorRole: boolean (nullable = true)
 |-- FFAction: string (nullable = true)
 |-- DataPartition: string (nullable = true)

Here is the schema of the second dataframe:

dfMainOutput.printSchema()
root
 |-- OrganizationID: long (nullable = true)
 |-- SourceID: integer (nullable = true)
 |-- AuditorID: integer (nullable = true)
 |-- AuditorOpinionCode: string (nullable = true)
 |-- AuditorOpinionOnInternalControlCode: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernCode: string (nullable = true)
 |-- IsPlayingAuditorRole: boolean (nullable = true)
 |-- IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |-- AuditorEnumerationId: integer (nullable = true)
 |-- AuditorOpinionId: integer (nullable = true)
 |-- AuditorOpinionOnInternalControlsId: integer (nullable = true)
 |-- AuditorOpinionOnGoingConcernId: boolean (nullable = true)
 |-- IsPlayingCSRAuditorRole: string (nullable = true)
 |-- FFAction: string (nullable = true)
 |-- DataPartition: string (nullable = true)

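To avoid this problem, I may have to write a select for every column. Is there any Scala syntax to handle the type casts, or to make the two dataframes have the same types?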

This is what I have tried so far, but I still get the same error:

val columns = dfMainOutput.columns.toSet.intersect(insertdf.columns.toSet).map(col).toSeq

//Perform Union
val dfToSave=dfMainOutput.select(columns: _*).union(insertdf.select(columns: _*)).withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")))

【Answer 1】:

To union two dataframes, the data type of each column must match. Note that union pairs columns by position, not by name.

Looking at your schemas, three columns do not meet this requirement:

AuditorOpinionOnInternalControlsId
AuditorOpinionOnGoingConcernId
IsPlayingCSRAuditorRole
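
With wider schemas, spotting such columns by eye gets tedious. A minimal sketch of doing the comparison in code, assuming both schemas list their columns in the same order (which is how union pairs them); `typeMismatches` is a hypothetical helper name, not part of Spark:

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper (not part of Spark): pairs up fields by position,
// the same way union does, and keeps the pairs whose data types differ.
// Assumption: both schemas have the same number of columns in the same order;
// zip silently ignores any extra trailing fields.
def typeMismatches(left: StructType, right: StructType): Seq[(String, DataType, DataType)] =
  left.fields.zip(right.fields).collect {
    case (l, r) if l.dataType != r.dataType => (l.name, l.dataType, r.dataType)
  }.toSeq
```

For the dataframes in the question this could be called as typeMismatches(dfMainOutput.schema, insertdf.schema).foreach(println), which prints one line per column whose types differ.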

A simple way to change the data types is to use withColumn and cast. In the code below I assume that the correct types are the ones in the dfMainOutput dataframe:

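import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType}
// The $"..." syntax also needs `import spark.implicits._`, where `spark` is your SparkSession.
// Cast the three mismatched columns to the types used in dfMainOutput.
val insertDfNew = insertdf
  .withColumn("AuditorOpinionOnInternalControlsId", $"AuditorOpinionOnInternalControlsId".cast(IntegerType))
  .withColumn("AuditorOpinionOnGoingConcernId", $"AuditorOpinionOnGoingConcernId".cast(BooleanType))
  .withColumn("IsPlayingCSRAuditorRole", $"IsPlayingCSRAuditorRole".cast(StringType))
  // Rewrite "O" and "I" to "I|!|"; otherwise() keeps every other value unchanged.
  .withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")).otherwise($"FFAction"))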

val dfToSave = dfMainOutput.union(insertDfNew)
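
To avoid writing one withColumn per column, the casts can also be derived from the target schema. The following is only a sketch under some assumptions: `alignToSchema` is a hypothetical helper name (not a Spark API), every column of the target schema is assumed to exist in the source dataframe, and column names are assumed to be plain identifiers without dots. With Spark's default (non-ANSI) settings, values that cannot be cast typically become null rather than raising an error, so it is worth checking the result.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not part of Spark): selects the target's columns by name,
// in the target's order, and casts each one to the target's data type.
// Assumption: every field of `target` exists in `df`.
def alignToSchema(df: DataFrame, target: StructType): DataFrame =
  df.select(target.fields.toSeq.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)
```

With the dataframes from the question, the union would then read dfMainOutput.union(alignToSchema(insertdf, dfMainOutput.schema)). The FFAction rewrite is not part of this helper and would still be applied separately with withColumn.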

【Comments】:

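Also, you can do the cast this way: df.select(col("value").cast("string"))

@BdEngineer: Sure, you can do that. If you have a list of dataframes that you want to merge into one, you can do: dataframeList.reduce(_.union(_)). If you compute them one at a time, it can also be done iteratively in a for loop, but you need a starting dataframe, or you have to create an empty dataframe with the correct schema.

@BdEngineer: I'm not sure whether there is any performance difference, but I think it should be very similar. I found this question, where the top two answers show what I wrote in the comment above (a bit clearer, with actual examples): ***.com/questions/43489807/… Note that the dataframe you merge into must be a var (it can change) rather than a val.

@BdEngineer: I'm not sure what the problem is. Maybe you can create a new question with the code, and hopefully I or someone else can spot what is wrong.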
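
The two approaches mentioned in the comments above could look roughly like this. It is a sketch, not the commenter's exact code: `unionAll` and `unionAllIteratively` are hypothetical helper names, and all dataframes are assumed to already have matching column types in the same order.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: unions a list of dataframes with reduce.
// Assumption: the list is non-empty; reduce throws on an empty collection.
def unionAll(dataframeList: Seq[DataFrame]): DataFrame =
  dataframeList.reduce(_.union(_))

// Iterative variant: `result` is a var because it is reassigned on each step,
// and `first` plays the role of the starting dataframe.
def unionAllIteratively(first: DataFrame, rest: Seq[DataFrame]): DataFrame = {
  var result = first
  for (df <- rest) result = result.union(df)
  result
}
```

The reduce form is shorter when all dataframes are available up front; the loop form fits the case where they are produced one at a time.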
