Given a primary key, compare the other columns of two data frames and output the differing columns vertically

Posted 2023-04-15

【Title】: Given primary key, compare other columns of two data frames and output diff columns in the vertical way
【Asked】: 2018-04-14 18:49:22
【Question】:

I want to compare two data frames that share the same schema and have a primary key column.

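For each primary key, if any of the other columns differ (possibly several of them, so all other columns need to be scanned dynamically), I want to output the column name together with its value in each data frame.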

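In addition, if a primary key is missing from one of the data frames, I still want it in the output (so a full outer join is needed). For example: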

Data frame 1:

+-----------+------+------+
|primary_key|book  |number|
+-----------+------+------+
|1          |book1 | 1    | 
|2          |book2 | 2    |
|3          |book3 | 3    |
|4          |book4 | 4    |
+-----------+------+------+

Data frame 2:

+-----------+------+------+
|primary_key|book  |number|
+-----------+------+------+
|1          |book1 | 1    | 
|2          |book8 | 8    |
|3          |book3 | 7    |
|5          |book5 | 5    |
+-----------+------+------+

The expected result is:

+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|2          |book            |book2     |book8     |
|2          |number          |2         |8         |
|3          |number          |3         |7         |
|4          |book            |book4     |null      |
|4          |number          |4         |null      |
|5          |book            |null      |book5     |
|5          |number          |null      |5         |
+-----------+----------------+----------+----------+

I know the first step is to join the two data frames on the primary key:

// joining the two DFs on primary_key
val result = df1.as("l")
    .join(df2.as("r"), "primary_key", "fullouter")

But I am not sure how to continue from there. Can anyone give me some advice? Thanks.

【Comments】:

What if several columns have different values?

【Answer 1】:

Data:

val df1 = Seq(
  (1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)
).toDF("primary_key", "book", "number")

val df2 = Seq(
  (1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)
).toDF("primary_key", "book", "number")

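Imports:

import org.apache.spark.sql.functions._
// toDF and the $"..." syntax also need the session implicits. spark-shell imports
// them automatically; elsewhere, assuming the SparkSession is named `spark`:
// import spark.implicits._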

Define the list of columns to compare:

val cols = Seq("book", "number")
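The list above is hard-coded, while the question asks for a dynamic scan of all non-key columns. A minimal sketch of one way to derive it, assuming both data frames share the same schema; `nonKeyColumns`, `exampleColumns` and `dynamicCols` are hypothetical names introduced for illustration and are not part of the original answer:

```scala
// Hypothetical helper (not part of the original answer): every column that is
// not a key column is treated as a column to compare.
def nonKeyColumns(allColumns: Seq[String], keyCols: Seq[String]): Seq[String] =
  allColumns.filterNot(c => keyCols.contains(c))

// Column names of the example schema; with a real DataFrame, pass df1.columns.toSeq.
val exampleColumns = Seq("primary_key", "book", "number")
val dynamicCols = nonKeyColumns(exampleColumns, Seq("primary_key"))
```

For the example schema this should give the same two columns as the hard-coded `cols`, and it keeps working if more value columns are added later.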

Join as you already do:

val joined = df1.as("l").join(df2.as("r"), Seq("primary_key"), "fullouter")

Define the comparison expression (one struct per compared column, exploded into rows):

val comp = explode(array(cols.map(c => struct(
  lit(c).alias("diff_column_name"), 
  // Value left
  col(s"l.$c").cast("string").alias("dataframe1"),  
  // Value right
  col(s"r.$c").cast("string").alias("dataframe2"),
  // Differs
  not(col(s"l.$c") <=> col(s"r.$c")).alias("diff")
)): _*))

Select and filter:

joined
  .withColumn("comp", comp)
  .select($"primary_key", $"comp.*")
  // Keep only the rows that differ, then drop the helper diff column
  .where($"diff").drop("diff")
  .orderBy("primary_key").show
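// +-----------+----------------+----------+----------+
// |primary_key|diff_column_name|dataframe1|dataframe2|
// +-----------+----------------+----------+----------+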
// |          2|            book|     book2|     book8|
// |          2|          number|         2|         8|
// |          3|          number|         3|         7|
// |          4|            book|     book4|      null|
// |          4|          number|         4|      null|
// |          5|            book|      null|     book5|
// |          5|          number|      null|         5|
// +-----------+----------------+----------+----------+

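【Comments】:

Thank you very much. What if the primary key consists of several columns, since I want to show each of them at the front of the result? Should I simply change the code to:

val columnNames = Seq("key1", "key2", "comp.*")
joined.withColumn("comp", comp).select(columnNames.head, columnNames.tail: _*).where($"diff").drop("diff")

Also, do you have the equivalent code for Python Spark? @hi-zir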
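The composite-key case raised in this comment can be sketched as below. This is not from the original answer. It is a hedged generalization of it, assuming both data frames share the same schema and have at least one non-key column. `diffVertical` and `keyCols` are hypothetical names. Note that the join must also list every key column, not only the final `select`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: same technique as the answer above, parameterized by key columns.
def diffVertical(df1: DataFrame, df2: DataFrame, keyCols: Seq[String]): DataFrame = {
  // Every non-key column is compared (assumes identical schemas).
  val valueCols = df1.columns.filterNot(c => keyCols.contains(c)).toSeq

  // One struct per compared column; values are cast to string so all structs share a type.
  val comp = explode(array(valueCols.map(c => struct(
    lit(c).alias("diff_column_name"),
    col(s"l.$c").cast("string").alias("dataframe1"),
    col(s"r.$c").cast("string").alias("dataframe2"),
    // <=> is null-safe equality, so a value compared with a missing row counts as a difference
    not(col(s"l.$c") <=> col(s"r.$c")).alias("diff")
  )): _*))

  df1.as("l")
    .join(df2.as("r"), keyCols, "fullouter")          // join on all key columns
    .withColumn("comp", comp)
    .select(keyCols.map(col) :+ col("comp.*"): _*)    // key columns first, then the struct fields
    .where(col("diff"))
    .drop("diff")
}
```

A call such as `diffVertical(df1, df2, Seq("key1", "key2")).orderBy("key1", "key2").show` would then list both key columns ahead of `diff_column_name`, `dataframe1` and `dataframe2`.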
