How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala

Posted 2023-04-15

【Posted】: 2020-10-28 15:31:56

【Question】:

Suppose I have dataframes like the following:

  val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
  val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

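I want to filter df2 down to only the rows whose combination of sport1 and sport2 also appears in some row of df1. For example, since df1 contains a row with sport1 -> Run, sport2 -> Run, the df2 row with that combination should be returned. The df2 row with sport1 -> Bike, sport2 -> Bike should not be returned. The value of the "name" column should not be considered at all.

The expected result is a dataframe with the following data: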

+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
+-------+------+------+

Thanks, and have a nice day!

【Question comments】:

What is the expected answer? Have you tried anything yourself?

【Answer 1】:

Try this:

val res = df3.intersect(df1).union(df3.intersect(df2))

+------+------+
|sport1|sport2|
+------+------+
|   Run|   Run|
|  Fish|  Fish|
|  Swim|  Fish|
+------+------+

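【Comments】:

Yes, but what if there is a third column that does not match at all, and I still want the rows where these two columns match?

【Answer 2】:

To filter a dataframe based on multiple column matches in another dataframe, you can use join: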

df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))

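Because the default join type is an inner join, only the rows whose "sport1" and "sport2" values are the same in both dataframes are kept. Because the join condition is given as a list of column names, Seq("sport1", "sport2"), the columns "sport1" and "sport2" are not duplicated in the result.

Using the input data from your example: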

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

You get:

+------+------+-------+
|sport1|sport2|name   |
+------+------+-------+
|Run   |Run   |kevin  |
|Fish  |Fish  |anthony|
+------+------+-------+
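
One caveat with the inner join: if df1 contained the same (sport1, sport2) pair more than once, each matching row of df2 would be repeated once per match. A left semi join avoids that, since it only keeps the rows of the left side that have at least one match. The sketch below is a possible variant, not part of the original answer; the local SparkSession settings and the name `filtered` are assumptions made for illustration, and it is meant to be pasted into spark-shell or run as a script. The final `select` puts the columns back in the order shown in the question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumed settings); in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

// "left_semi" keeps each df2 row at most once, and only if df1 has a row
// with the same (sport1, sport2) pair; df1's "name" column plays no role.
val filtered = df2
  .join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"), "left_semi")
  .select("name", "sport1", "sport2") // restore the column order from the question

filtered.show(false)
```

With the sample data this should leave only the kevin and anthony rows; the order in which the rows are displayed is not guaranteed.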
