How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala
Posted: 2020-10-28 15:31:56

**Question:**

Suppose I have three dataframes like the following:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
Here is the tabular view:

I want to filter df2 down to only the rows whose sport1 and sport2 combination is a valid row in df1. For example, since sport1 -> Run, sport2 -> Run is a valid row in df1, it would be returned as one of the rows from df2. It would not return sport1 -> Bike, sport2 -> Bike from df2. And it should not consider the value of the "name" column at all.
The expected result I am looking for is a dataframe with the following data:
+-------+------+------+
|name |sport1|sport2|
+-------+------+------+
|kevin |Run |Run |
|anthony|Fish |Fish |
+-------+------+------+
Thanks, and have a great day!

**Comments:**

What is the expected answer? Have you tried anything yourself?

**Answer 1:**

Try this:
val res = df3.intersect(df1).union(df3.intersect(df2))
+------+------+
|sport1|sport2|
+------+------+
| Run| Run|
| Fish| Fish|
| Swim| Fish|
+------+------+
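One way to address the concern about extra columns (this is a sketch, not part of the original answer): restrict the intersect to just the two sport columns, so a non-matching column such as "name" cannot break the match, then join the surviving pairs back to df2 to recover the full rows.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("intersect-sport-columns")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"), ("mike","Swim","Swim"), ("bob","Fish","Fish"))
  .toDF("name", "sport1", "sport2")
val df2 = Seq(("chris","Bike","Bike"), ("dave","Bike","Fish"), ("kevin","Run","Run"),
  ("anthony","Fish","Fish"), ("liz","Swim","Fish"))
  .toDF("name", "sport1", "sport2")

// Keep only the (sport1, sport2) pairs of df2 that also appear in df1 ...
val validPairs = df2.select("sport1", "sport2")
  .intersect(df1.select("sport1", "sport2"))

// ... then join back to df2 so the "name" column is preserved.
val result = df2.join(validPairs, Seq("sport1", "sport2"))

result.show(false)
```

This keeps the intersect idea while remaining indifferent to any columns outside the join keys.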
**Comments:**

Yes, but what if there is a third column that does not have a match, and I still want the result where these two columns match?

**Answer 2:**

To filter a dataframe based on multiple column matches in another dataframe, you can use a `join`:
df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))
Since the default join type is an inner join, you keep only the rows where "sport1" and "sport2" match in both dataframes. And because the join condition is given as the column list `Seq("sport1", "sport2")`, the "sport1" and "sport2" columns are not duplicated in the result.
Using the input data from your example:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
you get:
+------+------+-------+
|sport1|sport2|name |
+------+------+-------+
|Run |Run |kevin |
|Fish |Fish |anthony|
+------+------+-------+
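As an alternative sketch (not part of the original answer), Spark's left semi join expresses "keep the left rows that have a match on the right" directly. It guarantees that only df2's columns appear in the result, and that each df2 row appears at most once even if df1 contains duplicate sport pairs.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("semi-join-filter")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"), ("mike","Swim","Swim"), ("bob","Fish","Fish"))
  .toDF("name", "sport1", "sport2")
val df2 = Seq(("chris","Bike","Bike"), ("dave","Bike","Fish"), ("kevin","Run","Run"),
  ("anthony","Fish","Fish"), ("liz","Swim","Fish"))
  .toDF("name", "sport1", "sport2")

// "left_semi" keeps a df2 row when it has a match in df1 on the join keys,
// and never pulls any columns in from df1.
val filtered = df2.join(df1, Seq("sport1", "sport2"), "left_semi")

filtered.show(false)
```

The semi join also avoids having to drop or deduplicate df1's columns afterwards, which the plain inner join requires you to think about.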