Pyspark: Subtracting/Difference pyspark dataframes based on all columns
Posted: 2020-11-04 21:03:28
Question: I have two pyspark dataframes as shown below -
df1
id city country region continent
1 chicago USA NA NA
2 houston USA NA NA
3 Sydney Australia AU AU
4 London UK EU EU
df2
id city country region continent
1 chicago USA NA NA
2 houston USA NA NA
3 Paris France EU EU
5 London UK EU EU
I want to find the rows that exist in df2 but not in df1, comparing all column values. So df2 - df1 should give df_result as shown below
df_result
id city country region continent
3 Paris France EU EU
5 London UK EU EU
How can I achieve this in pyspark? Thanks in advance.
Answer 1: You can use a left_anti join:
df2.join(df1, on = ["id", "city", "country"], how = "left_anti").show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
If all columns have non-null values, you can join on every column:
df2.join(df1, on = df2.schema.names, how = "left_anti").show()
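Answer 2: A simpler solution is to use the exceptAll() function. The docs say -
Return a new SparkDataFrame containing rows in this SparkDataFrame but not in another SparkDataFrame while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL. Also as standard in SQL, this function resolves columns by position (not by name).
Creating the DFs here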
df_a = spark.createDataFrame([(1,"chicago","USA","NA","NA"),(2,"houston","USA","NA","NA"),(3,"Sydney","Australia","AU","AU"),(4,"London","UK","EU","EU")],[ "id","city","country","region","continent"])
df_a.show(truncate=False)
df_b = spark.createDataFrame([(1,"chicago","USA","NA","NA"),(2,"houston","USA","NA","NA"),(3,"Paris","France","EU","EU"),(5,"London","UK","EU","EU")],[ "id","city","country","region","continent"])
df_b.show(truncate=False)
df_a
+---+-------+---------+------+---------+
|id |city |country |region|continent|
+---+-------+---------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Sydney |Australia|AU |AU |
|4 |London |UK |EU |EU |
+---+-------+---------+------+---------+
df_b
+---+-------+-------+------+---------+
|id |city |country|region|continent|
+---+-------+-------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Paris |France |EU |EU |
|5 |London |UK |EU |EU |
+---+-------+-------+------+---------+
Final output
df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
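The sample data has no duplicate rows, so the "preserving the duplicates" part of the docs is not visible in the output above. The sketch below is a plain-Python model of the two SQL semantics, not Spark code: `except_all` and `except_distinct` are hypothetical helper names, and rows are modelled as tuples. In PySpark, `subtract()` is the method documented as equivalent to EXCEPT DISTINCT.

```python
from collections import Counter


def except_all(left, right):
    """Model of EXCEPT ALL: each right row cancels at most one equal left row."""
    remaining = Counter(left) - Counter(right)
    # Sorted only to make the output deterministic; Spark guarantees no order
    return sorted(remaining.elements())


def except_distinct(left, right):
    """Model of EXCEPT DISTINCT: drop every row found in right, then dedupe."""
    return sorted(set(left) - set(right))


# Hypothetical rows: (3, "Paris") appears three times on the left, once on the right
left_rows = [(3, "Paris"), (3, "Paris"), (3, "Paris"), (5, "London"), (1, "chicago")]
right_rows = [(1, "chicago"), (3, "Paris")]

all_result = except_all(left_rows, right_rows)
distinct_result = except_distinct(left_rows, right_rows)
print(all_result)       # [(3, 'Paris'), (3, 'Paris'), (5, 'London')]
print(distinct_result)  # [(5, 'London')]
```

Under EXCEPT ALL two of the three Paris rows survive, because the single matching row on the right cancels only one of them. Under EXCEPT DISTINCT the Paris row disappears entirely.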