Pyspark: Subtracting/Difference pyspark dataframes based on all columns

Posted: 2020-11-04 21:03:28

I have two pyspark dataframes as shown below -

df1

id     city      country       region    continent
1      chicago    USA          NA         NA
2      houston    USA          NA         NA
3      Sydney     Australia    AU         AU
4      London     UK           EU         EU

df2

id     city      country       region    continent
1      chicago    USA          NA         NA
2      houston    USA          NA         NA
3      Paris      France       EU         EU
5      London     UK           EU         EU

I want to find the rows that are present in df2 but not in df1, based on all column values. So df2 - df1 should produce df_result as shown below

df_result

id     city      country       region    continent
3      Paris      France       EU         EU
5      London     UK           EU         EU

How can I achieve this in pyspark? Thanks in advance.

Answer 1:

You can use a left_anti join:

df2.join(df1, on=["id", "city", "country"], how="left_anti").show()

+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

If all columns have non-null values, you can join on the full schema:

df2.join(df1, on=df2.schema.names, how="left_anti").show()


Answer 2:

A simpler solution is to use the exceptAll() function. The docs say -

Return a new SparkDataFrame containing rows in this SparkDataFrame but not in another SparkDataFrame, while preserving duplicates. This is equivalent to EXCEPT ALL in SQL. Also, as standard in SQL, this function resolves columns by position (not by name).

Creating the DFs here:

df_a = spark.createDataFrame([(1, "chicago", "USA", "NA", "NA"), (2, "houston", "USA", "NA", "NA"), (3, "Sydney", "Australia", "AU", "AU"), (4, "London", "UK", "EU", "EU")], ["id", "city", "country", "region", "continent"])
df_a.show(truncate=False)
df_b = spark.createDataFrame([(1, "chicago", "USA", "NA", "NA"), (2, "houston", "USA", "NA", "NA"), (3, "Paris", "France", "EU", "EU"), (5, "London", "UK", "EU", "EU")], ["id", "city", "country", "region", "continent"])
df_b.show(truncate=False)

df_a

+---+-------+---------+------+---------+
|id |city   |country  |region|continent|
+---+-------+---------+------+---------+
|1  |chicago|USA      |NA    |NA       |
|2  |houston|USA      |NA    |NA       |
|3  |Sydney |Australia|AU    |AU       |
|4  |London |UK       |EU    |EU       |
+---+-------+---------+------+---------+

df_b

+---+-------+-------+------+---------+
|id |city   |country|region|continent|
+---+-------+-------+------+---------+
|1  |chicago|USA    |NA    |NA       |
|2  |houston|USA    |NA    |NA       |
|3  |Paris  |France |EU    |EU       |
|5  |London |UK     |EU    |EU       |
+---+-------+-------+------+---------+

Final output:

df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

