How to treat null values after joining multiple tables

Posted 2023-04-15


【Posted】2020-06-01 18:51:30 【Question】:

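I am creating a new DataFrame by joining 4 DataFrames. After the join, I need to concatenate the data of two identically named columns that come from different DataFrames: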

col1   col2 expected
Acc1   Acc1 Acc1Acc1
Acc2   null Acc2
null   Acc3 Acc3

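Problem: if I concatenate without replacing the null values, I lose information. After the join, pyspark does not drop the common column, so we have two Account columns from 2 tables. I tried replacing the nulls with an empty string; it did not work and raised an error: DataFrame is not iterable.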

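Question: how can I replace null values with an empty string after joining the tables? Or is there a way to handle null and concat at the same time?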

df = df1\
.join(df2,"code",how = 'left') \
.join(df3,"id",how = 'left')\
.join(df4,"id",how = 'left')\
.withColumn('Account',F.when(df2('Account').isNull(),'').otherwise(df2('Account')))\
.withColumn('Account',F.when(df3('Account').isNull(),'').otherwise(df3('Account')))\
.withColumn("Account",F.concat(F.trim(df2.Account), F.trim(df3.Account)))


【Answer 1】:

Hello, and welcome. A function such as pyspark.sql.functions.concat_ws should do the trick here, for example:

import pyspark.sql.functions as f


df = spark.createDataFrame([
    (1, "John", "Smith"),
    (2, "Monty", "Python"), 
    (3, "Donald", None), 
], ['id', 'firstname', 'lastname'] 
)
df.show()
+---+---------+--------+
| id|firstname|lastname|
+---+---------+--------+
|  1|     John|   Smith|
|  2|    Monty|  Python|
|  3|   Donald|    null|
+---+---------+--------+

df.select(
    "*",
    f.concat_ws(
      "", 
      f.trim(f.col("firstname")), f.trim(f.col("lastname"))
    ).alias("concatenated")
).show()
+---+---------+--------+------------+
| id|firstname|lastname|concatenated|
+---+---------+--------+------------+
|  1|     John|   Smith|   JohnSmith|
|  2|    Monty|  Python| MontyPython|
|  3|   Donald|    null|      Donald|
+---+---------+--------+------------+

Hope this helps. You can find more information about the function here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.concat

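【Comments】:

Hi, thank you. However, the problem occurs after the join, because two columns have the same name. I used withColumn with tablename['colname'], and it did not work. How do I use the concat function after the join?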
