How to add multiple new columns with when condition in pyspark dataframe?

Posted: 2022-01-15 22:09:36

Question:
I need to add two new columns to my existing pyspark dataframe.
Below is my sample data:

Section   Grade     Promotion_grade Section_team
Admin       C       
Account     B       
IT          B   

Conditions:

If Section = Admin then Promotion_grade = B
If Section = Account then Promotion_grade = A
If Section = IT then
             If Grade = C then Promotion_grade = B & Section_team= team1
             If Grade = D  then Promotion_grade = C & Section_team= team2
             If Grade = A  then Promotion_grade = A+ & Section_team= team3

I can add one column for the first two conditions, but I don't know how to handle the remaining conditions.

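from pyspark.sql import functions as F

def addCols(data):
    # Covers only the Admin and Account rules; every other row gets the default
    data = data.withColumn('Promotion_grade', F.when(data.Section == 'Admin', 'B')
                                               .when(data.Section == 'Account', 'A')
                                               .otherwise('Not applicable'))
    return data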

Could someone please help me? Maybe the way I am doing it is wrong. Thanks.


Answer 1:

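You can nest when conditions to handle the nested conditions: pass an inner when expression as the value of an outer when.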

Working example

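from pyspark.sql import SparkSession, functions as F

# Reuse the active session (e.g. in a shell or notebook) or create one
spark = SparkSession.builder.getOrCreate()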

data = [("Admin", "C", ), 
        ("Account", "B", ), 
        ("IT", "B", ),
        ("IT", "C", ),
        ("IT", "D", ),
        ("IT", "A", ),]

df = spark.createDataFrame(data, ("Section", "Grade", ))

# Define Promotion Grade conditions for IT Section
it_promotion_grade = (F.when(F.col("Grade") == "C", "B")
                       .when(F.col("Grade") == "D", "C")
                       .when(F.col("Grade") == "A", "A+")
                       .otherwise("Not applicable"))

# Define Section Team conditions for IT Section
it_section_team = (F.when(F.col("Grade") == "C", "team1")
                    .when(F.col("Grade") == "D", "team2")
                    .when(F.col("Grade") == "A", "team3")
                    .otherwise("Not applicable"))

(df.withColumn("Promotion_grade", F.when(F.col("Section") == "Admin", "B")
                                  .when(F.col("Section") == "Account", "A")
                                  .when(F.col("Section") == "IT", it_promotion_grade)
                                  .otherwise("Not applicable"))
    .withColumn("Section_team", F.when(F.col("Section") == "IT", it_section_team)
                     .otherwise("Not applicable"))
    .show())

Output

+-------+-----+---------------+--------------+
|Section|Grade|Promotion_grade|  Section_team|
+-------+-----+---------------+--------------+
|  Admin|    C|              B|Not applicable|
|Account|    B|              A|Not applicable|
|     IT|    B| Not applicable|Not applicable|
|     IT|    C|              B|         team1|
|     IT|    D|              C|         team2|
|     IT|    A|             A+|         team3|
+-------+-----+---------------+--------------+
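
If you want to sanity-check the table above without Spark, the same nested logic can be mirrored in plain Python. This is only a rough sketch of the rules from the question, not part of the PySpark solution; the names promotion_and_team, IT_RULES and NA are made up for illustration.

```python
# Plain-Python mirror of the nested when/otherwise logic (no Spark needed).
# promotion_and_team, IT_RULES and NA are hypothetical names for illustration.
NA = "Not applicable"

# Inner rules, used only when Section == "IT":
# Grade -> (Promotion_grade, Section_team)
IT_RULES = {
    "C": ("B", "team1"),
    "D": ("C", "team2"),
    "A": ("A+", "team3"),
}

def promotion_and_team(section, grade):
    """Return (Promotion_grade, Section_team) for one row."""
    if section == "Admin":
        return "B", NA
    if section == "Account":
        return "A", NA
    if section == "IT":
        # IT grades without a rule fall back to the inner default
        return IT_RULES.get(grade, (NA, NA))
    # Any other section falls back to the outer default
    return NA, NA

rows = [("Admin", "C"), ("Account", "B"), ("IT", "B"),
        ("IT", "C"), ("IT", "D"), ("IT", "A")]
for section, grade in rows:
    print(section, grade, *promotion_and_team(section, grade))
```

The outer if chain plays the role of the outer when on Section, and the dictionary lookup plays the role of the inner when on Grade, with its default standing in for otherwise.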

Comments:

Oh my, that is very clear!! Thank you very much.
