Pyspark agg function to "explode" rows into columns
Posted: 2019-09-23 11:59:07

[Question] Basically, I have a dataframe that looks like this:
+----+-------+------+------+
| id | index | col1 | col2 |
+----+-------+------+------+
|  1 | a     | a11  | a12  |
|  1 | b     | b11  | b12  |
|  2 | a     | a21  | a22  |
|  2 | b     | b21  | b22  |
+----+-------+------+------+
The output I want looks like this:
+----+--------+--------+--------+--------+
| id | col1_a | col1_b | col2_a | col2_b |
+----+--------+--------+--------+--------+
|  1 | a11    | b11    | a12    | b12    |
|  2 | a21    | b21    | a22    | b22    |
+----+--------+--------+--------+--------+
So basically, I want to "explode" the index column into new columns after grouping by id. By the way, the row counts per id are the same, and every id has the same set of index values. I'm using pyspark.
[Comments]:
Do you want to hard-code the column names col1_a and col1_b, or should they be dynamic, depending on the distinct values of index? — They should depend on the distinct values of index.
[Answer 1]:
The desired output can be achieved with pivot.
from pyspark.sql import functions as F

df = spark.createDataFrame([[1, "a", "a11", "a12"], [1, "b", "b11", "b12"], [2, "a", "a21", "a22"], [2, "b", "b21", "b22"]], ["id", "index", "col1", "col2"])
df.show()
+---+-----+----+----+
| id|index|col1|col2|
+---+-----+----+----+
| 1| a| a11| a12|
| 1| b| b11| b12|
| 2| a| a21| a22|
| 2| b| b21| b22|
+---+-----+----+----+
Use pivot:
df3 = df.groupBy("id").pivot("index").agg(F.first(F.col("col1")), F.first(F.col("col2")))
collist = ["id", "col1_a", "col2_a", "col1_b", "col2_b"]
Rename the columns:
df3.toDF(*collist).show()
+---+------+------+------+------+
| id|col1_a|col2_a|col1_b|col2_b|
+---+------+------+------+------+
| 1| a11| a12| b11| b12|
| 2| a21| a22| b21| b22|
+---+------+------+------+------+
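As a side note, when agg() contains more than one aggregation, Spark derives each pivoted column name from the pivot value plus the aggregation expression, so aliasing the aggregations yields predictable names such as a_col1 without a positional toDF. A minimal sketch (df4 is just an illustrative name):

df4 = df.groupBy("id").pivot("index").agg(
    F.first("col1").alias("col1"),
    F.first("col2").alias("col2"),
)
df4.show()  # expected columns: id, a_col1, a_col2, b_col1, b_col2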
Note: with the toDF rename above, rearrange the columns according to your requirements.
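Following up on the comment thread, the column list can also be derived from the data instead of hard-coded. A sketch, assuming Spark's default behavior of sorting the pivot values and emitting, per pivot value, the aggregations in agg() order:

# Distinct index values, sorted to match the pivot's column ordering
index_values = sorted(r["index"] for r in df.select("index").distinct().collect())
# For each pivot value, one target name per aggregation, in agg() order
collist = ["id"] + [f"{c}_{v}" for v in index_values for c in ["col1", "col2"]]
df3.toDF(*collist).show()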