Divide a PySpark DataFrame Column by a Column in Another PySpark DataFrame When the ID Matches

Posted 2023-04-15

【Original title】: Divide Pyspark Dataframe Column by Column in other Pyspark Dataframe when ID Matches 【Posted】: 2017-04-07 21:38:31 【Question】:

I have a PySpark DataFrame, df1, that looks like this:

CustomerID  CustomerValue
12          .17
14          .15
14          .25
17          .50
17          .01
17          .35

I have a second PySpark DataFrame, df2, which is df1 grouped by CustomerID and aggregated with the sum function. It looks like this:

CustomerID  CustomerValueSum
12          .17
14          .40
17          .86

I want to add a third column to df1: df1['CustomerValue'] divided by df2['CustomerValueSum'] for the same CustomerID. The result would look like this:

CustomerID  CustomerValue  NormalizedCustomerValue
12          .17            1.00
14          .15            .38
14          .25            .62
17          .50            .58
17          .01            .01
17          .35            .41

In other words, I am trying to convert this Python/Pandas code to PySpark:

normalized_list = []
for idx, row in df1.iterrows():
    normalized_list.append(
        row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]

How can I do this?

【Answer 1】:

Code:

import pyspark.sql.functions as F

# Join each row to its customer's total, divide, then drop the helper column.
df1 = (
    df1
    .join(df2, "CustomerID")
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum"))
    .drop("CustomerValueSum")
)

Output:

df1.show()

+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
|        17|          0.5|     0.5813953488372093|
|        17|         0.01|   0.011627906976744186|
|        17|         0.35|     0.4069767441860465|
|        12|         0.17|                    1.0|
|        14|         0.15|    0.37499999999999994|
|        14|         0.25|                  0.625|
+----------+-------------+-----------------------+
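
The values in this table can be cross-checked without Spark. The sketch below redoes the same arithmetic in plain Python: it sums the values per CustomerID and then divides each value by its customer's total. The `rows` list is an assumption copied from the question's sample data, and the variable names are illustrative only.

```python
from collections import defaultdict

# Assumption: sample rows copied from the question's df1.
rows = [(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.50), (17, 0.01), (17, 0.35)]

# Step 1: per-customer totals (the role df2 plays in the join).
totals = defaultdict(float)
for customer_id, value in rows:
    totals[customer_id] += value

# Step 2: divide each value by the total for the same CustomerID.
normalized = [
    (customer_id, value, value / totals[customer_id])
    for customer_id, value in rows
]

for row in normalized:
    print(row)
```

Within each CustomerID the normalized values should add up to 1 (up to floating-point rounding), which is a quick sanity check for the join-based result as well.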

【Answer 2】:

This can also be done with a Spark window function, which avoids creating a separate DataFrame (df2) for the aggregated values:

Create the data for the input DataFrame:

from pyspark.sql import SparkSession

# SparkSession is the entry point since Spark 2.0; it replaces the deprecated HiveContext
spark = SparkSession.builder.getOrCreate()

data = [(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.5), (17, 0.01), (17, 0.35)]
df1 = spark.createDataFrame(data, ['CustomerID', 'CustomerValue'])
df1.show()
+----------+-------------+
|CustomerID|CustomerValue|
+----------+-------------+
|        12|         0.17|
|        14|         0.15|
|        14|         0.25|
|        17|          0.5|
|        17|         0.01|
|        17|         0.35|
+----------+-------------+

Define a window partitioned by CustomerID, then divide each value by the sum over that window:

from pyspark.sql import Window
import pyspark.sql.functions as F  # avoids shadowing Python's built-in sum

w = Window.partitionBy('CustomerID')

# Divide each value by the total of its CustomerID partition
df2 = df1.withColumn(
    'NormalizedCustomerValue',
    F.col('CustomerValue') / F.sum('CustomerValue').over(w)
).orderBy('CustomerID')

df2.show()
+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
|        12|         0.17|                    1.0|
|        14|         0.15|    0.37499999999999994|
|        14|         0.25|                  0.625|
|        17|          0.5|     0.5813953488372093|
|        17|         0.01|   0.011627906976744186|
|        17|         0.35|     0.4069767441860465|
+----------+-------------+-----------------------+
