Divide Pyspark Dataframe Column by Column in other Pyspark Dataframe when ID Matches
Posted: 2017-04-07 21:38:31

Question:

I have a PySpark DataFrame df1 that looks like this:
CustomerID  CustomerValue
12          .17
14          .15
14          .25
17          .50
17          .01
17          .35
I have a second PySpark DataFrame, df2, which is df1 grouped by CustomerID and aggregated with the sum function. It looks like this:
CustomerID  CustomerValueSum
12          .17
14          .40
17          .86
I want to add a third column to df1: df1['CustomerValue'] divided by df2['CustomerValueSum'] for the matching CustomerID. The result would look like this:
CustomerID  CustomerValue  NormalizedCustomerValue
12          .17            1.00
14          .15            .38
14          .25            .62
17          .50            .58
17          .01            .01
17          .35            .41
In other words, I am trying to translate this Python/pandas code into PySpark:
normalized_list = []
for idx, row in df1.iterrows():
    # Look up this row's group sum in df2 and divide by it
    normalized_list.append(
        row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]
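(Aside: the same normalization can be written as a single vectorized pandas expression with groupby/transform — shown only for reference, assuming the same column names:)

# Vectorized pandas equivalent: divide each value by its group's sum
df1['NormalizedCustomerValue'] = (
    df1['CustomerValue'] / df1.groupby('CustomerID')['CustomerValue'].transform('sum')
)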
How can I do this in PySpark?
Answer 1:

Code:
import pyspark.sql.functions as F

# Join the per-customer sums onto df1, compute the ratio, then drop the helper column
df1 = df1 \
    .join(df2, "CustomerID") \
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum")) \
    .drop("CustomerValueSum")
Output:
df1.show()
+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
| 17| 0.5| 0.5813953488372093|
| 17| 0.01| 0.011627906976744186|
| 17| 0.35| 0.4069767441860465|
| 12| 0.17| 1.0|
| 14| 0.15| 0.37499999999999994|
| 14| 0.25| 0.625|
+----------+-------------+-----------------------+
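(Design note: df2 holds one row per CustomerID, so it is usually much smaller than df1; if it fits in executor memory, a broadcast join hint can avoid a shuffle. A sketch, not required for correctness:)

from pyspark.sql.functions import broadcast

# Same join as above, but hint Spark to broadcast the small aggregated frame
df1 = df1 \
    .join(broadcast(df2), "CustomerID") \
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum")) \
    .drop("CustomerValueSum")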
Answer 2:

This can also be done with a Spark window function, so you do not need to create a separate DataFrame (df2) holding the aggregated values:
Create the input DataFrame:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # Spark 1.x-style entry point; on Spark 2+ use SparkSession instead

data = [(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.5), (17, 0.01), (17, 0.35)]
df1 = sqlContext.createDataFrame(data, ['CustomerID', 'CustomerValue'])
df1.show()
+----------+-------------+
|CustomerID|CustomerValue|
+----------+-------------+
| 12| 0.17|
| 14| 0.15|
| 14| 0.25|
| 17| 0.5|
| 17| 0.01|
| 17| 0.35|
+----------+-------------+
Define a window partitioned by CustomerID:
from pyspark.sql import Window
from pyspark.sql.functions import sum  # note: shadows the built-in sum

w = Window.partitionBy('CustomerID')

# Divide each value by the sum of CustomerValue over its CustomerID partition
df2 = df1.withColumn(
    'NormalizedCustomerValue',
    df1.CustomerValue / sum(df1.CustomerValue).over(w)
).orderBy('CustomerID')
df2.show()
+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
| 12| 0.17| 1.0|
| 14| 0.15| 0.37499999999999994|
| 14| 0.25| 0.625|
| 17| 0.5| 0.5813953488372093|
| 17| 0.01| 0.011627906976744186|
| 17| 0.35| 0.4069767441860465|
+----------+-------------+-----------------------+
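(As a quick sanity check, the normalized values should total 1.0 within each CustomerID — a small verification sketch reusing the imports above:)

# Each group's NormalizedCustomerValue should sum to (approximately) 1.0
df2.groupBy('CustomerID').agg(sum('NormalizedCustomerValue').alias('total')).show()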