在 Pyspark/Hive 中有条件的运行总计

Posted 2023-04-15

技术标签:

【中文标题】在 Pyspark/Hive 中有条件的运行总计【英文标题】：Running total with conditional in Pyspark/Hive 【发布时间】：2021-03-21 02:13:09 【问题描述】：

我有产品、品牌和百分比列。我想计算与当前行具有不同品牌的行以及与当前行具有相同品牌的行的百分比列的总和。如何在 PySpark 中或使用 spark.sql 来完成？

样本数据：

df = pd.DataFrame('a': ['a1','a2','a3','a4','a5','a6'],
              'brand':['b1','b2','b1', 'b3', 'b2','b1'],
          'pct': [40, 30, 10, 8,7,5])
df = spark.createDataFrame(df)

我在寻找什么：

product  brand  pct  pct_same_brand  pct_different_brand
a1       b1     40     null           null
a2       b2     30     null           40
a3       b1     10     40             30
a4       b3     8      null           80
a5       b2     7      30             58
a6       b1     5      50             45

这是我尝试过的：

df.createOrReplaceTempView('tmp')
spark.sql("""
select *, sum(pct * (select case when n1.brand = n2.brand then 1 else 0 end 
from tmp n1)) over(order by pct desc rows between UNBOUNDED PRECEDING and 1 
preceding) 
from tmp n2
""").show()

【问题讨论】：

【参考方案1】：

您可以通过从分区滚动总和（即pct_same_brand 列）中减去总滚动总和来获得pct_different_brand 列：

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'pct_same_brand', 
    F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand', 
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0))
)

df2.show()

+---+-----+---+--------------+-------------------+
|  a|brand|pct|pct_same_brand|pct_different_brand|
+---+-----+---+--------------+-------------------+
| a1|   b1| 40|          null|               null|
| a2|   b2| 30|          null|                 40|
| a3|   b1| 10|            40|                 30|
| a4|   b3|  8|          null|                 80|
| a5|   b2|  7|            30|                 58|
| a6|   b1|  5|            50|                 45|
+---+-----+---+--------------+-------------------+

【讨论】：

谢谢一百万。您还可以帮助计算上面编辑版本中提到的价格差异吗？如果您有新问题，请提出新问题，而不是修改现有问题

以上是关于在 Pyspark/Hive 中有条件的运行总计的主要内容，如果未能解决你的问题，请参考以下文章