如何计算窗口内的平均值,其中窗口的范围取决于列的值?
Posted
技术标签:
【中文标题】如何计算窗口内的平均值,其中窗口的范围取决于列的值?【英文标题】:How to calculate a mean inside a window where the range of the window depends on the value of a column? 【发布时间】:2019-11-11 12:51:10 【问题描述】:我有以下数据:
columns = ['aircraft_id', 'Liftoff', 'timestamp', 'value']
l =[
( '0003177d',1550000476500,1550000467000, -80.15625),
( '0003177d',1550000476500,1550000467500, -80.15625),
( '0003177d',1550000476500,1550000468000, -80.15625),
( '0003177d',1550000476500,1550000468500, -80.15625),
( '0003177d',1550000476500,1550000469000,-79.8046875),
( '0003177d',1550000476500,1550000469500,-79.8046875),
( '0003177d',1550000476500,1550000470000,-79.8046875),
( '0003177d',1550000476500,1550000470500,-79.8046875),
( '0003177d',1550000476500,1550000471000,-79.8046875),
( '0003177d',1550000476500,1550000471500,-79.8046875),
( '0003177d',1550000476500,1550000472000, -80.15625),
( '0003177d',1550000476500,1550000472500,-80.5078125),
( '0003177d',1550000476500,1550000473000, -80.859375),
( '0003177d',1550000476500,1550000473500, -80.859375),
( '0003177d',1550000476500,1550000474000, -80.859375),
( '0003177d',1550000476500,1550000474500, -80.859375),
( '0003177d',1550000476500,1550000475000, -80.859375),
( '0003177d',1550000476500,1550000475500, -80.859375),
( '0003177d',1550000476500,1550000476000, -80.859375),
( '0003177d',1550000476500,1550000476500,-80.5078125)]
df=spark.createDataFrame(l, columns)
df.show()
+-----------+-------------+-------------+-----------+
|aircraft_id| Liftoff| timestamp| value|
+-----------+-------------+-------------+-----------+
| 0003177d|1550000476500|1550000467000| -80.15625|
| 0003177d|1550000476500|1550000467500| -80.15625|
| 0003177d|1550000476500|1550000468000| -80.15625|
| 0003177d|1550000476500|1550000468500| -80.15625|
| 0003177d|1550000476500|1550000469000|-79.8046875|
| 0003177d|1550000476500|1550000469500|-79.8046875|
| 0003177d|1550000476500|1550000470000|-79.8046875|
| 0003177d|1550000476500|1550000470500|-79.8046875|
| 0003177d|1550000476500|1550000471000|-79.8046875|
| 0003177d|1550000476500|1550000471500|-79.8046875|
| 0003177d|1550000476500|1550000472000| -80.15625|
| 0003177d|1550000476500|1550000472500|-80.5078125|
| 0003177d|1550000476500|1550000473000| -80.859375|
| 0003177d|1550000476500|1550000473500| -80.859375|
| 0003177d|1550000476500|1550000474000| -80.859375|
| 0003177d|1550000476500|1550000474500| -80.859375|
| 0003177d|1550000476500|1550000475000| -80.859375|
| 0003177d|1550000476500|1550000475500| -80.859375|
| 0003177d|1550000476500|1550000476000| -80.859375|
| 0003177d|1550000476500|1550000476500|-80.5078125|
+-----------+-------------+-------------+-----------+
我想计算窗口内的值的平均值,其中窗口之间的范围取决于时间戳的当前值与 Liftoff 的时间戳。每架飞机都有不同的 Liftoff 值。
我试试:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('val', F.mean('value').over(Window.partitionBy('aircraft_id','ini_TO','Liftoff').orderBy('timestamp').rangeBetween(df['timestamp'], df['Liftoff']))
但是不行,有解决办法吗?
预期结果:
对于第一行,窗口的范围是从 1550000467000 到 1550000476500,因此平均值是 20 个值的总和除以 20 (-80,33203)。 对于第二行,窗口的范围是从 1550000467500 到 1550000476500,因此平均值是 19 个值的总和除以 19 (-80,34128|)。 等等……+-----------+-------------+-------------+---------+---------+
|aircraft_id| Liftoff| timestamp| value| val|
+-----------+-------------+-------------+---------+---------+
| 0003177d|1550000476500|1550000467000|-80,15625|-80,33203|
| 0003177d|1550000476500|1550000467500|-80,15625|-80,34128|
| 0003177d|1550000476500|1550000468000|-80,15625|-80,35156|
| 0003177d|1550000476500|1550000468500|-80,15625|-80,36305|
| 0003177d|1550000476500|1550000469000|-79,80469|-80,37598|
| 0003177d|1550000476500|1550000469500|-79,80469|-80,41406|
| 0003177d|1550000476500|1550000470000|-79,80469|-80,45759|
| 0003177d|1550000476500|1550000470500|-79,80469|-80,50781|
| 0003177d|1550000476500|1550000471000|-79,80469|-80,56641|
| 0003177d|1550000476500|1550000471500|-79,80469|-80,63565|
| 0003177d|1550000476500|1550000472000|-80,15625|-80,71875|
| 0003177d|1550000476500|1550000472500|-80,50781|-80,78125|
| 0003177d|1550000476500|1550000473000|-80,85938|-80,81543|
| 0003177d|1550000476500|1550000473500|-80,85938|-80,80915|
| 0003177d|1550000476500|1550000474000|-80,85938|-80,80078|
| 0003177d|1550000476500|1550000474500|-80,85938|-80,78906|
| 0003177d|1550000476500|1550000475000|-80,85938|-80,77148|
| 0003177d|1550000476500|1550000475500|-80,85938|-80,74219|
| 0003177d|1550000476500|1550000476000|-80,85938|-80,68359|
| 0003177d|1550000476500|1550000476500|-80,50781|-80,50781|
+-----------+-------------+-------------+---------+---------+
【问题讨论】:
这里是计算 pyspark 数据帧平均值的参考链接:***.com/questions/47995188/…。您也可以查看此链接以供参考 - ***.com/questions/44382822/… rangebetween 采用两个整数值来引用相对于当前行的行。它不能处理列。您能否向我们展示您的示例的预期输出? @cronoik,我编辑了带有预期结果的帖子,还编辑了 Liftoff 的值以使示例更容易。感谢您的帮助。 【参考方案1】:我想你快到了,你只需要在窗口规范中设置 rangeBetween
以从当前行 Window.currentRow
开始,直到窗口范围的末尾 Window.unboundedFollowing
如下所示:
注意:ini_TO
未在示例数据集中提供,因此从partitionBy
中删除以进行测试。
wind_spec = Window.partitionBy('aircraft_id','Liftoff').orderBy('timestamp').rangeBetween(Window.currentRow, Window.unboundedFollowing)
上面的窗口会给出想要的输出:
df.withColumn('val', F.mean('value').over(wind_spec)).show()
+-----------+-------------+-------------+-----------+------------------+
|aircraft_id| Liftoff| timestamp| value| val|
+-----------+-------------+-------------+-----------+------------------+
| 0003177d|1550000476500|1550000467000| -80.15625| -80.33203125|
| 0003177d|1550000476500|1550000467500| -80.15625|-80.34128289473684|
| 0003177d|1550000476500|1550000468000| -80.15625| -80.3515625|
| 0003177d|1550000476500|1550000468500| -80.15625|-80.36305147058823|
| 0003177d|1550000476500|1550000469000|-79.8046875| -80.3759765625|
| 0003177d|1550000476500|1550000469500|-79.8046875| -80.4140625|
| 0003177d|1550000476500|1550000470000|-79.8046875|-80.45758928571429|
| 0003177d|1550000476500|1550000470500|-79.8046875| -80.5078125|
| 0003177d|1550000476500|1550000471000|-79.8046875| -80.56640625|
| 0003177d|1550000476500|1550000471500|-79.8046875| -80.6356534090909|
| 0003177d|1550000476500|1550000472000| -80.15625| -80.71875|
| 0003177d|1550000476500|1550000472500|-80.5078125| -80.78125|
| 0003177d|1550000476500|1550000473000| -80.859375| -80.8154296875|
| 0003177d|1550000476500|1550000473500| -80.859375|-80.80915178571429|
| 0003177d|1550000476500|1550000474000| -80.859375| -80.80078125|
| 0003177d|1550000476500|1550000474500| -80.859375| -80.7890625|
| 0003177d|1550000476500|1550000475000| -80.859375| -80.771484375|
| 0003177d|1550000476500|1550000475500| -80.859375| -80.7421875|
| 0003177d|1550000476500|1550000476000| -80.859375| -80.68359375|
| 0003177d|1550000476500|1550000476500|-80.5078125| -80.5078125|
+-----------+-------------+-------------+-----------+------------------+
【讨论】:
以上是关于如何计算窗口内的平均值,其中窗口的范围取决于列的值?的主要内容,如果未能解决你的问题,请参考以下文章
如何在 MySQL 中为每个类别创建一个 SQL 窗口函数列?