Pandas apply on rolling with multi-column output

Posted 2023-03-12

[Asked] 2020-10-24 05:54:14
[Question]:

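I am writing code that applies a function over a rolling window, where the function returns multiple columns.

Input: a pandas Series
Expected output: a 3-column DataFrame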

def fun1(series):
    # Some calculations producing numbers a, b and c
    return {"a": a, "b": b, "c": c}

res.rolling('21 D').apply(fun1)

Contents of res:

time
2019-09-26 16:00:00    0.674969
2019-09-26 16:15:00    0.249569
2019-09-26 16:30:00   -0.529949
2019-09-26 16:45:00   -0.247077
2019-09-26 17:00:00    0.390827
                         ...   
2019-10-17 22:45:00    0.232998
2019-10-17 23:00:00    0.590827
2019-10-17 23:15:00    0.768991
2019-10-17 23:30:00    0.142661
2019-10-17 23:45:00   -0.555284
Length: 1830, dtype: float64

Error:

TypeError: must be real number, not dict

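What I have tried:

- Changing raw=True in apply
- Using a lambda function in apply
- Returning the result from fun1 as a list / numpy array / DataFrame / Series

I have also gone through many related posts on SO, to name a few:

- Pandas - Using `.rolling()` on multiple columns
- Returning two values from pandas.rolling_apply
- How to apply a function to two columns of Pandas dataframe
- Apply pandas function to column to create multiple new columns?

But none of the solutions given there solves this problem.

Is there a straightforward solution?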

[Answer 1]:

Here is a hacky answer that uses rolling and produces a DataFrame:

import pandas as pd
import numpy as np

dr = pd.date_range('09-26-2019', '10-17-2019', freq='15T')
data = np.random.rand(len(dr))

s = pd.Series(data, index=dr)

output = pd.DataFrame(columns=['a','b','c'])

row = 0

def compute(window, df):
    global row
    a = window.max()
    b = window.min()
    c = a - b
    df.loc[row,['a','b','c']] = [a,b,c]
    row+=1    
    return 1
    
# kwargs must be a dict; the return value (a Series of 1s) is discarded
s.rolling('1D').apply(compute, kwargs={'df': output})

output.index = s.index

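It seems that the rolling apply function always expects a single number to be returned, so that it can immediately build a new Series from the computed values.

I worked around this by creating a new output DataFrame (with the desired output columns) and writing to it inside the function. I was not sure whether there is a way to get the index inside the rolling object, so I used a global counter instead to keep track of which new row to write. Given the point above, though, you still need to return some number. So although the rolling operation itself returns a Series of 1s, output gets modified: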

In[0]:
s

Out[0]:
2019-09-26 00:00:00    0.106208
2019-09-26 00:15:00    0.979709
2019-09-26 00:30:00    0.748573
2019-09-26 00:45:00    0.702593
2019-09-26 01:00:00    0.617028
  
2019-10-16 23:00:00    0.742230
2019-10-16 23:15:00    0.729797
2019-10-16 23:30:00    0.094662
2019-10-16 23:45:00    0.967469
2019-10-17 00:00:00    0.455361
Freq: 15T, Length: 2017, dtype: float64

In[1]:
output

Out[1]:
                           a         b         c
2019-09-26 00:00:00  0.106208  0.106208  0.000000
2019-09-26 00:15:00  0.979709  0.106208  0.873501
2019-09-26 00:30:00  0.979709  0.106208  0.873501
2019-09-26 00:45:00  0.979709  0.106208  0.873501
2019-09-26 01:00:00  0.979709  0.106208  0.873501
                      ...       ...       ...
2019-10-16 23:00:00  0.980544  0.022601  0.957943
2019-10-16 23:15:00  0.980544  0.022601  0.957943
2019-10-16 23:30:00  0.980544  0.022601  0.957943
2019-10-16 23:45:00  0.980544  0.022601  0.957943
2019-10-17 00:00:00  0.980544  0.022601  0.957943

[2017 rows x 3 columns]

This feels more like an exploit of rolling than its intended use, so I would love to see a more elegant answer.
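If each output column can be computed on its own, one way to stay within the one-number-per-call rule is to run one rolling pass per column and assemble the results afterwards. This is a minimal sketch and not part of the original answer: the name per_column, the shorter date range, the seeded random data and the '15min' frequency alias are choices made for illustration only.

import numpy as np
import pandas as pd

# Small illustrative series (seeded so it is reproducible)
dr = pd.date_range("2019-09-26", "2019-09-28", freq="15min")
s = pd.Series(np.random.default_rng(0).random(len(dr)), index=dr)

roll = s.rolling("1D")

# Each expression returns one Series; the DataFrame is built from them
per_column = pd.DataFrame({
    "a": roll.max(),
    "b": roll.min(),
    # apply still returns a single scalar per window here
    "c": roll.apply(lambda w: w.max() - w.min(), raw=True),
})

The trade-off is that each window is visited once per column, so this only helps when a, b and c do not share an expensive intermediate computation.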

Update: thanks to @JuanPi, you can get the index of the rolling window using this answer. So a non-global answer could look like this:

def compute(window, df):
    a = window.max()
    b = window.min()
    c = a - b
    df.loc[window.index.max(),['a','b','c']] = [a,b,c]  
    return 1
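
The same idea can be written without side effects by iterating over the rolling object directly and collecting one dict per window. This is a sketch under the assumption that pandas 1.1 or later is installed, where Rolling objects are iterable; the names rows and windowed, and the shortened seeded series, are illustrative and not from the answer.

import numpy as np
import pandas as pd

dr = pd.date_range("2019-09-26", "2019-09-28", freq="15min")
s = pd.Series(np.random.default_rng(0).random(len(dr)), index=dr)

rows = []
# Each iteration yields the Series slice that makes up one window
for window in s.rolling("1D"):
    a = window.max()
    b = window.min()
    rows.append({"a": a, "b": b, "c": a - b})

# One row per window, aligned to the original index
windowed = pd.DataFrame(rows, index=s.index)

Like the compute version above, this loops in Python for every window, so it is not expected to be faster; it only avoids the dummy return value and the mutation of an outer DataFrame.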

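[Comments]:

You can use the trick from this answer, ***.com/a/60918101, to get the index of the current window.

@JuanPi Thanks for sharing, I was just about to ask about that! I have updated my answer to include it.

This is not that hacky; you are basically using the pandas rolling functionality as a window generator. What you do not get are the leading NaNs that a regular rolling window produces, but those can be prepended if needed.

[Answer 2]: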

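This hack seems to work for me, although the additional features of rolling cannot be used with this solution. However, thanks to multiprocessing, the application is noticeably faster.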

from multiprocessing import Pool
import functools

import pandas as pd  # needed for pd.Timedelta and pd.DataFrame below


def apply_fn(indices, fn, df):
    return fn(df.loc[indices])
              
    
def rolling_apply(df, fn, window_size, start=None, end=None):
    """
    The rolling application of a function fn on a DataFrame df given the window_size
    """
    x = df.index
    if start is not None:
        x = x[x >= start]
    if end is not None:
        x = x[x <= end]
    if type(window_size) == str:
        delta = pd.Timedelta(window_size)
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    else: 
        assert type(window_size) == int, "Window size should be str (representing Timedelta) or int"
        delta = window_size
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    
    with Pool() as pool:
        result = list(pool.map(functools.partial(apply_fn, fn=fn, df=df), index_sets))
    result = pd.DataFrame(data=result, index=x)
        
    return result

With the functions above in place, pass the function you want to roll to the custom rolling_apply:

result = rolling_apply(res, fun1, "21 D")

Contents of result:

                    a           b           c
time            
2019-09-26 16:00:00 NaN         NaN         NaN
2019-09-26 16:15:00 0.500000    0.106350    0.196394
2019-09-26 16:30:00 0.500000    0.389759    -0.724829
2019-09-26 16:45:00 2.000000    0.141436    -0.529949
2019-09-26 17:00:00 6.010184    0.141436    -0.459231
... ... ... ...
2019-10-17 22:45:00 4.864015    0.204483    -0.761609
2019-10-17 23:00:00 6.607717    0.204647    -0.761421
2019-10-17 23:15:00 7.466364    0.204932    -0.761108
2019-10-17 23:30:00 4.412779    0.204644    -0.760386
2019-10-17 23:45:00 0.998308    0.203039    -0.757979
1830 rows × 3 columns

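Notes:

- This implementation works for both Series and DataFrame input.
- This implementation works for both time-based and integer windows.
- The result returned by fun1 can be a list, a numpy array, a Series or a dict.
- window_size is treated only as the maximum window size, so for positions closer to the start than window_size, the window contains all elements back to the first element.
- The apply_fn function should not be nested inside rolling_apply, because pool.map cannot accept local or lambda functions: according to the multiprocessing library, they cannot be "pickled".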
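
To check the window construction and the dict-to-DataFrame step without involving a process pool, a single-process variant can be handy. The following is only a sketch: rolling_apply_serial and span_stats are hypothetical names introduced here, and the eight-point series is made up for the example.

import pandas as pd


def rolling_apply_serial(obj, fn, window_size):
    """Serial sketch of rolling_apply: same windows, no multiprocessing."""
    idx = obj.index
    # A string is read as a Timedelta, anything else as an integer offset
    delta = pd.Timedelta(window_size) if isinstance(window_size, str) else window_size
    # Window for label i is the half-open interval (i - delta, i]
    rows = [fn(obj.loc[idx[(idx > i - delta) & (idx <= i)]]) for i in idx]
    return pd.DataFrame(rows, index=idx)


def span_stats(window):
    # Returns a dict, which becomes one row with columns a, b, c
    return {"a": window.max(), "b": window.min(), "c": window.max() - window.min()}


idx = pd.date_range("2019-09-26", periods=8, freq="15min")
demo = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0], index=idx)
stats = rolling_apply_serial(demo, span_stats, "45min")

With a 45-minute window on 15-minute data, each window holds at most three points, and the first two windows are shorter because they reach back only to the first element, as described in the notes above.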
