Pandas: vectorize local range operations (max & sum for [i:i+2] rows)

Posted 2023-03-11

【Title】: Pandas: vectorize local range operations (max & sum for [i:i+2] rows)
【Published】: 2019-10-11 19:59:43
【Question】:

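I want to run a calculation over a local range for each row of a DataFrame while avoiding a slow for loop. For example, for each row in the data below, I want to find the maximum temperature over the next 3 days (including the current day) and the total rainfall over the next 3 days: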

Day Temperature Rain
0   30          4
1   31          14
2   31          0
3   30          0
4   33          5
5   34          0
6   32          0
7   33          2
8   31          5
9   29          9

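The ideal output would be the new columns in the table below. TempMax for day 0 shows the highest temperature from day 0 through day 2 (inclusive), and RainTotal shows the total rainfall from day 0 through day 2: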

Day  Temperature  Rain  TempMax  RainTotal
0    30           4     31       18
1    31           14    31       14
2    31           0     33       5
3    30           0     34       5
4    33           5     34       5
5    34           0     34       2
6    32           0     33       7
7    33           2     33       16
8    31           5     31       14
9    29           9     29       9

Currently I am using a for loop:

  import numpy as np

  # Make empty arrays to store each row's max & sum values
  temp_max = np.zeros(len(df))
  rain_total = np.zeros(len(df))

  # Loop through the df and do operations in the local range [i:i+2] (inclusive).
  # Python slices exclude the end, so rows i..i+2 are the slice i:i+3.
  for i in range(len(df)):
    temp_max[i] = df['Temperature'].iloc[i:i+3].max()
    rain_total[i] = df['Rain'].iloc[i:i+3].sum()

  # Insert the arrays to df
  df['TempMax'] = temp_max
  df['RainTotal'] = rain_total

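The for loop gets the job done, but it takes 50 minutes on my DataFrame. Is there a way to vectorize this or otherwise make it faster?

Many thanks!

【Question comments】:

【Answer 1】:

For the case where Day has data for consecutive days (no gaps), we can use fast NumPy and SciPy tools to the rescue -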

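import numpy as np
from scipy.ndimage import maximum_filter1d  # scipy.ndimage.filters is a deprecated namespace

N = 2 # look-ahead offset; each window spans N+1 rows
temp = df['Temperature'].to_numpy()
rain = df['Rain'].to_numpy()
# origin=-1 shifts the 3-wide window so that it covers rows i..i+2
df['TempMax'] = maximum_filter1d(temp,N+1,origin=-1,mode='nearest')
# Full convolution with a ones kernel; dropping the first N entries leaves the forward sums
df['RainTotal'] = np.convolve(rain,np.ones(N+1,dtype=int))[N:]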

Sample output -

In [27]: df
Out[27]: 
   Day  Temperature  Rain  TempMax  RainTotal
0    0           30     4       31         18
1    1           31    14       31         14
2    2           31     0       33          5
3    3           30     0       34          5
4    4           33     5       34          5
5    5           34     0       34          2
6    6           32     0       33          7
7    7           33     2       33         16
8    8           31     5       31         14
9    9           29     9       29          9
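
To see why the `[N:]` slice and `origin=-1` give forward-looking windows, it may help to compare both calls against a plain loop on the question's sample values. This is only a sketch: `forward_window_loop` is a made-up helper for the check, not part of NumPy or SciPy, and the arrays are typed in by hand rather than read from `df`.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

# Sample values copied from the question's table.
temp = np.array([30, 31, 31, 30, 33, 34, 32, 33, 31, 29])
rain = np.array([4, 14, 0, 0, 5, 0, 0, 2, 5, 9])
N = 2  # look-ahead offset; each window covers rows i..i+N

def forward_window_loop(a, n, func):
    """Hypothetical reference helper: apply func to a[i:i+n+1] for every i."""
    return np.array([func(a[i:i + n + 1]) for i in range(len(a))])

# The full convolution has len(rain) + N entries, and entry k+N sums rain[k..k+N],
# so dropping the first N entries leaves one forward sum per row.
rain_total = np.convolve(rain, np.ones(N + 1, dtype=int))[N:]

# origin=-1 moves the 3-wide window from rows i-1..i+1 to rows i..i+2;
# mode='nearest' pads the tail with the last value, which cannot change a max.
temp_max = maximum_filter1d(temp, size=N + 1, origin=-1, mode='nearest')

assert np.array_equal(rain_total, forward_window_loop(rain, N, np.sum))
assert np.array_equal(temp_max, forward_window_loop(temp, N, np.max))
```

Note that `origin=-1` is tied to a window of 3 rows; a different window length would need a different shift.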

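【Comments】:

This is also very fast, thanks for the solution! However, I do prefer the .rolling() method from the previous answer, since it is pure pandas and needs no extra imports.

【Answer 2】:

Use Series.rolling on the series in reversed order (reversed by indexing with iloc[::-1]), together with max and sum: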

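# rolling() looks backwards, so reverse the rows to make each window cover rows i..i+2;
# the assignment aligns on the index, which restores the original row order.
df['TempMax'] = df['Temperature'].iloc[::-1].rolling(3, min_periods=1).max()
df['RainTotal'] = df['Rain'].iloc[::-1].rolling(3, min_periods=1).sum()
print(df)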
   Day  Temperature  Rain  TempMax  RainTotal
0    0           30     4     31.0       18.0
1    1           31    14     31.0       14.0
2    2           31     0     33.0        5.0
3    3           30     0     34.0        5.0
4    4           33     5     34.0        5.0
5    5           34     0     34.0        2.0
6    6           32     0     33.0        7.0
7    7           33     2     33.0       16.0
8    8           31     5     31.0       14.0
9    9           29     9     29.0        9.0
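
As a possible alternative to reversing the series, more recent pandas releases (1.1 or later, as far as I know) ship a forward-looking window indexer, `pandas.api.indexers.FixedForwardWindowIndexer`. The sketch below rebuilds the question's sample frame by hand and assumes that indexer is available in the installed pandas version; it is not part of the original answer.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Sample data copied from the question.
df = pd.DataFrame({
    'Day': range(10),
    'Temperature': [30, 31, 31, 30, 33, 34, 32, 33, 31, 29],
    'Rain': [4, 14, 0, 0, 5, 0, 0, 2, 5, 9],
})

# Each window covers the current row and the next two rows.
indexer = FixedForwardWindowIndexer(window_size=3)

# min_periods=1 keeps the shortened windows at the end of the frame.
df['TempMax'] = df['Temperature'].rolling(indexer, min_periods=1).max()
df['RainTotal'] = df['Rain'].rolling(indexer, min_periods=1).sum()
print(df)
```

This keeps the rows in their original order throughout, so it does not rely on index alignment to undo a reversal.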

Another, faster solution uses strides in numpy to build a 2D array of windows, then applies numpy.nanmax and numpy.nansum:

n = 2
# Pad the tail with n NaNs so the last rows still get a full-width window;
# nanmax/nansum then ignore the padding.
t = np.concatenate([df['Temperature'].values, [np.nan] * n])
r = np.concatenate([df['Rain'].values, [np.nan] * n])

def rolling_window(a, window):
    # Build a 2D view in which row i is a[i:i+window], without copying the data
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

df['TempMax'] = np.nanmax(rolling_window(t, n + 1), axis=1)
df['RainTotal'] = np.nansum(rolling_window(r, n + 1), axis=1)
print(df)
   Day  Temperature  Rain  TempMax  RainTotal
0    0           30     4     31.0       18.0
1    1           31    14     31.0       14.0
2    2           31     0     33.0        5.0
3    3           30     0     34.0        5.0
4    4           33     5     34.0        5.0
5    5           34     0     34.0        2.0
6    6           32     0     33.0        7.0
7    7           33     2     33.0       16.0
8    8           31     5     31.0       14.0
9    9           29     9     29.0        9.0
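
The hand-written `as_strided` helper can probably be replaced by `numpy.lib.stride_tricks.sliding_window_view`, which builds the same kind of windowed view with bounds checking. The sketch below assumes NumPy 1.20 or later and uses the question's sample values typed in directly; it is a variation on the answer, not the answer's own code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sample values copied from the question, as floats so NaN padding is possible.
temp = np.array([30, 31, 31, 30, 33, 34, 32, 33, 31, 29], dtype=float)
rain = np.array([4, 14, 0, 0, 5, 0, 0, 2, 5, 9], dtype=float)
n = 2  # look-ahead offset; each window has n + 1 entries

# Pad the tail with NaN so the last rows also get a full-width window.
t = np.concatenate([temp, np.full(n, np.nan)])
r = np.concatenate([rain, np.full(n, np.nan)])

# Row i of each view is the padded array's slice [i:i+n+1]; shape is (10, 3).
temp_windows = sliding_window_view(t, n + 1)
rain_windows = sliding_window_view(r, n + 1)

# The NaN-aware reductions ignore the padding at the end.
temp_max = np.nanmax(temp_windows, axis=1)
rain_total = np.nansum(rain_windows, axis=1)
```

One caveat that applies to both variants: `nanmax` and `nansum` also skip any real NaN values in the data, which may or may not be what is wanted.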

Performance:

#[100000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)

In [23]: %%timeit
    ...: df['TempMax'] = np.nanmax(rolling_window(t, n + 1), axis=1)
    ...: df['RainTotal'] = np.nansum(rolling_window(r, n + 1), axis=1)
    ...: 
8.36 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %%timeit
    ...: df['TempMax'] = df['Temperature'].iloc[::-1].rolling(3, min_periods=1).max()
    ...: df['RainTotal'] = df['Rain'].iloc[::-1].rolling(3, min_periods=1).sum()
    ...: 
20.4 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

【Comments】:

Thanks! This cut the time from 50 minutes to 3 seconds. Awesome!

@Tuppitappi How about my solution?
