Calculate DataFrame values recursively
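【Title】: Calculate DataFrame values recursively 【Posted】: 2017-04-30 18:45:22 【Question】: I am trying to calculate the values of a column in a pandas DataFrame "recursively".
Suppose there are data for two different days with 10 observations each, and you want to compute a variable r for which only the first value (on each day) is given. The remaining 2*9 entries have to be computed, where each subsequent value depends on the previous entry of r and on an additional "contemporaneous" variable x.
The first problem is that I want to carry out the calculation for each day separately, i.e. I would like to use the pandas.groupby() function for all calculations... but when I try to subset the data and use the shift(1) function, I only get "NaN" entries: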
data.groupby(data.index)['r'] = ( (1+data.groupby(data.index)['x']*0.25) * (1+data.groupby(data.index)['r'].shift(1)))
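For my second approach, I used a for loop to iterate over the index (dates):
for i in range(2,21):
    data[data['rank'] == i]['r'] = ((1 + data[data['rank'] == i]['x'] * 0.25) * (1 + data[data['rank'] == i]['r'].shift(1)))
However, this does not work for me. Is there a way to perform this kind of calculation on DataFrames? Perhaps something like a rolling apply?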
Data:
import pandas as pd

df = pd.DataFrame({
    'rank' : [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
    'x' : [0.00275,0.00285,0.0031,0.0036,0.0043,0.0052,0.0063,0.00755,0.00895,0.0105,0.0027,0.00285,0.0031,0.00355,0.00425,0.0051,0.00615,0.00735,0.00875,0.0103],
    'r' : [0.00158,'NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',0.001485,'NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN']
    }, index=['2014-01-02', '2014-01-02', '2014-01-02', '2014-01-02',
              '2014-01-02', '2014-01-02', '2014-01-02', '2014-01-02',
              '2014-01-02', '2014-01-02', '2014-01-03', '2014-01-03',
              '2014-01-03', '2014-01-03', '2014-01-03', '2014-01-03',
              '2014-01-03', '2014-01-03', '2014-01-03', '2014-01-03'])
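As a side note on why the shift-based expression in the question produces almost only NaN: a vectorized expression is evaluated once over the whole column, so each row sees the original (mostly missing) previous value of r rather than a freshly computed one. The sketch below illustrates this on a hypothetical three-row frame named demo (the frame and the variable names are assumptions made for illustration, not part of the question's data).

```python
import numpy as np
import pandas as pd

# Hypothetical three-row example: only the first r is known.
demo = pd.DataFrame({
    "x": [0.00275, 0.00285, 0.0031],
    "r": [0.00158, np.nan, np.nan],
})

# One vectorized pass: every row uses the ORIGINAL r of the previous row,
# so only the second row has a non-missing predecessor.
one_pass = (1 + demo["x"] * 0.25) * (1 + demo["r"].shift(1))

filled_rows = one_pass.notna().tolist()
print(filled_rows)
```

Only the second row gets a value; the result is never fed back into the next row. Separately, an assignment of the form data[mask]['r'] = ... writes into a temporary copy produced by the boolean mask, so the original frame is left unchanged.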
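【Question discussion】:
There are many questions about this, but there is currently no fast way to do this kind of recurrence calculation in pandas. You have to loop. There is an open issue about it.
【Answer 1】:
To do a rolling apply, you can use pandas.groupby().apply(). Inside the apply, you can use a loop to do the calculation for each group. The inner loop could also be done with scipy.signal.lfilter, but I could not work out the exact formula you were after, so I simply skipped that part.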
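Code:
def rolling_apply(group):
    # Seed the list with the known first value of r for this day.
    r = [group.r.iloc[0]]
    for x in group.x:
        r.append((1 + r[-1]) * (1 + x * 0.25))
    # Drop the seed so the list has one entry per row.
    group.r = r[1:]
    return group

# group_keys=False keeps the original index, so the result aligns with df
# on recent pandas versions.
df['R'] = df.groupby(df.index, group_keys=False).apply(rolling_apply).r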
Result:
                   r  rank        x          R
2014-01-02   0.00158     1  0.00275   1.002269
2014-01-02       NaN     2  0.00285   2.003695
2014-01-02       NaN     3  0.00310   3.006023
2014-01-02       NaN     4  0.00360   4.009628
2014-01-02       NaN     5  0.00430   5.015014
2014-01-02       NaN     6  0.00520   6.022833
2014-01-02       NaN     7  0.00630   7.033894
2014-01-02       NaN     8  0.00755   8.049058
2014-01-02       NaN     9  0.00895   9.069306
2014-01-02       NaN    10  0.01050  10.095737
2014-01-03  0.001485     1  0.00270   1.002161
2014-01-03       NaN     2  0.00285   2.003588
2014-01-03       NaN     3  0.00310   3.005915
2014-01-03       NaN     4  0.00355   4.009471
2014-01-03       NaN     5  0.00425   5.014793
2014-01-03       NaN     6  0.00510   6.022462
2014-01-03       NaN     7  0.00615   7.033259
2014-01-03       NaN     8  0.00735   8.048020
2014-01-03       NaN     9  0.00875   9.067813
2014-01-03       NaN    10  0.01030  10.093737
Test data:
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'x': [0.00275, 0.00285, 0.0031, 0.0036, 0.0043, 0.0052, 0.0063, 0.00755,
          0.00895, 0.0105, 0.0027, 0.00285, 0.0031, 0.00355, 0.00425,
          0.0051, 0.00615, 0.00735, 0.00875, 0.0103],
    'r': [0.00158, 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN',
          'NaN', 0.001485, 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN',
          'NaN', 'NaN']
    }, index=['2014-01-02', '2014-01-02', '2014-01-02', '2014-01-02',
              '2014-01-02', '2014-01-02', '2014-01-02', '2014-01-02',
              '2014-01-02', '2014-01-02', '2014-01-03', '2014-01-03',
              '2014-01-03', '2014-01-03', '2014-01-03', '2014-01-03',
              '2014-01-03', '2014-01-03', '2014-01-03', '2014-01-03'])
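Update:
Now that the actual recursive equation is known, here is the updated apply function:
def rolling_apply(group):
    # Start from the given first value of r for this day.
    r = [group.r.iloc[0]]
    # Each new value uses the previous row's x, so the last x is not needed.
    for x in group.x.iloc[:-1]:
        r.append((1 + r[-1]) * (1 + x * 0.25) - 1)
    group.r = r
    return group

# group_keys=False keeps the original index on recent pandas versions.
df['r'] = df.groupby(df.index, group_keys=False).apply(rolling_apply).r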
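For this particular recurrence a loop is not strictly required. Since 1 + r[t] = (1 + r[t-1]) * (1 + 0.25 * x[t-1]), the value at each row is the day's starting value compounded by a cumulative product of factors. The sketch below is a different technique from the apply-based answer (a closed form via cumprod), shown on a shortened, hypothetical copy of the data with three rows per day; the function name recursive_r_vectorized and the scale parameter are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's data: three rows per day.
dates = ["2014-01-02"] * 3 + ["2014-01-03"] * 3
df = pd.DataFrame(
    {
        "x": [0.00275, 0.00285, 0.0031, 0.0027, 0.00285, 0.0031],
        "r": [0.00158, np.nan, np.nan, 0.001485, np.nan, np.nan],
    },
    index=dates,
)


def recursive_r_vectorized(frame, scale=0.25):
    """Closed form of r[t] = (1 + r[t-1]) * (1 + scale * x[t-1]) - 1 per index label."""
    key = frame.index
    factor = 1 + scale * frame["x"]
    # Cumulative product of the factors within each day ...
    growth = factor.groupby(key).cumprod()
    # ... shifted by one row within each day, because r[t] uses x[t-1];
    # the first row of each day keeps its starting value (factor 1).
    growth = growth.groupby(key).shift(1).fillna(1.0)
    # Broadcast each day's known starting value to all rows of that day.
    start = frame["r"].groupby(key).transform("first")
    return (1 + start) * growth - 1


df["r"] = recursive_r_vectorized(df)
print(df)
```

This only works because the recurrence is multiplicative in 1 + r; a recurrence without such a closed form would still need a loop.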
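【Discussion】:
Is it also possible to do this as a lambda function with apply? I'm just curious.
@moondra, the OP's calculation is somewhat involved, so even if it could be expressed as a lambda, I think that would be poor coding style.
How does this compare with a for loop in terms of execution time?
@SuperCodeBrah, I'm not entirely sure what you are asking, but .apply is essentially a Python for loop.
OK, thanks for the clarification. I thought that was the case but wanted to check, since as far as I know numpy uses C under the hood. I suppose using apply with a single function (rather than a for loop) is a bit more elegant, but I was hoping it would also bring a speed benefit.
【Answer 2】: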
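Stephen Rauch's answer was very helpful. Since I was looking for a column "r" in which only the subsequent values of each day are computed while the initial values (0.00158, 0.001485) stay unchanged, I am additionally posting the final solution (in case anyone runs into a similar problem). In Stephen Rauch's solution, the value in R[0] belongs to r[1], and so on. Therefore the data must be shifted by one row for every rank except 1.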
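Written out, the recurrence that this shift produces appears to be the following, with the first step checked against the 2014-01-02 data. This is a sketch inferred from the code in this thread; the subscripted symbols r_t and x_t for the values at rank t are notation introduced here.

```latex
r_t = (1 + r_{t-1})\,(1 + 0.25\,x_{t-1}) - 1, \qquad t = 2, \dots, 10

% First step for 2014-01-02 (r_1 = 0.00158, x_1 = 0.00275):
r_2 = (1 + 0.00158)\,(1 + 0.25 \cdot 0.00275) - 1 \approx 0.00226859
```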
Test data
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'x': [0.00275, 0.00285, 0.0031, 0.0036, 0.0043, 0.0052, 0.0063, 0.00755,
          0.00895, 0.0105, 0.0027, 0.00285, 0.0031, 0.00355, 0.00425,
          0.0051, 0.00615, 0.00735, 0.00875, 0.0103],
    'r': [0.00158, 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN',
          'NaN', 0.001485, 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN',
          'NaN', 'NaN']
    }, index=['2014-01-02', '2014-01-02', '2014-01-02', '2014-01-02',
              '2014-01-02', '2014-01-02', '2014-01-02', '2014-01-02',
              '2014-01-02', '2014-01-02', '2014-01-03', '2014-01-03',
              '2014-01-03', '2014-01-03', '2014-01-03', '2014-01-03',
              '2014-01-03', '2014-01-03', '2014-01-03', '2014-01-03'])
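Code
import numpy as np

def rolling_apply(group):
    r = [group.r.iloc[0]]
    for x in group.x:
        r.append((1 + r[-1]) * (1 + x * 0.25) - 1)
    group.r = r[1:]
    return group

# group_keys=False keeps the original index on recent pandas versions.
df['R'] = df.groupby(df.index, group_keys=False).apply(rolling_apply).r
# Keep the given start value at rank 1; elsewhere take the previous row's R.
df['r'] = np.where(df['rank'] == 1, df['r'], df['R'].shift(1))
df = df.drop(columns='R')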
Result
                     r  rank        x
2014-01-02     0.00158     1  0.00275
2014-01-02  0.00226859     2  0.00285
2014-01-02   0.0029827     3  0.00310
2014-01-02  0.00376001     4  0.00360
2014-01-02   0.0046634     5  0.00430
2014-01-02  0.00574341     6  0.00520
2014-01-02  0.00705088     7  0.00630
2014-01-02  0.00863698     8  0.00755
2014-01-02   0.0105408     9  0.00895
2014-01-02   0.0128019    10  0.01050
2014-01-03    0.001485     1  0.00270
2014-01-03    0.002161     2  0.00285
2014-01-03  0.00287504     3  0.00310
2014-01-03  0.00365227     4  0.00355
2014-01-03  0.00454301     5  0.00425
2014-01-03  0.00561034     6  0.00510
2014-01-03  0.00689249     7  0.00615
2014-01-03  0.00844059     8  0.00735
2014-01-03   0.0102936     9  0.00875
2014-01-03   0.0125036    10  0.01030