Pandas faster apply pd.pct_change for multiple columns and multiple intervals
[Posted]: 2019-07-20 16:03:54
[Question]: I am looping through a multi-indexed Pandas dataframe to generate historical percent-change columns for each column. The first level of the index is the date. The second level of the index is the symbol. The head of the input data is:
                   price_open  price_high  price_low  price_close  volume  price_adj_close
date       symbol
1962-01-02 AA           65.37       65.75      65.37        65.37  134400             0.70
1962-01-03 AA           65.37       66.37      65.25        66.37  179200             0.71
1962-01-04 AA           66.37       66.87      66.37        66.37  193600             0.71
1962-01-05 AA           66.37       66.75      66.12        66.25  169600             0.71
1962-01-08 AA           66.00       66.00      63.50        64.00  225600             0.68
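The desired output should contain a series of new columns for each input column. My output dataframe is quite wide, so here is the list of column names for the full dataframe: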
Index(['price_open', 'price_high', 'price_low', 'price_close', 'volume',
       'price_adj_close', 'price_open_1d_pct', 'price_open_3d_pct',
       'price_open_5d_pct', 'price_open_10d_pct', 'price_open_15d_pct',
       'price_open_30d_pct', 'price_high_1d_pct', 'price_high_3d_pct',
       'price_high_5d_pct', 'price_high_10d_pct', 'price_high_15d_pct',
       'price_high_30d_pct', 'price_low_1d_pct', 'price_low_3d_pct',
       'price_low_5d_pct', 'price_low_10d_pct', 'price_low_15d_pct',
       'price_low_30d_pct', 'price_close_1d_pct', 'price_close_3d_pct',
       'price_close_5d_pct', 'price_close_10d_pct', 'price_close_15d_pct',
       'price_close_30d_pct', 'volume_1d_pct', 'volume_3d_pct',
       'volume_5d_pct', 'volume_10d_pct', 'volume_15d_pct', 'volume_30d_pct',
       'price_adj_close_1d_pct', 'price_adj_close_3d_pct',
       'price_adj_close_5d_pct', 'price_adj_close_10d_pct',
       'price_adj_close_15d_pct', 'price_adj_close_30d_pct',
       'price_7d_future'],
      dtype='object')
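The names in the column list above follow a regular pattern: each of the six input columns gets one `<column>_<n>d_pct` column per window, followed by the single target column. A minimal sketch of that pattern is below; the variable names `base_cols`, `windows`, `pct_cols` and `expected_cols` are illustrative and do not appear in the original post.

```python
# Base columns and look-back windows, taken from the question.
base_cols = ['price_open', 'price_high', 'price_low',
             'price_close', 'volume', 'price_adj_close']
windows = [1, 3, 5, 10, 15, 30]

# One "<column>_<n>d_pct" name per (column, window) pair,
# grouped by column, followed by the single target column.
pct_cols = [f"{c}_{n}d_pct" for c in base_cols for n in windows]
expected_cols = base_cols + pct_cols + ['price_7d_future']

print(len(expected_cols))  # 6 inputs + 36 pct columns + 1 target = 43
```

A list like this can be handy for checking that a rewritten version of the loop still produces the same columns in the same order.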
Here is the head of the output dataframe:
price_open price_high price_low price_close volume price_adj_close price_open_1d_pct price_open_3d_pct price_open_5d_pct price_open_10d_pct price_open_15d_pct price_open_30d_pct price_high_1d_pct price_high_3d_pct price_high_5d_pct price_high_10d_pct price_high_15d_pct price_high_30d_pct price_low_1d_pct price_low_3d_pct price_low_5d_pct price_low_10d_pct price_low_15d_pct price_low_30d_pct price_close_1d_pct price_close_3d_pct price_close_5d_pct price_close_10d_pct price_close_15d_pct price_close_30d_pct volume_1d_pct volume_3d_pct volume_5d_pct volume_10d_pct volume_15d_pct volume_30d_pct price_adj_close_1d_pct price_adj_close_3d_pct price_adj_close_5d_pct price_adj_close_10d_pct price_adj_close_15d_pct price_adj_close_30d_pct price_7d_future
date symbol
1962-02-13 AA 58.75 59.13 58.75 58.88 150400 0.63 0.008584 -0.006427 -0.006427 -0.010610 -0.028926 -0.101270 0.012847 -0.004210 0.000000 -0.012525 -0.032717 -0.100684 0.010666 0.004274 0.008584 -0.010610 -0.022950 -0.101270 0.012902 0.000000 0.002213 -0.016700 -0.020788 -0.099281 2.760000 0.205128 0.540984 1.043478 0.807692 0.119048 0.016129 0.000000 0.000000 -0.015625 -0.015625 -0.100000 0.031746
1962-02-14 AA 58.50 58.50 57.63 58.00 136000 0.62 -0.004255 -0.006454 -0.006454 -0.023046 -0.027108 -0.105094 -0.010654 -0.012658 -0.016807 -0.025000 -0.027108 -0.118578 -0.019064 -0.010644 -0.021230 -0.033540 -0.014872 -0.116782 -0.014946 -0.004292 -0.019110 -0.029289 -0.029289 -0.126111 -0.095745 0.666667 0.231884 0.011905 0.231884 -0.241071 -0.015873 0.000000 -0.015873 -0.031250 -0.031250 -0.126761 0.048387
1962-02-15 AA 58.00 59.00 57.50 57.50 150400 0.62 -0.008547 -0.004292 -0.019110 -0.027335 -0.029289 -0.126111 0.008547 0.010620 -0.006399 -0.010565 -0.014696 -0.117691 -0.002256 -0.010838 -0.017094 -0.021277 -0.027566 -0.133645 -0.008621 -0.010838 -0.023438 -0.031660 -0.027566 -0.133645 0.105882 2.760000 0.205128 -0.078431 0.649123 -0.223140 0.000000 0.000000 -0.015873 -0.015873 -0.015873 -0.126761 0.048387
1962-02-16 AA 57.50 58.38 57.50 58.38 134400 0.62 -0.008621 -0.021277 -0.023438 -0.031660 -0.027566 -0.133645 -0.010508 -0.012684 -0.014684 -0.022929 -0.016841 -0.125393 0.000000 -0.021277 -0.012876 -0.025424 -0.027566 -0.130369 0.015304 -0.008492 0.002232 -0.012684 -0.016841 -0.118792 -0.106383 -0.106383 0.647059 0.826087 0.473684 -0.207547 0.000000 -0.015873 0.000000 -0.015873 -0.015873 -0.126761 0.048387
1962-02-19 AA 58.50 59.00 58.50 58.88 72000 0.63 0.017391 0.000000 0.004292 -0.016807 -0.014820 -0.113636 0.010620 0.008547 0.010620 -0.014696 -0.012552 -0.106061 0.017391 0.015096 0.006365 -0.016807 -0.010654 -0.078740 0.008565 0.015172 0.012902 -0.010420 -0.006245 -0.080000 -0.464286 -0.470588 0.800000 -0.587156 -0.296875 -0.680851 0.016129 0.016129 0.016129 0.000000 0.000000 -0.073529 0.063492
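The following code runs very slowly because there are millions of records, and I don't know how to speed it up. Can anyone offer some coding tips to make it faster?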
import numpy as np
import pandas as pd

# df is the (date, symbol) multi-indexed frame shown above
features_targets_df = pd.DataFrame()
for s in df.index.unique(level='symbol'):
    stock_df = df.iloc[df.index.get_level_values('symbol') == s].copy()
    for c in stock_df:
        for n in [1, 3, 5, 10, 15, 30]:  # make day-change columns
            stock_df['{}_{}d_pct'.format(c, n)] = stock_df[c].pct_change(n)
    stock_df = stock_df.replace([np.inf, -np.inf], np.nan)
    stock_df['price_7d_future'] = stock_df['price_adj_close'].shift(-7).pct_change(7)
    # DataFrame.append was removed in pandas 2.0; use pd.concat there
    features_targets_df = features_targets_df.append(stock_df)
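For readers unsure what the target line computes: `shift(-k).pct_change(k)` gives a forward-looking return, so the value at row t is p[t+k] / p[t] - 1. The sketch below checks this on made-up prices with k = 2 instead of 7 to keep it short. The names `prices`, `k`, `future_ret` and `direct` are illustrative, and passing `fill_method=None` is an assumption added here so that the trailing NaN rows are not forward-filled, which pandas versions with a forward-filling default would otherwise do.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.2, 15.0])  # made-up data
k = 2  # look-ahead window (the question uses 7)

# shift(-k) moves the price from k rows ahead onto the current row;
# pct_change(k) then compares it with the shifted value k rows back,
# which is the current row's own price.
future_ret = prices.shift(-k).pct_change(k, fill_method=None)

# Direct form of the same quantity: p[t+k] / p[t] - 1
direct = prices.shift(-k) / prices - 1

print(pd.DataFrame({'future_ret': future_ret, 'direct': direct}))
```

The two columns agree wherever `future_ret` is defined. Note that the `shift`/`pct_change` form is also NaN for the first k rows, because `pct_change(k)` needs k earlier values of the shifted series, while the direct form is only NaN for the last k rows.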
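[Comments]:
Please post code-formatted data rather than images so that we can copy it directly. Also, please provide your expected output.
I edited the post based on your suggestions, along with a few other things.
In your code you have stock_name_column. Can I assume this is the same as the index level symbol in your input data?
Which column does stock_price_column refer to? And target_time_prediction? The code is appreciated, but it would be better if it were executable.
Sorry, I have now taken out the stock_name_column and target_time_prediction variables (and a few others).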
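[Answer 1]:
When I first tried to speed this up I used your data, but 5 rows are not enough to give meaningful evidence. So I created a larger dataframe (1k rows) containing 2 symbols, in the same format you already have. Here is the code to reproduce my test data: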
import pandas as pd  # version 0.23.4
import numpy as np  # version 1.15.4

np.random.seed(1)

df1 = pd.DataFrame(
    index=[
        pd.date_range(start='1962-01-02', periods=1000, freq='D'),
        ['AA']*500 + ['BB']*500
    ],
    columns=[
        'price_open',
        'price_high',
        'price_low',
        'price_close',
        'volume',
        'price_adj_close'
    ],
    data=np.random.random(size=(1000, 6))
)
df1.index.names = ['date', 'symbol']
I timed your original code on this new data:
%%timeit
features_targets_df = pd.DataFrame()
for s in df1.index.unique(level='symbol'):
    stock_df = df1.iloc[df1.index.get_level_values('symbol') == s].copy()
    for c in stock_df:
        for n in [1, 3, 5, 10, 15, 30]:  # make day-change columns
            stock_df['{}_{}d_pct'.format(c, n)] = stock_df[c].pct_change(n)
    stock_df = stock_df.replace([np.inf, -np.inf], np.nan)
    # a 2-row window is used in this timing run; the question uses 7
    stock_df['price_7d_future'] = stock_df['price_adj_close'].shift(-2).pct_change(2)
    # DataFrame.append was removed in pandas 2.0; use pd.concat there
    features_targets_df = features_targets_df.append(stock_df)
Output:
159 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
My code replaces some of the for-loops with groupby() and apply():
%%timeit
# copy the original df instead of defining an empty one
features_targets_df = df1.copy()
# loop through the day changes
for n in [1, 3, 5, 10, 15, 30]:
    # group by the "symbol" index level
    #   (group_keys=False keeps the original index on newer pandas versions so the join lines up)
    # select only the original columns (otherwise the second and later iterations
    #   would also compute pct_change on the newly added columns)
    # apply pct_change to each group using a lambda
    # add a suffix to the new columns (f-strings require Python 3.6+)
    # replace +/- infinity with NaN
    # join to features_targets_df
    features_targets_df = features_targets_df.join(
        features_targets_df.groupby(level='symbol', group_keys=False)[
            [
                'price_open',
                'price_high',
                'price_low',
                'price_close',
                'volume',
                'price_adj_close'
            ]
        ].apply(lambda x: x.pct_change(n)).add_suffix(f"_{n}d_pct")
    ).replace([np.inf, -np.inf], np.nan)
# group by the "symbol" index level for both the shift and the pct_change,
# so that the change is never computed across two different symbols
# (a 2-row window is used in this timing run; the question uses 7)
features_targets_df['price_7d_future'] = (
    features_targets_df.groupby(level='symbol').price_adj_close.shift(-2)
    .groupby(level='symbol').pct_change(2)
)
Output:
88.4 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That cuts the run time by roughly 44% on this test data (from 159 ms to 88.4 ms). Hope this helps!
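One further variation, which is not part of the timed answer above and has not been benchmarked here: GroupBy objects expose `pct_change` directly, so the lambda can be dropped, and the per-window results can be collected and concatenated once instead of joining inside the loop. The sketch below assumes the same (date, symbol) index layout; the helper name `make_pct_features` and its parameters are hypothetical, and the demo frame is made-up data.

```python
import numpy as np
import pandas as pd


def make_pct_features(df, cols, windows, level='symbol'):
    """Append per-symbol percent-change columns for each window (illustrative helper)."""
    grouped = df.groupby(level=level)[cols]
    parts = [df]
    for n in windows:
        # GroupBy.pct_change returns a frame on the original index,
        # so the pieces line up without any re-alignment.
        parts.append(grouped.pct_change(n).add_suffix(f"_{n}d_pct"))
    # A single concat avoids growing the frame inside the loop.
    out = pd.concat(parts, axis=1)
    return out.replace([np.inf, -np.inf], np.nan)


# Small synthetic frame: 40 dates x 2 symbols (made-up data, no NaNs).
rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', periods=40, freq='D')
idx = pd.MultiIndex.from_product([dates, ['AA', 'BB']], names=['date', 'symbol'])
cols = ['price_open', 'price_close']
demo = pd.DataFrame(rng.random((len(idx), len(cols))) + 1.0, index=idx, columns=cols)

features = make_pct_features(demo, cols, windows=[1, 3, 5])
print(features.shape)  # (80, 8): 2 input columns + 2 columns x 3 windows
```

Whether this is actually faster than the `apply` version would need to be measured on the real data, since the balance depends on the number of symbols and rows.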