Skip first n rows in each group
[Posted]: 2021-02-18 11:06:28
[Question]: Suppose I have a Pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['company A']*5 + ['company B']*5,
                   'Date': ['01.01.2020', '01.02.2020', '01.03.2020', '01.04.2020', '01.05.2020'] +
                           ['01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020'],
                   'Revenue': np.random.rand(1, 10)[0]*10000})
Company Date Revenue
0 company A 01.01.2020 5033.243098
1 company A 01.02.2020 5967.112256
2 company A 01.03.2020 6328.425874
3 company A 01.04.2020 7289.514777
4 company A 01.05.2020 9642.728016
5 company B 01.04.2020 805.708717
6 company B 01.05.2020 162.177508
7 company B 01.06.2020 7549.296095
8 company B 01.07.2020 4398.211089
9 company B 01.08.2020 1651.938946
The goal is to get a DataFrame that excludes the first N months of each company (here N = 2):
Company Date Revenue
2 company A 01.03.2020 5731.949686
3 company A 01.04.2020 4300.537741
4 company A 01.05.2020 4283.022397
7 company B 01.06.2020 8011.727731
8 company B 01.07.2020 1935.579432
9 company B 01.08.2020 3866.649045
For example, like this:
for company in df['Company'].unique():
    company_df = df[df['Company'] == company].sort_values(by='Date')
    ind_to_drop = company_df.iloc[:2].index
    df = df.drop(ind_to_drop)
I am looking for a more efficient approach.
[Answer 1]: You can use:
(df.sort_values(['Date']) # sort values by 'Date'
.groupby('Company', as_index=False) # group by 'Company'
.apply(lambda x: x.iloc[2:]) # skip first two rows
.droplevel(0)) # drop first index level
Output:
Company Date Revenue
2 company A 01.03.2020 559.525103
3 company A 01.04.2020 4692.250518
4 company A 01.05.2020 8206.546659
7 company B 01.06.2020 3519.014808
8 company B 01.07.2020 4902.521804
9 company B 01.08.2020 6533.685687
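As a complement to the apply-based answer above, here is a minimal sketch of the same idea using groupby().cumcount(), which numbers the rows inside each group and lets you filter with a boolean mask instead of calling a Python lambda per group. The variable names n, sorted_df, position and result are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company': ['company A'] * 5 + ['company B'] * 5,
    'Date': ['01.01.2020', '01.02.2020', '01.03.2020', '01.04.2020', '01.05.2020',
             '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020'],
    'Revenue': np.random.rand(10) * 10000,
})

n = 2  # number of leading rows to skip per company (illustrative name)

sorted_df = df.sort_values('Date')
# cumcount() numbers the rows of each group 0, 1, 2, ... in their current order
position = sorted_df.groupby('Company').cumcount()
# keep rows whose position within the group is at least n, then restore index order
result = sorted_df[position >= n].sort_index()
print(result)
```

Because the mask is built from the already sorted frame, its index lines up with sorted_df and no extra index level has to be dropped afterwards.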
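[Answer 2]: I would use groupby to get rid of the repeated per-company filter. Also, I think dropping all the indices in one go should improve performance slightly, since a single pass over the data is enough.
ind_to_drop = list()
for _, data in df.groupby(by=['Company']):
    data = data.sort_values(by='Date')
    ind_to_drop += list(data.iloc[:2].index)
df = df.drop(ind_to_drop)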
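One caveat about sorting by 'Date': the column holds strings in day.month.year form, so sort_values orders them lexicographically. With the sample data (same day and year everywhere) that happens to match chronological order, but in general it may not. The sketch below uses made-up dates to show the difference and assumes the '%d.%m.%Y' format seen in the question.

```python
import pandas as pd

# Hypothetical dates, chosen so that string order and time order disagree
dates = pd.Series(['01.02.2020', '01.12.2019', '15.01.2020'])

# Lexicographic sort compares the day first, so it is not chronological
as_strings = dates.sort_values().tolist()

# Parse with an explicit day.month.year format, then sort
parsed = pd.to_datetime(dates, format='%d.%m.%Y')
chronological = parsed.sort_values().dt.strftime('%d.%m.%Y').tolist()

print(as_strings)     # string order
print(chronological)  # time order
```

If the real data spans several years or days of the month, converting the column with pd.to_datetime before sorting is the safer choice.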
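[Answer 3]: You can use groupby().head() to extract the indices and then drop them. The example below uses head(1), which removes only the first row of each company; use head(2) to remove the first two: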
df.drop(df.sort_values(['Date']).groupby('Company').head(1).index)
Output:
Company Date Revenue
1 company A 01.02.2020 8354.050677
2 company A 01.03.2020 9867.805507
3 company A 01.04.2020 4072.178342
4 company A 01.05.2020 9626.621319
6 company B 01.05.2020 8712.769956
7 company B 01.06.2020 6751.648895
8 company B 01.07.2020 492.769737
9 company B 01.08.2020 1709.737424
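The head-then-drop approach generalizes to any n. The following is a minimal sketch; the helper name skip_first_n and its parameters are hypothetical and not part of pandas or of the original answer. It assumes the DataFrame index is unique, since drop removes rows by index label.

```python
import numpy as np
import pandas as pd

def skip_first_n(df, n, group_col='Company', sort_col='Date'):
    """Return df without the first n rows (ordered by sort_col) of each group."""
    # head(n) keeps the first n rows of every group, preserving original labels
    first_n = df.sort_values(sort_col).groupby(group_col).head(n)
    # drop those labels from the original frame in a single call
    return df.drop(first_n.index)

df = pd.DataFrame({
    'Company': ['company A'] * 5 + ['company B'] * 5,
    'Date': ['01.01.2020', '01.02.2020', '01.03.2020', '01.04.2020', '01.05.2020',
             '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020'],
    'Revenue': np.random.rand(10) * 10000,
})

result = skip_first_n(df, 2)
print(result)
```

With n=2 this keeps index labels 2, 3, 4, 7, 8 and 9, matching the target output in the question.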