Python pandas difference of specific rows in groupby
【Posted】: 2020-06-19 02:29:35
【Question】:
I have a pandas DataFrame:
import pandas as pd

# The column data must be passed as a dict, so the braces are required.
df = pd.DataFrame({
    'Firm': ['Firm1', 'Firm1', 'Firm1', 'Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm2', 'Firm2', 'Firm2', 'Firm2', 'Firm2'],
    'Location': ['Country1', 'Country1', 'Country1', 'Country2', 'Country2', 'Country2', 'Country1', 'Country1', 'Country1', 'Country2', 'Country2', 'Country2'],
    'Currency': ['Curr1', 'Curr2', 'Curr3', 'Curr1', 'Curr2', 'Curr3', 'Curr1', 'Curr2', 'Curr3', 'Curr1', 'Curr2', 'Curr3'],
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})
which looks like this:
df:
Firm Location Currency Value
0 Firm1 Country1 Curr1 100
1 Firm1 Country1 Curr2 105
2 Firm1 Country1 Curr3 110
3 Firm1 Country2 Curr1 100
4 Firm1 Country2 Curr2 95
5 Firm1 Country2 Curr3 120
6 Firm2 Country1 Curr1 95
7 Firm2 Country1 Curr2 110
8 Firm2 Country1 Curr3 115
9 Firm2 Country2 Curr1 105
10 Firm2 Country2 Curr2 120
11 Firm2 Country2 Curr3 90
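Now, for each Firm-Location group, I want to compute the difference between the Curr3 and Curr2 rows (in the Value column), that is Curr3 minus Curr2, and overwrite the Curr3 value with the result. The resulting df should look like this: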
Firm Location Currency Value
0 Firm1 Country1 Curr1 100
1 Firm1 Country1 Curr2 105
2 Firm1 Country1 Curr3 5
3 Firm1 Country2 Curr1 100
4 Firm1 Country2 Curr2 95
5 Firm1 Country2 Curr3 25
6 Firm2 Country1 Curr1 95
7 Firm2 Country1 Curr2 110
8 Firm2 Country1 Curr3 5
9 Firm2 Country2 Curr1 105
10 Firm2 Country2 Curr2 120
11 Firm2 Country2 Curr3 -30
I tried .groupby with .apply, which gives me the right numbers, but I want the transformation to happen in the original DataFrame.
df2 = df.groupby(['Firm','Location']).apply(lambda g: g[g.Currency == 'Curr3'].Value.values[0] - g[g.Currency == 'Curr2'].Value.values[0])
df2:
Firm Location 0
Firm1 Country1 5
Firm1 Country2 25
Firm2 Country1 5
Firm2 Country2 -30
I cannot figure out how to do this in place in the original df. I also tried the same thing with .transform, but that raises an error:
df2 = df.groupby(['Firm','Location']).transform(lambda g: g[g.Currency == 'Curr3'].Value.values[0] - g[g.Currency == 'Curr2'].Value.values[0])
AttributeError: ("'Series' object has no attribute 'Currency'", 'occurred at index Currency')
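The AttributeError suggests that .transform handed the lambda a single column (a Series) rather than the whole group, so g.Currency is not available there. One way to get the .apply result back into the original frame is to look it up by the (Firm, Location) key of each Curr3 row. This is a minimal sketch, not the method from the answer; the helper name curr3_minus_curr2 and the variables keys, diffs, is_curr3 and group_index are illustrative assumptions, and it relies on every group having exactly one Curr2 row and one Curr3 row.

```python
import pandas as pd

df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr1', 'Curr2', 'Curr3'] * 4,
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})
keys = ['Firm', 'Location']  # assumed name for the grouping columns


def curr3_minus_curr2(g):
    """Illustrative helper: Curr3 minus Curr2 within one group."""
    by_currency = g.set_index('Currency')['Value']
    return by_currency['Curr3'] - by_currency['Curr2']


# One scalar per (Firm, Location) group, indexed by the group keys.
diffs = df.groupby(keys)[['Currency', 'Value']].apply(curr3_minus_curr2)

# Look up each Curr3 row's group result and write it back into df.
is_curr3 = df['Currency'] == 'Curr3'
group_index = pd.MultiIndex.from_frame(df.loc[is_curr3, keys])
df.loc[is_curr3, 'Value'] = diffs.reindex(group_index).to_numpy()

print(df)
```

Because the lookup goes through the group keys and the Currency labels, it does not depend on the order of the rows inside a group.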
---- Update based on Erfan's solution:
newvals = (
df.where(df['Currency'].isin(['Curr2', 'Curr3']))
.groupby(['Firm', 'Location'])['Value'].diff()
)
df['Value'] = newvals.fillna(df['Value'])
If df looks like this (currencies not sorted), the solution no longer works, because diff() only computes the difference to the previous value:
Firm Location Currency Value
0 Firm1 Country1 Curr2 100
1 Firm1 Country1 Curr1 105
2 Firm1 Country1 Curr3 110
3 Firm1 Country2 Curr3 100
4 Firm1 Country2 Curr2 95
5 Firm1 Country2 Curr1 120
6 Firm2 Country1 Curr1 95
7 Firm2 Country1 Curr2 110
8 Firm2 Country1 Curr3 115
9 Firm2 Country2 Curr2 105
10 Firm2 Country2 Curr3 120
11 Firm2 Country2 Curr1 90
-> Result:
Firm Location Currency Value
0 Firm1 Country1 Curr2 100.0
1 Firm1 Country1 Curr1 105.0
2 Firm1 Country1 Curr3 10.0
3 Firm1 Country2 Curr3 100.0
4 Firm1 Country2 Curr2 -5.0
5 Firm1 Country2 Curr1 120.0
6 Firm2 Country1 Curr1 95.0
7 Firm2 Country1 Curr2 110.0
8 Firm2 Country1 Curr3 5.0
9 Firm2 Country2 Curr2 105.0
10 Firm2 Country2 Curr3 15.0
11 Firm2 Country2 Curr1 90.0
Now the difference is no longer always computed as Curr3 minus Curr2, and it is no longer always the Curr3 value that gets replaced.
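For the unsorted case, one order-independent option is to look the values up by their Currency label instead of relying on row position. The sketch below is an assumption about how this could be done, not part of the original thread's answer: it reshapes Value into one column per currency with set_index and unstack, subtracts the labelled columns, and joins the result back on the group keys. The names keys, wide, diff and per_row are made up for illustration, and it assumes each (Firm, Location, Currency) combination occurs exactly once.

```python
import pandas as pd

# The unsorted frame from the question.
df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr2', 'Curr1', 'Curr3', 'Curr3', 'Curr2', 'Curr1',
                 'Curr1', 'Curr2', 'Curr3', 'Curr2', 'Curr3', 'Curr1'],
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})
keys = ['Firm', 'Location']  # assumed name for the grouping columns

# One row per (Firm, Location), one column per currency.
wide = df.set_index(keys + ['Currency'])['Value'].unstack('Currency')

# Curr3 minus Curr2 per group, independent of the original row order.
diff = (wide['Curr3'] - wide['Curr2']).rename('diff')

# Broadcast the per-group result to every row, then overwrite only Curr3 rows.
per_row = df.join(diff, on=keys)['diff']
df['Value'] = df['Value'].where(df['Currency'] != 'Curr3', per_row)

print(df)
```

With this data the Curr3 rows become 10, 5, 5 and 15, and all other rows keep their original values.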
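【Comments】:
I saw your edit; I will update the answer when I get home.
【Answer 1】:
Use DataFrame.where, Series.isin, GroupBy.diff and Series.fillna:
First we use where to turn every row whose Currency is not Curr2 or Curr3 (here, the Curr1 rows) into NaN. Then we group by Firm and Location and compute the row-to-row difference of Value. Finally, fillna restores the original Value wherever the difference is NaN.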
newvals = (
df.where(df['Currency'].isin(['Curr2', 'Curr3']))
.groupby(['Firm', 'Location'])['Value'].diff()
)
df['Value'] = newvals.fillna(df['Value'])
Firm Location Currency Value
0 Firm1 Country1 Curr1 100.0
1 Firm1 Country1 Curr2 105.0
2 Firm1 Country1 Curr3 5.0
3 Firm1 Country2 Curr1 100.0
4 Firm1 Country2 Curr2 95.0
5 Firm1 Country2 Curr3 25.0
6 Firm2 Country1 Curr1 95.0
7 Firm2 Country1 Curr2 110.0
8 Firm2 Country1 Curr3 5.0
9 Firm2 Country2 Curr1 105.0
10 Firm2 Country2 Curr2 120.0
11 Firm2 Country2 Curr3 -30.0
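To make the steps of this answer easier to follow, the sketch below splits the chain into named intermediates and prints them. The variable names masked, newvals and result are illustrative. The final astype('int64') is an addition that is not in the answer: it assumes all values are whole numbers and only serves to undo the float conversion that the NaN masking causes. Like the answer itself, this depends on Curr2 appearing directly before Curr3 within each group.

```python
import pandas as pd

df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr1', 'Curr2', 'Curr3'] * 4,
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

# Step 1: rows that are not Curr2/Curr3 become NaN in every column.
masked = df.where(df['Currency'].isin(['Curr2', 'Curr3']))
print(masked)

# Step 2: row-to-row difference of Value within each (Firm, Location) group.
newvals = masked.groupby(['Firm', 'Location'])['Value'].diff()
print(newvals)

# Step 3: keep the original Value wherever no difference was computed,
# then cast back to integers (assumes whole-number values).
result = newvals.fillna(df['Value']).astype('int64')
df['Value'] = result
print(df)
```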
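【Comments】:
That's great, this is exactly what I was looking for! Thank you very much for the quick solution! However, .diff() only computes the difference to the previous item along the axis. Imagine a situation where you want the difference for more than one pair of rows that are not necessarily adjacent (for example Curr2-Curr3 and Curr3-Curr1), or where you want to specify which value is subtracted from which (Curr2-Curr3 or the other way round). How would you implement that?
Could you edit your question or add another example of a dataset where my current solution fails? I think I understand what you mean, but it is better to be sure, so that I can provide an adequate solution.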
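The thread does not show a solution for arbitrary pairs, so the following is only a sketch of one possible approach. It assumes a hypothetical helper, replace_with_differences, that takes a rules dict mapping the currency whose row should be overwritten to a (minuend, subtrahend) pair. Which row each difference should replace is not stated in the comment; the rules used here are an illustrative choice. All differences are computed from the original values, so the rules do not interfere with each other.

```python
import pandas as pd


def replace_with_differences(df, rules, keys=('Firm', 'Location')):
    """Hypothetical helper, not a pandas API.

    rules maps target currency -> (minuend, subtrahend); the target row's
    Value is replaced by minuend minus subtrahend within its group.
    """
    keys = list(keys)
    # One row per group, one column per currency (original values only).
    wide = df.set_index(keys + ['Currency'])['Value'].unstack('Currency')
    out = df.copy()
    for target, (minuend, subtrahend) in rules.items():
        diff = (wide[minuend] - wide[subtrahend]).rename('diff')
        per_row = df.join(diff, on=keys)['diff']
        is_target = df['Currency'] == target
        out.loc[is_target, 'Value'] = per_row[is_target]
    return out


df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr1', 'Curr2', 'Curr3'] * 4,
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

# Illustrative rules: Curr3 rows get Curr2 - Curr3, Curr1 rows get Curr3 - Curr1.
rules = {'Curr3': ('Curr2', 'Curr3'), 'Curr1': ('Curr3', 'Curr1')}
result = replace_with_differences(df, rules)
print(result)
```

Swapping the two names in a pair reverses the direction of the subtraction, which covers the "Curr2-Curr3 or the other way round" part of the comment.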