Python pandas difference of specific rows in groupby

Posted: 2020-06-19 02:29:35

Question:

I have a pandas DataFrame:

import pandas as pd

# The dict literal needs braces; the original snippet omitted them.
df = pd.DataFrame({
    'Firm': ['Firm1','Firm1','Firm1','Firm1','Firm1','Firm1','Firm2','Firm2','Firm2','Firm2','Firm2','Firm2'],
    'Location': ['Country1', 'Country1', 'Country1', 'Country2', 'Country2', 'Country2','Country1', 'Country1', 'Country1', 'Country2', 'Country2', 'Country2'],
    'Currency': ['Curr1', 'Curr2', 'Curr3', 'Curr1', 'Curr2', 'Curr3','Curr1', 'Curr2', 'Curr3', 'Curr1', 'Curr2', 'Curr3'],
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

which looks like this:

df:

     Firm  Location Currency  Value
0   Firm1  Country1    Curr1    100
1   Firm1  Country1    Curr2    105
2   Firm1  Country1    Curr3    110
3   Firm1  Country2    Curr1    100
4   Firm1  Country2    Curr2     95
5   Firm1  Country2    Curr3    120
6   Firm2  Country1    Curr1     95
7   Firm2  Country1    Curr2    110
8   Firm2  Country1    Curr3    115
9   Firm2  Country2    Curr1    105
10  Firm2  Country2    Curr2    120
11  Firm2  Country2    Curr3     90

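Now I want to compute, for each Firm-Location group, the difference between Curr3 and Curr2 (in the Value column) and overwrite the Curr3 value with the result. The resulting df should look like this: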

     Firm  Location Currency  Value
0   Firm1  Country1    Curr1    100
1   Firm1  Country1    Curr2    105
2   Firm1  Country1    Curr3      5
3   Firm1  Country2    Curr1    100
4   Firm1  Country2    Curr2     95
5   Firm1  Country2    Curr3     25
6   Firm2  Country1    Curr1     95
7   Firm2  Country1    Curr2    110
8   Firm2  Country1    Curr3      5
9   Firm2  Country2    Curr1    105
10  Firm2  Country2    Curr2    120
11  Firm2  Country2    Curr3    -30

I tried .groupby.apply, which gives me the right numbers, but I want to make the change in the original DataFrame.

df2 = df.groupby(['Firm','Location']).apply(lambda g: g[g.Currency == 'Curr3'].Value.values[0] - g[g.Currency == 'Curr2'].Value.values[0])

df2:

Firm    Location    0
Firm1   Country1    5
Firm1   Country2    25
Firm2   Country1    5
Firm2   Country2    -30

I can't figure out how to do this in place in the original df. I tried the same thing with .transform, but it raises an error:

df2 = df.groupby(['Firm','Location']).transform(lambda g: g[g.Currency == 'Curr3'].Value.values[0] - g[g.Currency == 'Curr2'].Value.values[0])

AttributeError: ("'Series' object has no attribute 'Currency'", 'occurred at index Currency')
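As the error message suggests, .transform handed the lambda a single column (a Series), so g.Currency is not available there. One way to keep the .apply logic and still update df in place is to compute one difference per group and then look it up for the Curr3 rows. This is only a sketch: the helper name curr3_minus_curr2 and the variables diffs, is_curr3 and keys are illustrative, and it assumes every group contains exactly one Curr2 and one Curr3 row.

```python
import pandas as pd

df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr1', 'Curr2', 'Curr3'] * 4,
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

def curr3_minus_curr2(g):
    # g is the sub-frame of one (Firm, Location) group (hypothetical helper).
    by_currency = g.set_index('Currency')['Value']
    return by_currency['Curr3'] - by_currency['Curr2']

# One scalar per group, indexed by (Firm, Location).
diffs = df.groupby(['Firm', 'Location'])[['Currency', 'Value']].apply(curr3_minus_curr2)

# Look up each Curr3 row's group difference and write it back in place.
is_curr3 = df['Currency'] == 'Curr3'
keys = pd.MultiIndex.from_frame(df.loc[is_curr3, ['Firm', 'Location']])
df.loc[is_curr3, 'Value'] = diffs.reindex(keys).to_numpy()

print(df)
```

Because the lookup goes through the Currency labels rather than row positions, the row order inside a group should not matter here.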

---- Update, based on Erfan's solution:

newvals = (
    df.where(df['Currency'].isin(['Curr2', 'Curr3']))
      .groupby(['Firm', 'Location'])['Value'].diff()
)
df['Value'] = newvals.fillna(df['Value'])

If df looks like this (currencies not sorted), the solution no longer works, because diff() only computes the difference from the previous value:

    Firm    Location    Currency    Value
0   Firm1   Country1    Curr2   100
1   Firm1   Country1    Curr1   105
2   Firm1   Country1    Curr3   110
3   Firm1   Country2    Curr3   100
4   Firm1   Country2    Curr2   95
5   Firm1   Country2    Curr1   120
6   Firm2   Country1    Curr1   95
7   Firm2   Country1    Curr2   110
8   Firm2   Country1    Curr3   115
9   Firm2   Country2    Curr2   105
10  Firm2   Country2    Curr3   120
11  Firm2   Country2    Curr1   90

-> Result:

    Firm    Location    Currency    Value
0   Firm1   Country1    Curr2   100.0
1   Firm1   Country1    Curr1   105.0
2   Firm1   Country1    Curr3   10.0
3   Firm1   Country2    Curr3   100.0
4   Firm1   Country2    Curr2   -5.0
5   Firm1   Country2    Curr1   120.0
6   Firm2   Country1    Curr1   95.0
7   Firm2   Country1    Curr2   110.0
8   Firm2   Country1    Curr3   5.0
9   Firm2   Country2    Curr2   105.0
10  Firm2   Country2    Curr3   15.0
11  Firm2   Country2    Curr1   90.0

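Now the Curr3 value is no longer consistently replaced by the difference between Curr3 and Curr2. For Firm1/Country2, for example, the Curr2 row is overwritten with -5 while the Curr3 row keeps its original value.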
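A possible order-independent variant, not taken from the answer below, is to pivot the currencies into columns so that the subtraction is done by label instead of by row position. The sketch uses the unsorted data from this update; wide, diff and out are illustrative names, and it assumes each (Firm, Location, Currency) combination occurs only once, since unstack fails on duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr2', 'Curr1', 'Curr3', 'Curr3', 'Curr2', 'Curr1',
                 'Curr1', 'Curr2', 'Curr3', 'Curr2', 'Curr3', 'Curr1'],
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

# One row per (Firm, Location), one column per currency.
wide = df.set_index(['Firm', 'Location', 'Currency'])['Value'].unstack('Currency')

# Label-based subtraction: independent of the row order in df.
diff = (wide['Curr3'] - wide['Curr2']).rename('Diff')

# Bring the group-level difference back to every row, then overwrite Curr3 only.
out = df.join(diff, on=['Firm', 'Location'])
is_curr3 = out['Currency'] == 'Curr3'
df['Value'] = out['Value'].mask(is_curr3, out['Diff'])

print(df)
```

Swapping the operands in the subtraction gives Curr2 - Curr3 instead, which also covers the case where the direction of the difference has to be chosen explicitly.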

Comments:

I saw your edit; I'll update the answer when I get home.

Answer 1:

Use DataFrame.where, Series.isin, GroupBy.diff and Series.fillna.

First we turn all Curr1 rows into NaN with where, then we group by Firm and Location and compute the difference of Value.

newvals = (
    df.where(df['Currency'].isin(['Curr2', 'Curr3']))
      .groupby(['Firm', 'Location'])['Value'].diff()
)
df['Value'] = newvals.fillna(df['Value'])

df:

     Firm  Location Currency  Value
0   Firm1  Country1    Curr1  100.0
1   Firm1  Country1    Curr2  105.0
2   Firm1  Country1    Curr3    5.0
3   Firm1  Country2    Curr1  100.0
4   Firm1  Country2    Curr2   95.0
5   Firm1  Country2    Curr3   25.0
6   Firm2  Country1    Curr1   95.0
7   Firm2  Country1    Curr2  110.0
8   Firm2  Country1    Curr3    5.0
9   Firm2  Country2    Curr1  105.0
10  Firm2  Country2    Curr2  120.0
11  Firm2  Country2    Curr3  -30.0
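To see what each step of this answer contributes, the intermediate results can be inspected one at a time. The sketch below rebuilds the question's DataFrame; the names keep, masked and step are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr1', 'Curr2', 'Curr3'] * 4,
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

# Step 1: rows that are not Curr2/Curr3 become NaN in every column,
# so they drop out of the grouping (NaN group keys are skipped).
keep = df['Currency'].isin(['Curr2', 'Curr3'])
masked = df.where(keep)

# Step 2: within each group, diff() is current row minus previous row,
# i.e. Curr3 - Curr2 only when Curr2 sits directly before Curr3.
step = masked.groupby(['Firm', 'Location'])['Value'].diff()

# Step 3: everything that has no difference falls back to the original value.
newvals = step.fillna(df['Value'])

print(masked.head(3))
print(step.head(3))
print(newvals.head(3))
```

Step 2 is the part that depends on the row order inside each group.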

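Comments:

Great, that's exactly what I wanted! Thanks a lot for the quick solution! However, .diff() only computes the difference from the previous item along the axis. Imagine a case where you want the differences for more than one pair of rows that are not necessarily adjacent (e.g. Curr2-Curr3 and Curr3-Curr1), or where you want to specify which value is subtracted from which (Curr2-Curr3 or the reverse). How would you implement that?

Could you edit your question or add another example dataset where my current solution fails? I think I see what you mean, but it's better to be sure so that I can provide an adequate solution.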
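The thread does not include a follow-up answer to this. One way the request for several explicitly ordered pairs could be approached is sketched below. pairwise_differences is a hypothetical helper, not part of pandas or of the answer above, and it assumes each currency appears at most once per Firm-Location group.

```python
import pandas as pd

def pairwise_differences(df, pairs):
    """Hypothetical helper: one column 'A-B' per (A, B) pair, one row per group."""
    # Currencies become columns, so rows need not be adjacent or sorted.
    wide = df.set_index(['Firm', 'Location', 'Currency'])['Value'].unstack('Currency')
    return pd.DataFrame({f'{a}-{b}': wide[a] - wide[b] for a, b in pairs})

df = pd.DataFrame({
    'Firm': ['Firm1'] * 6 + ['Firm2'] * 6,
    'Location': (['Country1'] * 3 + ['Country2'] * 3) * 2,
    'Currency': ['Curr1', 'Curr2', 'Curr3'] * 4,
    'Value': [100, 105, 110, 100, 95, 120, 95, 110, 115, 105, 120, 90],
})

# Each tuple is (minuend, subtrahend), so the direction is explicit.
result = pairwise_differences(df, [('Curr2', 'Curr3'), ('Curr3', 'Curr1')])
print(result)
```

The result has one row per (Firm, Location) group. It can be joined back onto df with df.join(result, on=['Firm', 'Location']) if the differences are needed next to the original rows.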
