Python pandas: what is the best way to apply operations on the first and last row of a grouping, adding the results as a column?

Posted 2023-03-25

Asked: 2021-09-19 06:10:52

Question:

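I have a set of records of workers' starttime and stoptime. I am trying to compute the difference between the last stop time in a worker's shift and their first start time, but I'm not sure how to use the first() and last() functions in Python correctly. Here is the data frame work: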

     worker  veh   shift_id              starttime                stoptime
0  11133y   QQUK1   111333         2018-12-21 15:17:29     2018-12-21 15:18:57
1  44706h   FF243   447064         2019-01-01 00:10:16     2019-01-01 00:16:32
2  44706h   FF243   447064         2019-01-01 00:27:11     2019-01-01 00:31:38
3  44706h   FF243   447064         2019-01-01 00:46:20     2019-01-01 01:04:54
4  44761y   LL525   447617         2019-01-01 00:19:06     2019-01-01 00:39:43
5  44842q   OO454   448429         2019-01-01 00:12:35     2019-01-01 00:19:09
6  44842q   OO454   448429         2019-01-01 00:47:55     2019-01-01 01:00:01
7  44842q   OO454   448429         2019-01-01 01:12:47     2019-01-01 02:01:50
8  46090u   OP324   460908         2019-01-01 00:16:23     2019-01-01 00:39:46
9  46090u   OP324   460908         2019-01-01 00:58:02     2019-01-01 01:19:02

I would like to get output like this:

     worker  veh   shift_id              starttime                stoptime       hrs_per_gig
0  11133y   QQUK1   111333         2018-12-21 15:17:29     2018-12-21 15:18:57       .0010
1  44706h   FF243   447064         2019-01-01 00:10:16     2019-01-01 00:16:32       .0379
2  44706h   FF243   447064         2019-01-01 00:27:11     2019-01-01 00:31:38       .0379
3  44706h   FF243   447064         2019-01-01 00:46:20     2019-01-01 01:04:54       .0379
4  44761y   LL525   447617         2019-01-01 00:19:06     2019-01-01 00:39:43       .0143
5  44842q   OO454   448429         2019-01-01 00:12:35     2019-01-01 00:19:09       .0758
6  44842q   OO454   448429         2019-01-01 00:47:55     2019-01-01 01:00:01       .0758
7  44842q   OO454   448429         2019-01-01 01:12:47     2019-01-01 02:01:50       .0758
8  46090u   OP324   460908         2019-01-01 00:16:23     2019-01-01 00:39:46       .0435
9  46090u   OP324   460908         2019-01-01 00:58:02     2019-01-01 01:19:02       .0435

In R, this is simple with the data.table package. I do something like this:

#my grouping variables
group_by = c('worker', 'veh', 'shift_id')

#produce a new column that calculates difference in first and last work times in hours
work[
     ,hrs_per_gig:=as.numeric(difftime(last(stoptime),first(starttime), units = "hours"))
     ,group_by]

I'm not sure how to get the same result in Python. I tried the following:

#my grouping variables
group_by = ['worker', 'veh', 'shift_id']

#produce a new column that calculates difference in first and last work times in hours
work['hrs_per_gig'] = df.groupby(group_by).last('stoptime') - 
df.groupby(group_by['starttime'].first()

But I got the error ValueError: cannot join with no overlapping index names. Any suggestions would be greatly appreciated. Thanks.

Answer 1:

You can get the components of the resulting timedelta objects like this:

grp = df.groupby(group_by)
duration_per_gig = (grp['stoptime'].last() - 
                    grp['starttime'].first()).dt.components

Example:

In [56]: duration_per_gig = (grp['stoptime'].last() - grp['starttime'].first()).dt.components

In [57]: duration_per_gig
Out[57]: 
                       days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
worker veh   shift_id                                                                        
11133y QQUK1 111333       0      0        1       28             0             0            0
44706h FF243 447064       0      0       54       38             0             0            0
44761y LL525 447617       0      0       20       37             0             0            0
44842q OO454 448429       0      1       49       15             0             0            0
46090u OP324 460908       0      1        2       39             0             0            0
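
If a single number of hours is wanted rather than separate components, one possible sketch is to convert the per-group timedelta with .dt.total_seconds() and attach it to the original rows with DataFrame.join. The sample frame below is a shortened copy of the question's data, and the names hours and result are illustrative, not part of the original answer.

```python
import pandas as pd

# Shortened copy of the question's data (first five rows only).
work = pd.DataFrame({
    "worker": ["11133y", "44706h", "44706h", "44706h", "44761y"],
    "veh": ["QQUK1", "FF243", "FF243", "FF243", "LL525"],
    "shift_id": [111333, 447064, 447064, 447064, 447617],
    "starttime": pd.to_datetime([
        "2018-12-21 15:17:29", "2019-01-01 00:10:16", "2019-01-01 00:27:11",
        "2019-01-01 00:46:20", "2019-01-01 00:19:06",
    ]),
    "stoptime": pd.to_datetime([
        "2018-12-21 15:18:57", "2019-01-01 00:16:32", "2019-01-01 00:31:38",
        "2019-01-01 01:04:54", "2019-01-01 00:39:43",
    ]),
})

group_by = ["worker", "veh", "shift_id"]
grp = work.groupby(group_by)

# One value per group: last stop minus first start, expressed in hours.
hours = (grp["stoptime"].last() - grp["starttime"].first()).dt.total_seconds() / 3600
hours = hours.rename("hrs_per_gig")  # join() needs a named Series

# Match the group keys in `work` against the MultiIndex of `hours`.
result = work.join(hours, on=group_by)
print(result)
```

This still builds an intermediate per-group object, which is the cost the asker wanted to avoid.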

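Comments:

Unfortunately, the problem with this approach is that it creates a new object that I then have to join back. I was hoping to run this in one line, as in R, and add the column directly to the existing data frame.

Ah yes, I hear you. I don't like copying data around a DataFrame either if I can help it. You could look at .transform() and see whether it does what you need.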
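
A minimal sketch of the .transform() idea, assuming the column names from the question and a shortened copy of its data: transform("first") and transform("last") return values aligned with the original rows, so the result can be assigned directly as a new column without a join.

```python
import pandas as pd

# Shortened copy of the question's data (first five rows only).
work = pd.DataFrame({
    "worker": ["11133y", "44706h", "44706h", "44706h", "44761y"],
    "veh": ["QQUK1", "FF243", "FF243", "FF243", "LL525"],
    "shift_id": [111333, 447064, 447064, 447064, 447617],
    "starttime": pd.to_datetime([
        "2018-12-21 15:17:29", "2019-01-01 00:10:16", "2019-01-01 00:27:11",
        "2019-01-01 00:46:20", "2019-01-01 00:19:06",
    ]),
    "stoptime": pd.to_datetime([
        "2018-12-21 15:18:57", "2019-01-01 00:16:32", "2019-01-01 00:31:38",
        "2019-01-01 01:04:54", "2019-01-01 00:39:43",
    ]),
})

group_by = ["worker", "veh", "shift_id"]
grp = work.groupby(group_by)

# transform() broadcasts each group's first/last value back to every row
# of that group, so the subtraction is row-aligned with `work`.
work["hrs_per_gig"] = (
    grp["stoptime"].transform("last") - grp["starttime"].transform("first")
).dt.total_seconds() / 3600

print(work)
```

The values here are plain elapsed hours computed from the timestamps; like first() and last(), this relies on the rows within each group already being in chronological order.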