Subtract successive rows in a dataframe grouped by id in pandas (Python)
Posted: 2016-10-11 08:01:54
Question: I have the following dataframe:
id day total_amount
1 2015-07-09 1000
1 2015-10-22 100
1 2015-11-12 200
1 2015-11-27 2392
1 2015-12-16 123
7 2015-07-09 200
7 2015-07-09 1000
7 2015-08-27 100018
7 2015-11-25 1000
8 2015-08-27 1000
8 2015-12-07 10000
8 2016-01-18 796
8 2016-03-31 10000
15 2015-09-10 1500
15 2015-09-30 1000
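For rows that share the same id, I need to subtract each pair of consecutive times in the day column until I reach the last row for that id, and then start subtracting again in the day column for the next id. The output should look something like the following rows: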
1 2015-08-09 1000 2015-11-22 - 2015-08-09
1 2015-11-22 100 2015-12-12 - 2015-11-22
1 2015-12-12 200 2015-12-16 - 2015-12-12
1 2015-12-16 2392 2015-12-27 - 2015-12-27
1 2015-12-27 123 NA
7 2015-08-09 200 2015-09-09 - 2015-08-09
7 2015-09-09 1000 2015-09-27 - 2015-09-09
7 2015-09-27 100018 2015-12-25 - 2015-09-27
7 2015-12-25 1000 NA
8 2015-08-27 1000 2015-12-07 - 2015-08-27
8 2015-12-07 10000 2016-02-18 - 2015-12-07
8 2016-02-18 796 2016-04-31- 2016-02-18
8 2016-04-31 10000 NA
15 2015-10-10 1500 2015-10-30 - 2015-10-10
15 2015-10-30 1000 NA
Comments:
@exp1orer Thanks for your help.
@AMM Thank you very much for your help.
Answer 1:
You can use DataFrameGroupBy.diff:
df['dif'] = df.groupby('id')['day'].diff(-1) * (-1)
print (df)
id day total_amount dif
0 1 2015-07-09 1000 105 days
1 1 2015-10-22 100 21 days
2 1 2015-11-12 200 15 days
3 1 2015-11-27 2392 19 days
4 1 2015-12-16 123 NaT
5 7 2015-07-09 200 0 days
6 7 2015-07-09 1000 49 days
7 7 2015-08-27 100018 90 days
8 7 2015-11-25 1000 NaT
9 8 2015-08-27 1000 102 days
10 8 2015-12-07 10000 42 days
11 8 2016-01-18 796 73 days
12 8 2016-03-31 10000 NaT
13 15 2015-09-10 1500 20 days
14 15 2015-09-30 1000 NaT
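The sign handling in the one-liner above is easy to miss, so here is a minimal self-contained sketch on a subset of the question's rows. It assumes the day column already holds datetimes (built here with pd.to_datetime); the subset and the variable names are only for illustration.

```python
import pandas as pd

# A subset of the rows from the question, with `day` parsed as datetimes.
df = pd.DataFrame({
    "id": [1, 1, 1, 7, 7, 7],
    "day": pd.to_datetime(["2015-07-09", "2015-10-22", "2015-11-12",
                           "2015-07-09", "2015-07-09", "2015-08-27"]),
    "total_amount": [1000, 100, 200, 200, 1000, 100018],
})

# diff(-1) computes (current row - next row) within each id,
# so multiplying by -1 turns it into (next row - current row).
df["dif"] = df.groupby("id")["day"].diff(-1) * (-1)
print(df)
```

The last row of each id has no following row, so its difference is NaT.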
Another solution with apply and shift:
df['diff'] = df.groupby('id')['day'].apply(lambda x: x.shift(-1) - x)
print (df)
id day total_amount diff
0 1 2015-07-09 1000 105 days
1 1 2015-10-22 100 21 days
2 1 2015-11-12 200 15 days
3 1 2015-11-27 2392 19 days
4 1 2015-12-16 123 NaT
5 7 2015-07-09 200 0 days
6 7 2015-07-09 1000 49 days
7 7 2015-08-27 100018 90 days
8 7 2015-11-25 1000 NaT
9 8 2015-08-27 1000 102 days
10 8 2015-12-07 10000 42 days
11 8 2016-01-18 796 73 days
12 8 2016-03-31 10000 NaT
13 15 2015-09-10 1500 20 days
14 15 2015-09-30 1000 NaT
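Depending on the pandas version, groupby(...).apply with a lambda may return a result indexed by the group keys, which then may not align with df when assigned back as a column. A variant that avoids apply is sketched below: it calls shift directly on the grouped column, so the result keeps the original index. This is an alternative sketch, not the answerer's code, and the small frame is a subset of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [8, 8, 8, 15, 15],
    "day": pd.to_datetime(["2015-08-27", "2015-12-07", "2016-01-18",
                           "2015-09-10", "2015-09-30"]),
})

# GroupBy.shift(-1) pulls the next row's day within the same id;
# the result stays aligned with df's original index.
next_day = df.groupby("id")["day"].shift(-1)
df["diff"] = next_day - df["day"]
print(df)
```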
Edit, following a comment:
If you need the difference in hours as a number, convert the timedelta to hours:
import numpy as np
df['diff'] = df.groupby('id')['day'].diff(-1) * (-1) / np.timedelta64(1, 'h')
print (df)
id day total_amount diff
0 1 2015-07-09 1000 2520.0
1 1 2015-10-22 100 504.0
2 1 2015-11-12 200 360.0
3 1 2015-11-27 2392 456.0
4 1 2015-12-16 123 NaN
5 7 2015-07-09 200 0.0
6 7 2015-07-09 1000 1176.0
7 7 2015-08-27 100018 2160.0
8 7 2015-11-25 1000 NaN
9 8 2015-08-27 1000 2448.0
10 8 2015-12-07 10000 1008.0
11 8 2016-01-18 796 1752.0
12 8 2016-03-31 10000 NaN
13 15 2015-09-10 1500 480.0
14 15 2015-09-30 1000 NaN
df['diff'] = df.groupby('id')['day'].apply(lambda x: x.shift(-1) - x) / np.timedelta64(1, 'h')
print (df)
id day total_amount diff
0 1 2015-07-09 1000 2520.0
1 1 2015-10-22 100 504.0
2 1 2015-11-12 200 360.0
3 1 2015-11-27 2392 456.0
4 1 2015-12-16 123 NaN
5 7 2015-07-09 200 0.0
6 7 2015-07-09 1000 1176.0
7 7 2015-08-27 100018 2160.0
8 7 2015-11-25 1000 NaN
9 8 2015-08-27 1000 2448.0
10 8 2015-12-07 10000 1008.0
11 8 2016-01-18 796 1752.0
12 8 2016-03-31 10000 NaN
13 15 2015-09-10 1500 480.0
14 15 2015-09-30 1000 NaN
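The same conversion can also be written with the timedelta accessor instead of dividing by np.timedelta64. The sketch below uses the two timestamps quoted in the comments and places them under a single hypothetical id; the column name diff_hours is made up for illustration.

```python
import pandas as pd

# The two timestamps from the comments, grouped under one hypothetical id.
df = pd.DataFrame({
    "id": [1, 1],
    "day": pd.to_datetime(["2015-07-09 10:11:47", "2015-10-22 08:45:30"]),
})

# next row - current row within each id, as a timedelta
delta = df.groupby("id")["day"].diff(-1) * (-1)

# total_seconds() returns floats; dividing by 3600 converts seconds to hours.
df["diff_hours"] = delta.dt.total_seconds() / 3600
print(df)
```

Because the timestamps carry a time of day, the result is a fractional number of hours rather than a multiple of 24.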
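Comments:
The logic seems correct, but I need the time difference.
Definitely the best answer. How can I change the code if I want the time difference in other units? For example, I have these two times: 2015-10-22 08:45:30 and 2015-07-09 10:11:47. How can I get the difference in hours rather than days?
What dtype do you need for the diff column: int or timedelta?
When I run df['dif'] = df.groupby('id')['day'].diff(-1) * (-1) I get a ValueError.
The problem was in my dataset: the day column was a string. After converting it to datetime, your code works now.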
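The fix described in the last comment can be sketched as follows, assuming the dates are stored as strings in YYYY-MM-DD form. The explicit format argument is optional and shown only as an assumption about the data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [15, 15],
    "day": ["2015-09-10", "2015-09-30"],  # strings, not datetimes
})

# Parse the strings so that subtraction between rows is defined.
df["day"] = pd.to_datetime(df["day"], format="%Y-%m-%d")

df["dif"] = df.groupby("id")["day"].diff(-1) * (-1)
print(df.dtypes)
print(df)
```

Checking df.dtypes before computing the difference is a quick way to confirm that day is a datetime column.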