How to make pandas merge_asof include all events, not only one
【Posted】: 2018-07-17 12:55:21
【Question】: I have a dataset of shift starts and ends like this:
schedule = pd.DataFrame({
    "start": pd.to_datetime(['2017-01-01 00:59:00', '2017-01-01 04:59:00', '2017-01-02 00:59:00', '2017-01-02 08:00:00', '2017-01-02 09:59:00']),
    "end": pd.to_datetime(['2017-01-01 09:59:00', '2017-01-01 18:00:00', '2017-01-02 09:59:00', '2017-01-02 15:59:00', '2017-01-02 18:00:00']),
    "employee": ['KC', 'IT', 'ED', 'NK', 'IT']
})
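Ultimately I want to know how many people (and who) are working at a given time of day. So I tried building a new DataFrame of timestamps at the frequency I want: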
shifts = pd.DataFrame()
shifts['timestamp'] = pd.date_range(start=schedule.start.min(), end=schedule.end.max(), freq='2H')
and then [conditionally] merging it with my original schedule as follows:
mrg = pd.merge_asof(shifts, schedule, left_on='timestamp', right_on='start').query('timestamp <= end')
The result looks like this:
timestamp employee end start
0 2017-01-01 00:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
1 2017-01-01 02:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
2 2017-01-01 04:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
3 2017-01-01 06:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
4 2017-01-01 08:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
5 2017-01-01 10:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
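My problem is that both KC and IT are working at the timestamps between 2017-01-01 04:59:00 and 2017-01-01 09:59:00, yet the mrg DataFrame keeps only the rows for IT. Why is that? What am I missing in the arguments I pass to merge_asof?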
【Answer 1】: It looks like you need to combine every employee with every timestamp, and then add the by parameter:
from itertools import product

t = pd.date_range(start=schedule.start.min(), end=schedule.end.max(), freq='2H')
e = schedule['employee'].unique().tolist()
# One row per (timestamp, employee) pair
shifts = pd.DataFrame(list(product(t, e)), columns=['timestamp', 'employee'])
print(shifts.head(10))
timestamp employee
0 2017-01-01 00:59:00 KC
1 2017-01-01 00:59:00 IT
2 2017-01-01 00:59:00 ED
3 2017-01-01 00:59:00 NK
4 2017-01-01 02:59:00 KC
5 2017-01-01 02:59:00 IT
6 2017-01-01 02:59:00 ED
7 2017-01-01 02:59:00 NK
8 2017-01-01 04:59:00 KC
9 2017-01-01 04:59:00 IT
mrg = pd.merge_asof(shifts,
                    schedule,
                    left_on='timestamp',
                    right_on='start',
                    by='employee').query('timestamp <= end')
print(mrg)
timestamp employee end start
0 2017-01-01 00:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
4 2017-01-01 02:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
8 2017-01-01 04:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
9 2017-01-01 04:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
12 2017-01-01 06:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
13 2017-01-01 06:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
16 2017-01-01 08:59:00 KC 2017-01-01 09:59:00 2017-01-01 00:59:00
17 2017-01-01 08:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
21 2017-01-01 10:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
25 2017-01-01 12:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
29 2017-01-01 14:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
33 2017-01-01 16:59:00 IT 2017-01-01 18:00:00 2017-01-01 04:59:00
50 2017-01-02 00:59:00 ED 2017-01-02 09:59:00 2017-01-02 00:59:00
54 2017-01-02 02:59:00 ED 2017-01-02 09:59:00 2017-01-02 00:59:00
58 2017-01-02 04:59:00 ED 2017-01-02 09:59:00 2017-01-02 00:59:00
62 2017-01-02 06:59:00 ED 2017-01-02 09:59:00 2017-01-02 00:59:00
66 2017-01-02 08:59:00 ED 2017-01-02 09:59:00 2017-01-02 00:59:00
67 2017-01-02 08:59:00 NK 2017-01-02 15:59:00 2017-01-02 08:00:00
69 2017-01-02 10:59:00 IT 2017-01-02 18:00:00 2017-01-02 09:59:00
71 2017-01-02 10:59:00 NK 2017-01-02 15:59:00 2017-01-02 08:00:00
73 2017-01-02 12:59:00 IT 2017-01-02 18:00:00 2017-01-02 09:59:00
75 2017-01-02 12:59:00 NK 2017-01-02 15:59:00 2017-01-02 08:00:00
77 2017-01-02 14:59:00 IT 2017-01-02 18:00:00 2017-01-02 09:59:00
79 2017-01-02 14:59:00 NK 2017-01-02 15:59:00 2017-01-02 08:00:00
81 2017-01-02 16:59:00 IT 2017-01-02 18:00:00 2017-01-02 09:59:00
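For context: merge_asof returns at most one match for each row of the left frame, which is why the original attempt could show only one employee per timestamp. The sketch below rebuilds the frames from this answer and then collapses mrg into a per-timestamp headcount, which is what the question was ultimately after. The names naive and on_duty are made up for this illustration, and freq='2h' is simply the lowercase spelling of the two-hour frequency used above.

```python
from itertools import product

import pandas as pd

schedule = pd.DataFrame({
    "start": pd.to_datetime(['2017-01-01 00:59:00', '2017-01-01 04:59:00',
                             '2017-01-02 00:59:00', '2017-01-02 08:00:00',
                             '2017-01-02 09:59:00']),
    "end": pd.to_datetime(['2017-01-01 09:59:00', '2017-01-01 18:00:00',
                           '2017-01-02 09:59:00', '2017-01-02 15:59:00',
                           '2017-01-02 18:00:00']),
    "employee": ['KC', 'IT', 'ED', 'NK', 'IT'],
})

t = pd.date_range(start=schedule.start.min(), end=schedule.end.max(), freq='2h')

# Without `by`, every timestamp gets exactly one row: the latest shift start.
naive = pd.merge_asof(pd.DataFrame({'timestamp': t}), schedule,
                      left_on='timestamp', right_on='start')

# With one left row per (timestamp, employee), each employee is matched separately.
e = schedule['employee'].unique().tolist()
shifts = pd.DataFrame(list(product(t, e)), columns=['timestamp', 'employee'])
mrg = pd.merge_asof(shifts, schedule, left_on='timestamp', right_on='start',
                    by='employee').query('timestamp <= end')

# Collapse to one row per timestamp: how many people are working, and who.
on_duty = mrg.groupby('timestamp')['employee'].agg(headcount='count', who=', '.join)
print(on_duty)
```

Timestamps at which nobody is working do not appear in on_duty; reindexing it on t would bring them back if they are needed.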
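【Comments】:
Ah... that makes sense. Thanks! One thing though: I get an error at product(t, e). That product function, which can take a DataFrame and a list, isn't a plain Python function, is it?!
Oops, I forgot: you need from itertools import product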
【Answer 2】:
conditional_join from pyjanitor may help as a convenient abstraction:
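# pip install pyjanitor
import pandas as pd
import janitor

# Judging by the output below, shifts here is the question's original
# frame with only the timestamp column.
shifts.conditional_join(
    schedule,
    ('timestamp', 'start', '>='),
    ('timestamp', 'end', '<=')
)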
timestamp start end employee
0 2017-01-01 00:59:00 2017-01-01 00:59:00 2017-01-01 09:59:00 KC
1 2017-01-01 02:59:00 2017-01-01 00:59:00 2017-01-01 09:59:00 KC
2 2017-01-01 04:59:00 2017-01-01 00:59:00 2017-01-01 09:59:00 KC
3 2017-01-01 04:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
4 2017-01-01 06:59:00 2017-01-01 00:59:00 2017-01-01 09:59:00 KC
5 2017-01-01 06:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
6 2017-01-01 08:59:00 2017-01-01 00:59:00 2017-01-01 09:59:00 KC
7 2017-01-01 08:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
8 2017-01-01 10:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
9 2017-01-01 12:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
10 2017-01-01 14:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
11 2017-01-01 16:59:00 2017-01-01 04:59:00 2017-01-01 18:00:00 IT
12 2017-01-02 00:59:00 2017-01-02 00:59:00 2017-01-02 09:59:00 ED
13 2017-01-02 02:59:00 2017-01-02 00:59:00 2017-01-02 09:59:00 ED
14 2017-01-02 04:59:00 2017-01-02 00:59:00 2017-01-02 09:59:00 ED
15 2017-01-02 06:59:00 2017-01-02 00:59:00 2017-01-02 09:59:00 ED
16 2017-01-02 08:59:00 2017-01-02 00:59:00 2017-01-02 09:59:00 ED
17 2017-01-02 08:59:00 2017-01-02 08:00:00 2017-01-02 15:59:00 NK
18 2017-01-02 10:59:00 2017-01-02 08:00:00 2017-01-02 15:59:00 NK
19 2017-01-02 10:59:00 2017-01-02 09:59:00 2017-01-02 18:00:00 IT
20 2017-01-02 12:59:00 2017-01-02 08:00:00 2017-01-02 15:59:00 NK
21 2017-01-02 12:59:00 2017-01-02 09:59:00 2017-01-02 18:00:00 IT
22 2017-01-02 14:59:00 2017-01-02 08:00:00 2017-01-02 15:59:00 NK
23 2017-01-02 14:59:00 2017-01-02 09:59:00 2017-01-02 18:00:00 IT
24 2017-01-02 16:59:00 2017-01-02 09:59:00 2017-01-02 18:00:00 IT
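This returns the rows where the timestamp falls between start and end. If the intervals do not overlap, a more efficient solution is to use pd.IntervalIndex.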
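A minimal sketch of the pd.IntervalIndex idea follows. The shifts in the question overlap, so the lookup would not work on them as-is; the schedule below is a hypothetical non-overlapping one invented for this illustration, as are the names intervals and pos.

```python
import pandas as pd

# Hypothetical shifts that do NOT overlap (unlike the data in the question).
schedule = pd.DataFrame({
    "start": pd.to_datetime(['2017-01-01 00:00:00', '2017-01-01 08:00:00',
                             '2017-01-01 16:00:00']),
    "end": pd.to_datetime(['2017-01-01 07:59:00', '2017-01-01 15:59:00',
                           '2017-01-01 21:00:00']),
    "employee": ['KC', 'IT', 'ED'],
})
shifts = pd.DataFrame({
    'timestamp': pd.date_range('2017-01-01 01:00:00', '2017-01-01 23:00:00', freq='2h')
})

# One closed interval [start, end] per shift.
intervals = pd.IntervalIndex.from_arrays(schedule['start'], schedule['end'],
                                         closed='both')

# Row position of the shift containing each timestamp, -1 if there is none.
# get_indexer raises for overlapping intervals, hence the non-overlap condition.
pos = intervals.get_indexer(shifts['timestamp'])

# Reindexing on the positions leaves NaN where pos is -1 (no matching shift).
shifts['employee'] = schedule['employee'].reindex(pos).to_numpy()
print(shifts)
```

Each timestamp is matched to at most one shift here, so this fits only when a single person can be on duty at any moment.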