Groupby and difference in Python/R
【Title】: Groupby and difference in Python/R 【Posted】: 2021-10-04 20:57:57 【Question】: I have a dataset as follows
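I want to group by the Agent column and get the difference between the maximum and minimum Resolved.time for each agent (for example, for Adnan Shaikh the output would be 01:58:22).
How can I do this in Python/R?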
【Comments】:
What have you tried so far? For each agent, is Resolved.time always monotonically increasing?
【Answer 1】:
In Python, this would be:
import numpy as np
import pandas as pd

# Sample data; the last row has a missing resolution time
df = pd.DataFrame(data={
    "Agent": ["Adnan Shaikh", "Adnan Shaikh", "Adnan Shaikh",
              "Akshay Padaya", "Akshay Padaya", "Akshay Padaya",
              "Akshay Padaya"],
    "Resolved.time": ["2021-07-28 12:11",
                      "2021-07-28 12:23",
                      "2021-07-28 13:06",
                      "2021-07-28 10:44",
                      "2021-07-28 12:45",
                      "2021-07-28 13:05",
                      np.nan],
})
# Parse the strings as datetimes; the missing value becomes NaT
df["Resolved.time"] = pd.to_datetime(df["Resolved.time"], format="%Y-%m-%d %H:%M")

# Per agent: latest minus earliest resolution time (NaT is skipped)
result = df.groupby("Agent").agg(
    Resolved_time=("Resolved.time", lambda x: x.max() - x.min())
).reset_index()
The result looks like this:
| | Agent | Resolved_time |
|---|---|---|
| 0 | Adnan Shaikh | 0 days 00:55:00 |
| 1 | Akshay Padaya | 0 days 02:21:00 |
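The question asks for output in the form 01:58:22 rather than the default pandas rendering (0 days 00:55:00). Below is a minimal sketch of one way to format the per-agent span as HH:MM:SS. The helper name `format_hms` is hypothetical, the data is the sample data from this answer (not the asker's original dataset), and the sketch assumes every agent has at least one non-missing timestamp.

```python
import pandas as pd

# Sample data from the answer above; the last timestamp is missing
df = pd.DataFrame({
    "Agent": ["Adnan Shaikh"] * 3 + ["Akshay Padaya"] * 4,
    "Resolved.time": pd.to_datetime([
        "2021-07-28 12:11", "2021-07-28 12:23", "2021-07-28 13:06",
        "2021-07-28 10:44", "2021-07-28 12:45", "2021-07-28 13:05", None,
    ]),
})


def format_hms(delta: pd.Timedelta) -> str:
    """Hypothetical helper: render a non-negative Timedelta as HH:MM:SS."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


times = df.groupby("Agent")["Resolved.time"]
span = times.max() - times.min()   # max/min skip missing values (NaT)
formatted = span.map(format_hms)   # Series of strings indexed by Agent
print(formatted)
```

Hours are not wrapped at 24, so a span longer than one day would print as, say, 26:10:00 instead of being split into days.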
【Comments】:
【Answer 2】: In R, something like:
library(tidyverse)
df <- tibble(
  agent = c("Adnan Shaikh", "Adnan Shaikh", "Adnan Shaikh", "Akshay Padaya", "Akshay Padaya", "Akshay Padaya", "Akshay Padaya"),
  Resolved.time = lubridate::ymd_hm(c("2021-07-28 12:11", "2021-07-28 12:23", "2021-07-28 13:06", "2021-07-28 10:44", "2021-07-28 12:45", "2021-07-28 13:05", NA))
)
df %>%
  na.omit() %>%        # drop the row with a missing Resolved.time
  group_by(agent) %>%
  mutate(
    # take the span in seconds explicitly, so the difftime units cannot vary by group
    result = as.numeric(difftime(max(Resolved.time), min(Resolved.time), units = "secs")),
    result = lubridate::seconds_to_period(result)
  )
which gives:
# A tibble: 6 x 3
# Groups: agent [2]
agent Resolved.time result
<chr> <dttm> <Period>
1 Adnan Shaikh 2021-07-28 12:11:00 55M 0S
2 Adnan Shaikh 2021-07-28 12:23:00 55M 0S
3 Adnan Shaikh 2021-07-28 13:06:00 55M 0S
4 Akshay Padaya 2021-07-28 10:44:00 2H 21M 0S
5 Akshay Padaya 2021-07-28 12:45:00 2H 21M 0S
6 Akshay Padaya 2021-07-28 13:05:00 2H 21M 0S
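For comparison, the same "keep every row and attach the per-group span" shape can be sketched in pandas with `groupby(...).transform`, which returns a value for each original row, much like `mutate` after `group_by`. This is a rough analogue under the same sample data, not a translation of the R answer; the column name `result` is simply carried over from it.

```python
import pandas as pd

# Same sample data as in the answers; the last timestamp is missing
df = pd.DataFrame({
    "Agent": ["Adnan Shaikh"] * 3 + ["Akshay Padaya"] * 4,
    "Resolved.time": pd.to_datetime([
        "2021-07-28 12:11", "2021-07-28 12:23", "2021-07-28 13:06",
        "2021-07-28 10:44", "2021-07-28 12:45", "2021-07-28 13:05", None,
    ]),
})

# Drop rows without a timestamp (rough analogue of na.omit())
clean = df.dropna(subset=["Resolved.time"])

# transform() broadcasts each group's max/min back onto that group's rows
grouped = clean.groupby("Agent")["Resolved.time"]
out = clean.assign(result=grouped.transform("max") - grouped.transform("min"))
print(out)
```

If only one row per agent is wanted, as the question describes, aggregating instead of transforming (as in Answer 1) is the more direct route.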
【Comments】: