Pandas DataFrame groupby, count and sum across columns
Posted: 2021-05-24 12:48:34
Question: I have a dataset like the one below. It holds vehicle counts that accumulate over time.
Image Describing the Expected Output
LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00
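I created a Pandas DataFrame and want to group by Ltime and Rtime to get the total number of vehicles regardless of category (for example, the total number of vehicles in the left lane (L) in a given time period and the total number of vehicles in the right lane (R) in a given time period).
Here is what I tried: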
import pandas as pd

data = pd.read_csv('output2.txt')
data['Ltime'] = pd.to_datetime(data['Ltime'].str.strip())
data['Rtime'] = pd.to_datetime(data['Rtime'].str.strip())
data.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LcounterCar 6 non-null float64
1 LcounterTruck 6 non-null float64
2 LcounterBus 6 non-null float64
3 LcounterMotorcycle 6 non-null float64
4 LcounterVan 6 non-null float64
5 Ltime 6 non-null datetime64[ns]
6 RcounterCar 2 non-null float64
7 RcounterTruck 2 non-null float64
8 RcounterBus 2 non-null float64
9 RcounterMotorcycle 2 non-null float64
10 RcounterVan 2 non-null float64
11 Rtime 2 non-null datetime64[ns]
data.groupby('Ltime')['LcounterCar'].count().reset_index()
Ltime LcounterTruck
0 2021-02-22 13:22:00 1
1 2021-02-22 13:23:00 2
2 2021-02-22 13:24:00 1
3 2021-02-22 13:25:00 2
However, the count is always the same. Instead, this is my expected output:
Ltime, count
13:22:00, 1
13:23:00, 3 (two cars and one truck)
13:24:00, 1
13:25:00, 3
Rtime, count
13:25:00, 1
13:27:00, 1
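One likely reason the counts above look the same for every column: `count()` tallies non-null rows per group, so a cell holding 0 is still counted. The sketch below is only an illustration on a hypothetical three-row subset of the left-lane data (the name `toy` and the reduced column set are assumptions, not part of the question). It contrasts `count()` with counting non-zero cells across the counter columns.

```python
import pandas as pd

# Hypothetical subset of the question's left-lane rows, for illustration only.
toy = pd.DataFrame({
    "LcounterCar": [2, 3, 4],
    "LcounterTruck": [0, 1, 0],
    "Ltime": pd.to_datetime([
        "2021-02-22 13:23:00",
        "2021-02-22 13:23:00",
        "2021-02-22 13:24:00",
    ]),
})

# count() tallies non-null rows per group, so the zeros in LcounterTruck still count.
rows_per_time = toy.groupby("Ltime")["LcounterTruck"].count()

# Counting only the non-zero cells across all counter columns, then summing per timestamp.
counters = ["LcounterCar", "LcounterTruck"]
nonzero_per_time = toy[counters].ne(0).sum(axis=1).groupby(toy["Ltime"]).sum()

print(rows_per_time)
print(nonzero_per_time)
```

Here `rows_per_time` only reflects how many rows share a timestamp, while `nonzero_per_time` reflects how many counter cells are non-zero, which is closer to the expected output listed above.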
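Comments:
Is your sample data correct? The values of LcounterCar are 1,2,3,4,5,6
Answer 1:
- Your data is not consistent with what you describe.
- Treat left and right as separate data sets.
- What you describe is sum() rather than count(), hence sum() is used.
- unstack() the columns so that it becomes a straightforward groupby(level=1).count().
- Handle 0 and NaN so that they are not counted.
- Use concat() to pull left and right together.
- Calculate the final values.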
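import io

import numpy as np
import pandas as pd

# Rebuild the question's sample data from an inline CSV string
df = pd.read_csv(io.StringIO("""LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00"""))
df.Ltime = pd.to_datetime(df.Ltime)
df.Rtime = pd.to_datetime(df.Rtime)
# Per side: keep that side's columns, index by its time column, flatten to long form,
# turn 0 into NaN so count() skips it, then count per timestamp
df2 = pd.concat([
    (df.loc[:, [c for c in df.columns if c[0] == side]]
     .dropna().set_index(f"{side}time")
     .unstack().replace({0: np.nan}).groupby(level=1).count())
    for side in list("LR")]).groupby(level=0).sum()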
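The unstack() step in the answer's snippet is perhaps the least obvious part, so here is a small sketch of what it appears to do in isolation. The frame `small` and its two rows are made up for illustration. When a DataFrame has a single-level index, `unstack()` returns a Series keyed by (column name, index value), which is why `groupby(level=1)` groups by time.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame indexed by time, for illustration only.
small = pd.DataFrame(
    {"LcounterCar": [3, 4], "LcounterTruck": [1, 0]},
    index=pd.DatetimeIndex(
        ["2021-02-22 13:23:00", "2021-02-22 13:24:00"], name="Ltime"
    ),
)

# With a single-level index, unstack() yields a Series keyed by (column, Ltime).
long = small.unstack()

# Replacing 0 with NaN lets count() skip those cells; level=1 is the Ltime level.
per_time = long.replace({0: np.nan}).groupby(level=1).count()

print(long)
print(per_time)
```

In this toy case the 13:23:00 timestamp has two non-zero cells and 13:24:00 has one, since the zero truck count is turned into NaN before counting.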
df
| | LcounterCar | LcounterTruck | LcounterBus | LcounterMotorcycle | LcounterVan | Ltime | RcounterCar | RcounterTruck | RcounterBus | RcounterMotorcycle | RcounterVan | Rtime |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 2021-02-22 13:22:00 | nan | nan | nan | nan | nan | NaT |
1 | 2 | 0 | 0 | 0 | 0 | 2021-02-22 13:23:00 | nan | nan | nan | nan | nan | NaT |
2 | 3 | 1 | 0 | 0 | 0 | 2021-02-22 13:23:00 | nan | nan | nan | nan | nan | NaT |
3 | 4 | 0 | 0 | 0 | 0 | 2021-02-22 13:24:00 | nan | nan | nan | nan | nan | NaT |
4 | 5 | 0 | 0 | 0 | 0 | 2021-02-22 13:25:00 | nan | nan | nan | nan | nan | NaT |
5 | 6 | 2 | 0 | 0 | 0 | 2021-02-22 13:25:00 | nan | nan | nan | nan | nan | NaT |
6 | nan | nan | nan | nan | nan | NaT | 1 | 0 | 0 | 0 | 0 | 2021-02-22 13:25:00 |
7 | nan | nan | nan | nan | nan | NaT | 2 | 0 | 0 | 0 | 0 | 2021-02-22 13:27:00 |
df2
| | 0 |
---|---|
2021-02-22 13:22:00 | 1 |
2021-02-22 13:23:00 | 3 |
2021-02-22 13:24:00 | 1 |
2021-02-22 13:25:00 | 4 |
2021-02-22 13:27:00 | 1 |
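The result above merges both sides into one table, so 13:25:00 shows 4 (3 on the left plus 1 on the right). If separate left and right tables are wanted, as in the question's expected output, one possible variation is sketched below. The helper name `nonzero_counts` is hypothetical, and it assumes that "count" means the number of non-zero counter cells per timestamp, which is an interpretation rather than something the question confirms.

```python
import io

import pandas as pd

csv_text = """LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00"""

df = pd.read_csv(io.StringIO(csv_text))
df["Ltime"] = pd.to_datetime(df["Ltime"])
df["Rtime"] = pd.to_datetime(df["Rtime"])


def nonzero_counts(frame, side):
    """Count non-zero counter cells per timestamp for one side ('L' or 'R').

    Hypothetical helper: assumes columns are named '<side>counter...' and '<side>time'.
    """
    time_col = f"{side}time"
    counter_cols = [c for c in frame.columns if c.startswith(f"{side}counter")]
    # Keep only the rows that belong to this side (its timestamp is present).
    sub = frame.dropna(subset=[time_col])
    # Per row, count cells greater than zero, then add those up per timestamp.
    return sub[counter_cols].gt(0).sum(axis=1).groupby(sub[time_col]).sum()


left = nonzero_counts(df, "L")
right = nonzero_counts(df, "R")
print(left)
print(right)
```

Keeping the two sides apart avoids adding left and right counts that happen to share a timestamp.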
Comments:
Thanks. I added an image to describe what I want.