Pandas DataFrame groupby, count and sum across columns

Posted 2023-03-11


【Title】: Pandas DataFrame groupby, count and sum across columns 【Posted】: 2021-05-24 12:48:34 【Question】:

I have a dataset like the one shown below. It holds vehicle counts accumulated over time.

Image Describing the Expected Output

LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime

1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00

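I created a Pandas DataFrame, and I want to group by Ltime and Rtime and get the total number of vehicles regardless of category (for example, the total number of vehicles in the left lane (L) in a given time period, and the total number of vehicles in the right lane (R) in a given time period).

Here is what I tried: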

import pandas as pd

data = pd.read_csv('output2.txt')
data['Ltime'] = pd.to_datetime(data['Ltime'].str.strip())
data['Rtime'] = pd.to_datetime(data['Rtime'].str.strip())
data.info()
#   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   LcounterCar         6 non-null      float64       
 1   LcounterTruck       6 non-null      float64       
 2   LcounterBus         6 non-null      float64       
 3   LcounterMotorcycle  6 non-null      float64       
 4   LcounterVan         6 non-null      float64       
 5   Ltime               6 non-null      datetime64[ns]
 6   RcounterCar         2 non-null      float64       
 7   RcounterTruck       2 non-null      float64       
 8   RcounterBus         2 non-null      float64       
 9   RcounterMotorcycle  2 non-null      float64       
 10  RcounterVan         2 non-null      float64       
 11  Rtime               2 non-null      datetime64[ns]

data.groupby('Ltime')['LcounterCar'].count().reset_index()

          Ltime     LcounterCar
0   2021-02-22 13:22:00     1
1   2021-02-22 13:23:00     2
2   2021-02-22 13:24:00     1
3   2021-02-22 13:25:00     2

However, the count is always the same. Here is the output I expect instead.

Ltime, count
13:22:00, 1
13:23:00, 3 (two cars and one truck)
13:24:00, 1
13:25:00, 3

Rtime, count
13:25:00, 1
13:27:00, 1

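【Comments】:

Is your sample data correct? The LcounterCar values are 1,2,3,4,5,6

【Answer 1】:

Your data is not consistent with what you describe.

Treat left and right as separate datasets.
What you describe is sum(), not count(), so sum() has been used for the final value.
unstack() the columns so that it becomes a straightforward groupby(level=1).count().
Handle 0 and NaN so that they are not counted.
Use concat() to pull left and right together.
Calculate the final value.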
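To illustrate why the count() call in the question returns the same number whichever column is chosen: count() tallies non-null cells, so a 0 is counted just like a 1. The sketch below uses a small made-up frame (the name `toy` and its three rows are assumptions for illustration, not the asker's file) and shows how turning 0 into NaN changes the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, loosely modeled on the question's left-lane columns.
toy = pd.DataFrame({
    "Ltime": pd.to_datetime([
        "2021-02-22 13:22:00",
        "2021-02-22 13:23:00",
        "2021-02-22 13:23:00",
    ]),
    "LcounterCar": [1, 2, 3],
    "LcounterTruck": [0, 0, 1],
})

# count() tallies non-null cells, so zeros are included:
# both columns give the number of rows per timestamp.
plain_counts = toy.groupby("Ltime")[["LcounterCar", "LcounterTruck"]].count()

# Replacing 0 with NaN first makes count() skip cells with no vehicle.
nonzero_counts = (
    toy.set_index("Ltime")
       .replace({0: np.nan})
       .groupby(level=0)
       .count()
)

print(plain_counts)
print(nonzero_counts)
```

With the zeros masked, the truck column no longer mirrors the car column, which is the effect the replace step in the answer relies on.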


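import io
import numpy as np
import pandas as pd

# Rebuild the sample data from the question as an in-memory CSV
df = pd.read_csv(io.StringIO("""LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00"""))

df.Ltime = pd.to_datetime(df.Ltime)
df.Rtime = pd.to_datetime(df.Rtime)

# For each side: keep that side's columns, drop the rows that belong to the
# other side, index by time, flatten to a Series, turn 0 into NaN so count()
# skips it, then count per timestamp. Finally add the two sides together.
df2 = pd.concat([
    (df.loc[:, [c for c in df.columns if c[0] == side]]
     .dropna().set_index(f"{side}time")
     .unstack().replace({0: np.nan}).groupby(level=1).count())
    for side in list("LR")]).groupby(level=0).sum()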

df

	LcounterCar	LcounterTruck	LcounterBus	LcounterMotorcycle	LcounterVan	Ltime	RcounterCar	RcounterTruck	RcounterBus	RcounterMotorcycle	RcounterVan	Rtime
0	1	0	0	0	0	2021-02-22 13:22:00	nan	nan	nan	nan	nan	NaT
1	2	0	0	0	0	2021-02-22 13:23:00	nan	nan	nan	nan	nan	NaT
2	3	1	0	0	0	2021-02-22 13:23:00	nan	nan	nan	nan	nan	NaT
3	4	0	0	0	0	2021-02-22 13:24:00	nan	nan	nan	nan	nan	NaT
4	5	0	0	0	0	2021-02-22 13:25:00	nan	nan	nan	nan	nan	NaT
5	6	2	0	0	0	2021-02-22 13:25:00	nan	nan	nan	nan	nan	NaT
6	nan	nan	nan	nan	nan	NaT	1	0	0	0	0	2021-02-22 13:25:00
7	nan	nan	nan	nan	nan	NaT	2	0	0	0	0	2021-02-22 13:27:00

df2

	0
2021-02-22 13:22:00	1
2021-02-22 13:23:00	3
2021-02-22 13:24:00	1
2021-02-22 13:25:00	4
2021-02-22 13:27:00	1
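The result above merges both lanes, so 13:25:00 shows 4 (3 on the left plus 1 on the right). If the lanes should stay separate, as in the expected output in the question, one possible variant is sketched below. It is not the answerer's code: the helper name `lane_counts` is made up here, and it assumes every counter column of a lane starts with `Lcounter` or `Rcounter`. It counts the nonzero counter cells in each row and then sums them per timestamp.

```python
import io
import pandas as pd

# Sample data from the question, with every row padded to 12 fields.
csv_text = """LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00
"""

df = pd.read_csv(io.StringIO(csv_text))
df["Ltime"] = pd.to_datetime(df["Ltime"])
df["Rtime"] = pd.to_datetime(df["Rtime"])


def lane_counts(frame, side):
    """Count nonzero counter cells per timestamp for one lane ('L' or 'R')."""
    time_col = f"{side}time"
    counters = [c for c in frame.columns if c.startswith(f"{side}counter")]
    # Keep only the rows that belong to this lane.
    rows = frame.dropna(subset=[time_col])
    # gt(0) is False for both 0 and NaN, so only cells with a vehicle count.
    per_row = rows[counters].gt(0).sum(axis=1)
    return per_row.groupby(rows[time_col]).sum()


left = lane_counts(df, "L")
right = lane_counts(df, "R")
print(left)
print(right)
```

Keeping the lanes in two Series avoids the merge at 13:25:00, and `pd.concat([left, right]).groupby(level=0).sum()` would recover the combined totals if they are needed.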

【Discussion】:

Thanks. I added an image describing what I want.
