Pandas DataFrame groupby, count and sum across columns


【Posted】: 2021-05-24 12:48:34 【Question】:

I have a dataset like the one shown below. It contains vehicle counts accumulated over time.

[Image: the expected output]

LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime

1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00

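I created a Pandas DataFrame and want to group by Ltime and Rtime to get the total number of vehicles regardless of category (for example, the total number of vehicles in the left lane (L) in a given time period, and the total number of vehicles in the right lane (R) in a given time period).

Here is what I tried: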

data = pd.read_csv('output2.txt')
data['Ltime'] = pd.to_datetime(data['Ltime'].str.strip())
data['Rtime'] = pd.to_datetime(data['Rtime'].str.strip())
data.info()
#   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   LcounterCar         6 non-null      float64       
 1   LcounterTruck       6 non-null      float64       
 2   LcounterBus         6 non-null      float64       
 3   LcounterMotorcycle  6 non-null      float64       
 4   LcounterVan         6 non-null      float64       
 5   Ltime               6 non-null      datetime64[ns]
 6   RcounterCar         2 non-null      float64       
 7   RcounterTruck       2 non-null      float64       
 8   RcounterBus         2 non-null      float64       
 9   RcounterMotorcycle  2 non-null      float64       
 10  RcounterVan         2 non-null      float64       
 11  Rtime               2 non-null      datetime64[ns]

data.groupby('Ltime')['LcounterCar'].count().reset_index()

          Ltime     LcounterTruck
0   2021-02-22 13:22:00     1
1   2021-02-22 13:23:00     2
2   2021-02-22 13:24:00     1
3   2021-02-22 13:25:00     2

However, the count is always the same. Instead, the following is my expected output.

Ltime, count
13:22:00, 1
13:23:00, 3 (two cars and one truck)
13:24:00, 1
13:25:00, 3

Rtime, count
13:25:00, 1
13:27:00, 1

【Comments】:

Is your sample data correct? LcounterCar has the values 1,2,3,4,5,6

【Answer 1】:

Your data is not consistent with what you describe.

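- Treat left and right as separate datasets.
- What you describe is sum() rather than count(), so sum() is used.
- unstack() the columns, so that it becomes a straightforward groupby(level=1).count().
- Handle 0 and NaN so that they are not counted.
- Use concat() to pull left and right together.
- Compute the final values.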
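The point about 0 and NaN is easy to miss, so here is a minimal sketch of it on a toy column (the values are hypothetical, not taken from the question). It assumes only that `count()` skips NaN but not 0, which is why zeros need to be masked before counting.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): two zeros, one real reading, one missing value.
s = pd.Series([0, 0, 3, np.nan])

# count() ignores NaN but treats 0 as a valid entry, so zeros inflate the result.
counted_as_is = s.count()

# Turning zeros into NaN first means only the non-zero reading is counted.
counted_masked = s.replace({0: np.nan}).count()

print(counted_as_is, counted_masked)
```

This is also why counting a single column per `Ltime` group returns the number of rows in each group rather than the number of vehicles.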

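import io

import numpy as np
import pandas as pd

# Load the sample data from the question as inline CSV text
df = pd.read_csv(io.StringIO("""LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,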
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00"""))

df.Ltime = pd.to_datetime(df.Ltime)
df.Rtime = pd.to_datetime(df.Rtime)

df2 = pd.concat([
    (df.loc[:,[c for c in df.columns if c[0]==side]]
     .dropna().set_index(f"{side}time")
     .unstack().replace({0: np.nan}).groupby(level=1).count())
    for side in list("LR")]).groupby(level=0).sum()

df

LcounterCar LcounterTruck LcounterBus LcounterMotorcycle LcounterVan Ltime RcounterCar RcounterTruck RcounterBus RcounterMotorcycle RcounterVan Rtime
0 1 0 0 0 0 2021-02-22 13:22:00 nan nan nan nan nan NaT
1 2 0 0 0 0 2021-02-22 13:23:00 nan nan nan nan nan NaT
2 3 1 0 0 0 2021-02-22 13:23:00 nan nan nan nan nan NaT
3 4 0 0 0 0 2021-02-22 13:24:00 nan nan nan nan nan NaT
4 5 0 0 0 0 2021-02-22 13:25:00 nan nan nan nan nan NaT
5 6 2 0 0 0 2021-02-22 13:25:00 nan nan nan nan nan NaT
6 nan nan nan nan nan NaT 1 0 0 0 0 2021-02-22 13:25:00
7 nan nan nan nan nan NaT 2 0 0 0 0 2021-02-22 13:27:00

df2

0
2021-02-22 13:22:00 1
2021-02-22 13:23:00 3
2021-02-22 13:24:00 1
2021-02-22 13:25:00 4
2021-02-22 13:27:00 1
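The result above merges both lanes into one series, so 13:25:00 shows 4 (3 left plus 1 right), whereas the question's expected output lists each lane separately. Below is a rough sketch of a per-lane variant. It is not the answer author's method: the helper name `vehicles_per_timestamp` is made up for illustration, and it assumes, like the answer, that "count" means the number of non-zero counter cells per timestamp.

```python
import io

import pandas as pd

csv_text = """LcounterCar,LcounterTruck,LcounterBus,LcounterMotorcycle,LcounterVan,Ltime,RcounterCar,RcounterTruck,RcounterBus,RcounterMotorcycle,RcounterVan,Rtime
1,0,0,0,0,2021-02-22 13:22:00,,,,,
2,0,0,0,0,2021-02-22 13:23:00,,,,,
3,1,0,0,0,2021-02-22 13:23:00,,,,,
4,0,0,0,0,2021-02-22 13:24:00,,,,,
5,0,0,0,0,2021-02-22 13:25:00,,,,,
6,2,0,0,0,2021-02-22 13:25:00,,,,,
,,,,,,1,0,0,0,0,2021-02-22 13:25:00
,,,,,,2,0,0,0,0,2021-02-22 13:27:00"""

df = pd.read_csv(io.StringIO(csv_text))
df["Ltime"] = pd.to_datetime(df["Ltime"])
df["Rtime"] = pd.to_datetime(df["Rtime"])


def vehicles_per_timestamp(frame, side):
    """Count non-zero counter cells per timestamp for one lane ('L' or 'R').

    Hypothetical helper, for illustration only.
    """
    counters = [c for c in frame.columns if c.startswith(f"{side}counter")]
    # gt(0) is False for both 0 and NaN, so neither contributes to the row total.
    nonzero_per_row = frame[counters].gt(0).sum(axis=1)
    # groupby drops rows whose key is NaT, i.e. the rows of the other lane.
    return nonzero_per_row.groupby(frame[f"{side}time"]).sum()


left = vehicles_per_timestamp(df, "L")
right = vehicles_per_timestamp(df, "R")
print(left)
print(right)
```

Keeping the two lanes as separate series avoids adding the left and right totals together when both lanes share a timestamp such as 13:25:00.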

【Comments】:

Thanks. I have added an image to describe what I want.
