Adding a value of zero for missing rows after a group by

Posted: 2021-12-29 14:32:59

Question:

I have a pandas DataFrame. Suppose my data tracks the people who enter my store. It looks like this:

Name month day hour
Albert October 31 5
John October 31 6
Jane October 31 6
Albert October 31 8
Albert October 31 9
John October 31 10
Jane October 31 11
Albert October 31 12
John October 31 12
Emily October 31 12
Albert October 31 20
Emily October 31 21
John October 31 23
Jane October 31 23
Albert November 1 5
John November 1 6
Jane November 1 6
Albert November 1 8
Albert November 1 9
John November 1 10
Jane November 1 11
Albert November 1 12
John November 1 12
Emily November 1 12
Rose November 1 15
Elizabeth November 1 16
Karen November 1 16
Albert November 1 20
Emily November 1 21
John November 1 23
Jane November 1 23

I group by month, day and hour to get the number of customers, which gives:

count month day hour
1 October 31 5
2 October 31 6
1 October 31 8
1 October 31 9
1 October 31 10
1 October 31 11
3 October 31 12
1 October 31 20
1 October 31 21
2 October 31 23
1 November 1 5
2 November 1 6
1 November 1 8
1 November 1 9
1 November 1 10
1 November 1 11
3 November 1 12
1 November 1 15
2 November 1 16
1 November 1 20
1 November 1 21
2 November 1 23
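
The question does not show the grouping code itself. A minimal sketch of one way to produce counts like those above, assuming the column names from the sample data; the small DataFrame and the name `counts` are illustrative only:

```python
import pandas as pd

# Small hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    "Name": ["Albert", "John", "Jane", "Albert"],
    "month": ["October", "October", "October", "October"],
    "day": [31, 31, 31, 31],
    "hour": [5, 6, 6, 8],
})

# size() counts the rows in each group; sort=False keeps the groups
# in order of first appearance instead of sorting month names alphabetically.
counts = (
    df.groupby(["month", "day", "hour"], sort=False)
      .size()
      .reset_index(name="count")
)
print(counts)
```

Only hours with at least one visitor appear in `counts`, which is why the empty hours are missing from the table.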

However, I would also like the hours with no customers to appear with a count of 0, giving something like this:

count month day hour
0 October 31 0
0 October 31 1
0 October 31 2
0 October 31 3
0 October 31 4
1 October 31 5
2 October 31 6
1 October 31 8
1 October 31 9
1 October 31 10
1 October 31 11
3 October 31 12
1 October 31 20
1 October 31 21
2 October 31 23
0 November 1 0
0 November 1 1
0 November 1 2
0 November 1 3
0 November 1 4
1 November 1 5
2 November 1 6
1 November 1 8
1 November 1 9
1 November 1 10
1 November 1 11
3 November 1 12
1 November 1 15
2 November 1 16
1 November 1 20
1 November 1 21
2 November 1 23

Is there a way to do this programmatically?

Comments:

Please provide your input data as code.

Answer 1:

I suggest:

import calendar  # only used to convert the month number back to its name

import pandas as pd

# Count the people for each (month, day, hour) group
# and rename the resulting 'Name' column to 'count'
hourly_data = df.groupby(['month', 'day', 'hour'], as_index=False).count().rename(columns={'Name': 'count'})

# Create a datetime column by parsing a string built from day, month and hour.
# The data has no year or minutes, so '2021' and ':00' are inserted.
hourly_data['DH'] = pd.to_datetime(hourly_data['day'].map(str) + " " + hourly_data['month'].map(str) + " " + '2021' + " " + hourly_data['hour'].map(str) + ":00", format='%d %B %Y %H:%M')

# Keep only the needed columns
hourly_data = hourly_data[['DH', 'count']]

# Create the missing rows by setting an index with a hourly frequency
# note sorting to avoid some errors
hourly_data = hourly_data.set_index('DH').sort_index().asfreq('h')

# Fill the missing values of the created rows with 0
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
hourly_data['count'] = hourly_data['count'].fillna(0)

# Now we can revert count to int
hourly_data['count'] = hourly_data['count'].astype(int)

# Create columns from the index
hourly_data['year'], hourly_data['month'], hourly_data['day'],  hourly_data['hour'] = hourly_data.index.year, hourly_data.index.month, hourly_data.index.day, hourly_data.index.hour

# and convert the month number back to its name
hourly_data['month'] = hourly_data['month'].apply(lambda i: calendar.month_name[i])

# Reorder columns
hourly_data = hourly_data[['year','month', 'day', 'hour', 'count']] 

Assuming the following data:

import io

s = """
Name   month    day hour
Albert October  31  5
John   October  31  6
Jane   October  31  6
Albert October  31  8
Jane   October  31  23
Albert November 1   5
John   November 1   6
Jane   November 1   6
Albert November 1   8
Albert November 1   9
John   November 1   10
Jane   November 1   23
"""
df = pd.read_csv(io.StringIO(s), sep=r'\s+')

hourly_data is then:

                     year     month  day  hour  count
DH                                                   
2021-10-31 05:00:00  2021   October   31     5      1
2021-10-31 06:00:00  2021   October   31     6      2
2021-10-31 07:00:00  2021   October   31     7      0
2021-10-31 08:00:00  2021   October   31     8      1
2021-10-31 09:00:00  2021   October   31     9      0
2021-10-31 10:00:00  2021   October   31    10      0
2021-10-31 11:00:00  2021   October   31    11      0
2021-10-31 12:00:00  2021   October   31    12      0
2021-10-31 13:00:00  2021   October   31    13      0
2021-10-31 14:00:00  2021   October   31    14      0
2021-10-31 15:00:00  2021   October   31    15      0
2021-10-31 16:00:00  2021   October   31    16      0
2021-10-31 17:00:00  2021   October   31    17      0
2021-10-31 18:00:00  2021   October   31    18      0
2021-10-31 19:00:00  2021   October   31    19      0
2021-10-31 20:00:00  2021   October   31    20      0
2021-10-31 21:00:00  2021   October   31    21      0
2021-10-31 22:00:00  2021   October   31    22      0
2021-10-31 23:00:00  2021   October   31    23      1
2021-11-01 00:00:00  2021  November    1     0      0
2021-11-01 01:00:00  2021  November    1     1      0
2021-11-01 02:00:00  2021  November    1     2      0
2021-11-01 03:00:00  2021  November    1     3      0
2021-11-01 04:00:00  2021  November    1     4      0
2021-11-01 05:00:00  2021  November    1     5      1
2021-11-01 06:00:00  2021  November    1     6      2
2021-11-01 07:00:00  2021  November    1     7      0
2021-11-01 08:00:00  2021  November    1     8      1
2021-11-01 09:00:00  2021  November    1     9      1
2021-11-01 10:00:00  2021  November    1    10      1
2021-11-01 11:00:00  2021  November    1    11      0
2021-11-01 12:00:00  2021  November    1    12      0
2021-11-01 13:00:00  2021  November    1    13      0
2021-11-01 14:00:00  2021  November    1    14      0
2021-11-01 15:00:00  2021  November    1    15      0
2021-11-01 16:00:00  2021  November    1    16      0
2021-11-01 17:00:00  2021  November    1    17      0
2021-11-01 18:00:00  2021  November    1    18      0
2021-11-01 19:00:00  2021  November    1    19      0
2021-11-01 20:00:00  2021  November    1    20      0
2021-11-01 21:00:00  2021  November    1    21      0
2021-11-01 22:00:00  2021  November    1    22      0
2021-11-01 23:00:00  2021  November    1    23      1
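
The result above starts at the first observed hour (05:00 on October 31), while the desired output in the question also lists hours 0 to 4 of the first day. A sketch of one way to cover every day in full, assuming the year 2021 as in the answer; the names `counts`, `full_hours` and `padded` and the sample values are illustrative, not part of the original answer:

```python
import pandas as pd

# Hypothetical hourly counts indexed by timestamp (year 2021 assumed, as in the answer).
counts = pd.Series(
    [1, 2, 1],
    index=pd.to_datetime(["2021-10-31 05:00", "2021-10-31 06:00", "2021-11-01 23:00"]),
    name="count",
)

# Every hour from midnight of the first day through 23:00 of the last day.
start = counts.index.min().normalize()
end = counts.index.max().normalize() + pd.Timedelta(hours=23)
full_hours = pd.date_range(start, end, freq="h")

# reindex adds the missing hours; fill_value=0 gives them a zero count
# and keeps the integer dtype, so no later astype(int) is needed.
padded = counts.reindex(full_hours, fill_value=0)
print(padded.head(7))
```

This swaps `asfreq` for an explicit `reindex` against a full range, which lets the range begin and end on day boundaries rather than on the first and last observation.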

and hourly_data.info() reports:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43 entries, 2021-10-31 05:00:00 to 2021-11-01 23:00:00
Freq: H
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    43 non-null     int64 
 1   month   43 non-null     object
 2   day     43 non-null     int64 
 3   hour    43 non-null     int64 
 4   count   43 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 2.0+ KB

Note the index type above, DatetimeIndex, and its frequency, Freq: H. The index is kept because it is useful for some downstream processing.
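
As an illustration of the kind of downstream processing an hourly DatetimeIndex allows, here is a short sketch. The frame below is a hypothetical stand-in for the answer's `hourly_data` with made-up counts, and `daily_totals` and `nov_first` are names chosen for this example:

```python
import pandas as pd

# Hypothetical stand-in for hourly_data: an hourly DatetimeIndex and a count column.
idx = pd.date_range("2021-10-31 05:00", "2021-11-01 23:00", freq="h")
hourly_data = pd.DataFrame({"count": 0}, index=idx)
hourly_data.loc[pd.Timestamp("2021-10-31 06:00"), "count"] = 2
hourly_data.loc[pd.Timestamp("2021-11-01 09:00"), "count"] = 1

# Aggregate the hourly counts into daily totals.
daily_totals = hourly_data["count"].resample("D").sum()

# Partial-string indexing selects all rows of a single day.
nov_first = hourly_data.loc["2021-11-01"]

print(daily_totals)
print(len(nov_first))
```

Both operations rely on the index being a DatetimeIndex; with plain month, day and hour columns they would need an extra groupby or filter.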
