Adding value of zero for missing rows after a group by
Posted: 2021-12-29 14:32:59
Question:
I have a pandas DataFrame; suppose my data tracks the people who come into my store. My data looks like this:
Name | month | day | hour |
---|---|---|---|
Albert | October | 31 | 5 |
John | October | 31 | 6 |
Jane | October | 31 | 6 |
Albert | October | 31 | 8 |
Albert | October | 31 | 9 |
John | October | 31 | 10 |
Jane | October | 31 | 11 |
Albert | October | 31 | 12 |
John | October | 31 | 12 |
Emily | October | 31 | 12 |
Albert | October | 31 | 20 |
Emily | October | 31 | 21 |
John | October | 31 | 23 |
Jane | October | 31 | 23 |
Albert | November | 1 | 5 |
John | November | 1 | 6 |
Jane | November | 1 | 6 |
Albert | November | 1 | 8 |
Albert | November | 1 | 9 |
John | November | 1 | 10 |
Jane | November | 1 | 11 |
Albert | November | 1 | 12 |
John | November | 1 | 12 |
Emily | November | 1 | 12 |
Rose | November | 1 | 15 |
Elizabeth | November | 1 | 16 |
Karen | November | 1 | 16 |
Albert | November | 1 | 20 |
Emily | November | 1 | 21 |
John | November | 1 | 23 |
Jane | November | 1 | 23 |
I group by month, day, and hour to get the number of customers, and I get:
count | month | day | hour |
---|---|---|---|
1 | October | 31 | 5 |
2 | October | 31 | 6 |
1 | October | 31 | 8 |
1 | October | 31 | 9 |
1 | October | 31 | 10 |
1 | October | 31 | 11 |
3 | October | 31 | 12 |
1 | October | 31 | 20 |
1 | October | 31 | 21 |
2 | October | 31 | 23 |
1 | November | 1 | 5 |
2 | November | 1 | 6 |
1 | November | 1 | 8 |
1 | November | 1 | 9 |
1 | November | 1 | 10 |
1 | November | 1 | 11 |
3 | November | 1 | 12 |
1 | November | 1 | 15 |
2 | November | 1 | 16 |
1 | November | 1 | 20 |
1 | November | 1 | 21 |
2 | November | 1 | 23 |
However, I would also like to add rows with a count of 0 to the data, and get something like this:
count | month | day | hour |
---|---|---|---|
0 | October | 31 | 0 |
0 | October | 31 | 1 |
0 | October | 31 | 2 |
0 | October | 31 | 3 |
0 | October | 31 | 4 |
1 | October | 31 | 5 |
2 | October | 31 | 6 |
1 | October | 31 | 8 |
1 | October | 31 | 9 |
1 | October | 31 | 10 |
1 | October | 31 | 11 |
3 | October | 31 | 12 |
1 | October | 31 | 20 |
1 | October | 31 | 21 |
2 | October | 31 | 23 |
0 | November | 1 | 0 |
0 | November | 1 | 1 |
0 | November | 1 | 2 |
0 | November | 1 | 3 |
0 | November | 1 | 4 |
1 | November | 1 | 5 |
2 | November | 1 | 6 |
1 | November | 1 | 8 |
1 | November | 1 | 9 |
1 | November | 1 | 10 |
1 | November | 1 | 11 |
3 | November | 1 | 12 |
1 | November | 1 | 15 |
2 | November | 1 | 16 |
1 | November | 1 | 20 |
1 | November | 1 | 21 |
2 | November | 1 | 23 |
Is there a way to do this programmatically?
Comments:
Please provide your input data as code.
Answer 1:
I suggest:
import calendar  # only used to convert month numbers back to month names
import pandas as pd
# Create a new DataFrame counting people for each hour
# (note that the 'Name' column is renamed to 'count')
hourly_data = df.groupby(['month', 'day', 'hour'], as_index=False).count().rename(columns={'Name': 'count'})
# Create a datetime column
# use pd.to_datetime on a string for the date and hour
# note that two things are inserted into the string: 1/ a year (2021), 2/ minutes (":00")
hourly_data['DH'] = pd.to_datetime(hourly_data['day'].map(str) + " " + hourly_data['month'].map(str) + " " + '2021' + " " + hourly_data['hour'].map(str) + ":00")
# Keep only the needed columns
hourly_data = hourly_data[['DH', 'count']]
# Create the missing rows by setting an index with a hourly frequency
# note sorting to avoid some errors
hourly_data = hourly_data.set_index('DH').sort_index().asfreq('h')
# Fill the missing values for the created rows with 0
# (assign the result back instead of calling fillna in place on a column selection)
hourly_data['count'] = hourly_data['count'].fillna(0)
# Now we can revert count to int
hourly_data['count'] = hourly_data['count'].astype(int)
# Create columns from the index
hourly_data['year'], hourly_data['month'], hourly_data['day'], hourly_data['hour'] = hourly_data.index.year, hourly_data.index.month, hourly_data.index.day, hourly_data.index.hour
# and convert the month number back to its name
hourly_data['month'] = hourly_data['month'].apply(lambda i:calendar.month_name[i])
# Reorder columns
hourly_data = hourly_data[['year','month', 'day', 'hour', 'count']]
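The key step above is asfreq('h'), which creates a row for every hour missing between the first and last timestamps of the index. Here is a minimal sketch of that step in isolation, assuming a small made-up Series (the name counts and its values are illustrative, not the asker's data); it also uses the fill_value argument of asfreq as a possible shortcut for the separate fillna and astype calls:

```python
import pandas as pd

# Hypothetical toy counts at three observed hours (07:00 is missing)
counts = pd.Series(
    [1, 2, 1],
    index=pd.to_datetime(
        ["2021-10-31 05:00", "2021-10-31 06:00", "2021-10-31 08:00"]
    ),
    name="count",
)

# asfreq("h") builds an hourly index between the first and last timestamps;
# fill_value is the value given to the rows that did not exist before
filled = counts.asfreq("h", fill_value=0)
print(filled)
```

Note that asfreq only fills gaps between the earliest and latest existing timestamps, so hours before the first observation or after the last one are not created.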
Assuming the following data:
import io
s = """
Name month day hour
Albert October 31 5
John October 31 6
Jane October 31 6
Albert October 31 8
Jane October 31 23
Albert November 1 5
John November 1 6
Jane November 1 6
Albert November 1 8
Albert November 1 9
John November 1 10
Jane November 1 23
"""
df = pd.read_csv(io.StringIO(s), sep=r'\s+')
The resulting hourly_data:
year month day hour count
DH
2021-10-31 05:00:00 2021 October 31 5 1
2021-10-31 06:00:00 2021 October 31 6 2
2021-10-31 07:00:00 2021 October 31 7 0
2021-10-31 08:00:00 2021 October 31 8 1
2021-10-31 09:00:00 2021 October 31 9 0
2021-10-31 10:00:00 2021 October 31 10 0
2021-10-31 11:00:00 2021 October 31 11 0
2021-10-31 12:00:00 2021 October 31 12 0
2021-10-31 13:00:00 2021 October 31 13 0
2021-10-31 14:00:00 2021 October 31 14 0
2021-10-31 15:00:00 2021 October 31 15 0
2021-10-31 16:00:00 2021 October 31 16 0
2021-10-31 17:00:00 2021 October 31 17 0
2021-10-31 18:00:00 2021 October 31 18 0
2021-10-31 19:00:00 2021 October 31 19 0
2021-10-31 20:00:00 2021 October 31 20 0
2021-10-31 21:00:00 2021 October 31 21 0
2021-10-31 22:00:00 2021 October 31 22 0
2021-10-31 23:00:00 2021 October 31 23 1
2021-11-01 00:00:00 2021 November 1 0 0
2021-11-01 01:00:00 2021 November 1 1 0
2021-11-01 02:00:00 2021 November 1 2 0
2021-11-01 03:00:00 2021 November 1 3 0
2021-11-01 04:00:00 2021 November 1 4 0
2021-11-01 05:00:00 2021 November 1 5 1
2021-11-01 06:00:00 2021 November 1 6 2
2021-11-01 07:00:00 2021 November 1 7 0
2021-11-01 08:00:00 2021 November 1 8 1
2021-11-01 09:00:00 2021 November 1 9 1
2021-11-01 10:00:00 2021 November 1 10 1
2021-11-01 11:00:00 2021 November 1 11 0
2021-11-01 12:00:00 2021 November 1 12 0
2021-11-01 13:00:00 2021 November 1 13 0
2021-11-01 14:00:00 2021 November 1 14 0
2021-11-01 15:00:00 2021 November 1 15 0
2021-11-01 16:00:00 2021 November 1 16 0
2021-11-01 17:00:00 2021 November 1 17 0
2021-11-01 18:00:00 2021 November 1 18 0
2021-11-01 19:00:00 2021 November 1 19 0
2021-11-01 20:00:00 2021 November 1 20 0
2021-11-01 21:00:00 2021 November 1 21 0
2021-11-01 22:00:00 2021 November 1 22 0
2021-11-01 23:00:00 2021 November 1 23 1
and hourly_data has the following type (output of hourly_data.info()):
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43 entries, 2021-10-31 05:00:00 to 2021-11-01 23:00:00
Freq: H
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 43 non-null int64
1 month 43 non-null object
2 day 43 non-null int64
3 hour 43 non-null int64
4 count 43 non-null int64
dtypes: int64(4), object(1)
memory usage: 2.0+ KB
Note above the index type, DatetimeIndex, and its frequency, Freq: H.
The index is kept because it is useful for some downstream processing.
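The output above starts at 05:00 on October 31, because that is the first observed hour, whereas the question's desired output also includes hours 0 to 4 of that day. One possible way to cover whole days is to reindex against an explicit hourly range running from midnight of the first day to 23:00 of the last day. This is a sketch under assumptions: the Series hourly below is a made-up stand-in for the 'count' column indexed by 'DH', and the names start, end, and full_index are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the counts indexed by the 'DH' datetime column
hourly = pd.Series(
    [1, 2, 1],
    index=pd.to_datetime(
        ["2021-10-31 05:00", "2021-10-31 06:00", "2021-11-01 23:00"]
    ),
    name="count",
)

# Hourly index from midnight of the first day to 23:00 of the last day
start = hourly.index.min().normalize()
end = hourly.index.max().normalize() + pd.Timedelta(hours=23)
full_index = pd.date_range(start, end, freq="h", name="DH")

# Hours absent from the original index are created with a count of 0
full = hourly.reindex(full_index, fill_value=0)
print(full.head(7))
```

In the answer's pipeline this would stand in for the asfreq('h') step, applied while the frame still holds only the 'count' column, that is, before the year, month, day, and hour columns are derived from the index.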