How to calculate the active dates in every month or in a range of months
[Posted]: 2021-09-21 12:57:29
[Question]: I have a DataFrame like:
Student_id | actvity_timestamp |
---|---|
1001 | 2019-09-05 08:26:12 |
1001 | 2019-09-06 09:26:12 |
1001 | 2019-09-21 10:11:01 |
1001 | 2019-10-24 11:44:01 |
1001 | 2019-10-25 11:31:01 |
1001 | 2019-10-26 12:13:01 |
1002 | 2019-09-11 12:21:01 |
1002 | 2019-09-12 13:11:01 |
1002 | 2019-11-23 16:22:01 |
I want output like:
Student_id | total_active_days_in_Sept | total_active_days_in_Oct | total_active_days_in_Nov |
---|---|---|---|
1001 | 3 | 3 | 0 |
1002 | 2 | 0 | 1 |
How can I do this (the output columns must be the months taken from actvity_timestamp)?
[Comments]:
Does this answer your question? pandas dataframe groupby datetime month
[Answer 1]: You could try something like this:
import pandas as pd

df = pd.DataFrame.from_dict({
    "Student_id": [1001, 1001, 1001, 1001, 1001, 1001, 1002, 1002, 1002],
    "actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})
# Parse the strings once; the .dt accessor only works on datetime data.
timestamps = pd.to_datetime(df.actvity_timestamp)
months = timestamps.dt.strftime("%B")
# values/aggfunc make a Student_id that was active more than once on a
# single day count that day only once. (Thanks to @tlentali)
result = pd.crosstab(
    df.Student_id,
    months,
    values=timestamps.dt.date,
    aggfunc=pd.Series.nunique
).fillna(0).astype(int)  # missing student/month pairs become 0; cast back to int
Series.dt.strftime works on a datetime Series, and %B formats each datetime to show only the full month name.
result will yield
actvity_timestamp November October September
Student_id
1001 0 3 3
1002 1 0 2
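To see why the values and aggfunc arguments matter, here is a minimal sketch. The data is hypothetical (it is not the question's table): student 1001 is active twice on the same day. A plain crosstab counts rows, whereas nunique over the dates counts distinct days.

```python
import pandas as pd

# Hypothetical data: student 1001 is active twice on 2019-09-05.
df = pd.DataFrame({
    "Student_id": [1001, 1001, 1001],
    "actvity_timestamp": [
        "2019-09-05 08:26:12",
        "2019-09-05 17:00:00",
        "2019-09-06 09:26:12",
    ],
})
timestamps = pd.to_datetime(df["actvity_timestamp"])
months = timestamps.dt.strftime("%B")

# Without values/aggfunc, crosstab counts every row (every activity).
row_counts = pd.crosstab(df["Student_id"], months)

# With values/aggfunc, each calendar day is counted once per student and month.
day_counts = pd.crosstab(
    df["Student_id"],
    months,
    values=timestamps.dt.date,
    aggfunc=pd.Series.nunique,
)

print(row_counts.loc[1001, "September"])  # 3 activities
print(day_counts.loc[1001, "September"])  # 2 distinct days
```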
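[Comments]:
This answer is very concise and Pythonic. By the way, I think we want to count the number of unique dates rather than the number of activity_timestamp values. To complete your answer, we can specify an aggfunc, for example: pd.crosstab(df.Student_id, pd.to_datetime(df.activity_timestamp).dt.strftime("%B"), values=df.activity_timestamp.dt.date, aggfunc=pd.Series.nunique).fillna(0)
You're right, I'll include that too, thanks!
[Answer 2]: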
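You can get the desired layout (with the column names sorted in calendar order, 'Sep' -> 'Oct' -> 'Nov', rather than 'Nov' -> 'Oct' -> 'Sep') in the following steps:
1) Create a column with the month short name. Then reshape the DataFrame with .pivot_table(), aggregating the number of unique activity dates per month under each Student_id: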
df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp']) # to datetime format
df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b') # get month short name
df['activity_date'] = df['actvity_timestamp'].dt.date # get activity dates
df_out = (df.pivot_table(index='Student_id', # group under each student id
columns='activity_month', # month short name as new columns
values='activity_date', # aggregate on dates
aggfunc='nunique', #activities on the same date counted once
fill_value=0)
.rename_axis(columns=None)
)
Nov Oct Sep
Student_id
1001 0 3 3
1002 1 0 2
2) Sort the month short-name columns back into calendar order using .sort_index with its key parameter, as follows:
df_out = df_out.sort_index(axis=1, key=lambda x: pd.to_datetime(x, format='%b').month)
Sep Oct Nov
Student_id
1001 3 3 0
1002 2 0 1
3) Use .add_prefix() to turn this into the desired layout:
df_out = df_out.add_prefix('total_active_days_in_').reset_index()
Result:
print(df_out)
Student_id total_active_days_in_Sep total_active_days_in_Oct total_active_days_in_Nov
0 1001 3 3 0
1 1002 2 0 1
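As an alternative to the sort key in step 2, the columns can be picked in calendar order from the standard library's calendar.month_abbr. This is only a sketch under a few assumptions: the counts table is typed in by hand to mirror the output of step 1, the name df_ordered is made up for illustration, and the locale is assumed to be English so that the abbreviations match what '%b' produced.

```python
import calendar
import pandas as pd

# Counts as produced by step 1 (columns in alphabetical order).
df_out = pd.DataFrame(
    {"Nov": [0, 1], "Oct": [3, 0], "Sep": [3, 2]},
    index=pd.Index([1001, 1002], name="Student_id"),
)

# calendar.month_abbr[1:] lists 'Jan'..'Dec' for the current locale
# (English assumed here); keep only the months present in the table.
calendar_order = [m for m in calendar.month_abbr[1:] if m in df_out.columns]

df_ordered = (
    df_out[calendar_order]
    .add_prefix("total_active_days_in_")
    .reset_index()
)
print(df_ordered)
```

Selecting columns from a fixed month list avoids re-parsing the names, but like the '%b' approach it only makes sense when the data stays within a single year.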
[Comments]:
[Answer 3]: Starting from your DataFrame:
>>> import pandas as pd
>>> # Row 1 repeats 2019-09-05 so that one day appears twice for student 1001.
>>> df = pd.DataFrame({'Student_id': [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1002, 1002, 1002],
...                    'activity_timestamp': ['2019-09-05 08:26:12', '2019-09-05 08:26:13', '2019-09-06 09:26:12', '2019-09-21 10:11:01', '2019-10-24 11:44:01', '2019-10-25 11:31:01', '2019-10-26 12:13:01', '2019-09-11 12:21:01', '2019-09-12 13:11:01', '2019-11-23 16:22:01']},
...                   index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> df
Student_id activity_timestamp
0 1001 2019-09-05 08:26:12
1 1001 2019-09-05 08:26:13
2 1001 2019-09-06 09:26:12
3 1001 2019-09-21 10:11:01
4 1001 2019-10-24 11:44:01
5 1001 2019-10-25 11:31:01
6 1001 2019-10-26 12:13:01
7 1002 2019-09-11 12:21:01
8 1002 2019-09-12 13:11:01
9 1002 2019-11-23 16:22:01
We convert activity_timestamp to datetime, then extract the date and the month name like this:
>>> df['activity_timestamp'] = pd.to_datetime(df['activity_timestamp'], format='%Y-%m-%d %H:%M:%S')
>>> df['date'] = df['activity_timestamp'].dt.date
>>> df['month'] = df['activity_timestamp'].dt.month_name()
>>> df
Student_id activity_timestamp date month
0 1001 2019-09-05 08:26:12 2019-09-05 September
1 1001 2019-09-05 08:26:13 2019-09-05 September
2 1001 2019-09-06 09:26:12 2019-09-06 September
3 1001 2019-09-21 10:11:01 2019-09-21 September
4 1001 2019-10-24 11:44:01 2019-10-24 October
5 1001 2019-10-25 11:31:01 2019-10-25 October
6 1001 2019-10-26 12:13:01 2019-10-26 October
7 1002 2019-09-11 12:21:01 2019-09-11 September
8 1002 2019-09-12 13:11:01 2019-09-12 September
9 1002 2019-11-23 16:22:01 2019-11-23 November
Then we use the pivot_table() method with the nunique function instead of count to get the number of unique dates:
>>> df_result = (df.pivot_table(index='Student_id',
... columns='month',
... values='date',
... aggfunc=pd.Series.nunique,
... fill_value=0).rename_axis(columns=None)).add_prefix('total_active_days_in_').reset_index(drop=False)
>>> df_result
Student_id total_active_days_in_November total_active_days_in_October total_active_days_in_September
0 1001 0 3 3
1 1002 1 0 2
Thanks to @SeaBean for the add_prefix method.
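Month names alone would merge, say, September 2019 with September 2020 if the data covered more than one year. The following is a minimal sketch for that case, not part of the answer above: it assumes one column per year-month is wanted, uses hypothetical data, and the names counts and all_months are chosen for illustration. It groups on dt.to_period('M') and reindexes over the full month range so that months with no activity still appear as 0.

```python
import pandas as pd

# Hypothetical data spanning two years; 1001 is active twice on 2019-12-30.
df = pd.DataFrame({
    "Student_id": [1001, 1001, 1001, 1002],
    "activity_timestamp": [
        "2019-12-30 08:00:00",
        "2019-12-30 19:30:00",
        "2020-02-01 10:00:00",
        "2020-02-15 12:00:00",
    ],
})
ts = pd.to_datetime(df["activity_timestamp"])
df["date"] = ts.dt.date
df["period"] = ts.dt.to_period("M")  # year-month, e.g. 2019-12

# Unique active days per student and year-month.
counts = df.groupby(["Student_id", "period"])["date"].nunique().unstack(fill_value=0)

# Add every month in the observed range, including months with no activity.
all_months = pd.period_range(df["period"].min(), df["period"].max(), freq="M")
counts = counts.reindex(columns=all_months, fill_value=0)

counts.columns = [f"total_active_days_in_{p.strftime('%Y_%m')}" for p in counts.columns]
counts = counts.reset_index()
print(counts)
```

Because periods sort chronologically, the columns come out in calendar order without a separate sorting step.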
[Comments]: