Python/Pandas Binning Data Timedelta
[Posted]: 2018-04-06 10:01:32
[Question]: I have a DataFrame with two columns:
    userID  duration
0  DSm7ysk  03:08:49
1  no51CdJ  00:35:50
2  ...
The 'duration' column is of type timedelta. I tried using
import datetime as dt
import pandas as pd

# Irregular bin edges expressed as timedelta objects
bins = [dt.timedelta(minutes=0), dt.timedelta(minutes=5), dt.timedelta(minutes=10),
        dt.timedelta(minutes=20), dt.timedelta(minutes=30), dt.timedelta(hours=4)]
labels = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']
df['bins'] = pd.cut(df['duration'], bins, labels=labels)
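However, the binned data does not use the specified bins; instead, a separate bin is created for each duration in the frame.
What is the simplest way to bin timedelta objects into irregular bins? Or am I just missing something obvious here?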
[Answer 1]: You can normalize the durations to seconds before binning. This reduces the problem to binning plain numbers.
import pandas as pd

df = pd.DataFrame({'userID': ['A', 'B'],
                   'duration': pd.to_timedelta(['00:08:49', '00:35:50'])})

# Convert the bin edges to seconds so that pd.cut works on plain numbers
L = ['00:00:00', '00:05:00', '00:10:00', '00:20:00', '00:30:00', '04:00:00']
bins = pd.to_timedelta(L).total_seconds()
cats = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

df['bins'] = pd.cut(df['duration'].dt.total_seconds(), bins, labels=cats)
print(df)
# (exact formatting and column order vary by pandas version)
#   userID  duration     bins
# 0      A  00:08:49  5-10min
# 1      B  00:35:50   30min+
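One caveat with fixed edges: a duration above the last edge (4 hours here) falls outside every bin and comes back as NaN, and a duration of exactly zero is excluded because `pd.cut` intervals are right-closed by default. A minimal sketch of one way to handle both, assuming an open-ended last bin is wanted; the sample durations and the use of `np.inf` as the final edge are illustrative choices, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: zero, a mid-range value, and one beyond 4 hours.
df = pd.DataFrame({
    'userID': ['A', 'B', 'C'],
    'duration': pd.to_timedelta(['00:00:00', '00:08:49', '05:00:00']),
})

# Bin edges in seconds; np.inf leaves the last bin open-ended (assumption).
edges = [0, 300, 600, 1200, 1800, np.inf]
cats = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

# include_lowest=True keeps a duration of exactly 0 seconds in the first bin.
df['bins'] = pd.cut(df['duration'].dt.total_seconds(), edges,
                    labels=cats, include_lowest=True)
print(df['bins'].tolist())
# ['0-5min', '5-10min', '30min+']
```

Whether an open-ended bin is appropriate depends on the data; keeping the 4-hour cap instead makes out-of-range durations visible as NaN.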
[Answer 2]: It works for me with pandas 0.23.4:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'userID': ['DSm7ysk', 'no51CdJ', 'foo', 'bar'],
    'duration': [pd.Timedelta('3 hours 8 minutes 49 seconds'),
                 pd.Timedelta('35 minutes 50 seconds'),
                 pd.Timedelta('1 minutes 13 seconds'),
                 pd.Timedelta('6 minutes 43 seconds')]
})
# Bin edges given directly as Timedelta objects
bins = [
    pd.Timedelta(minutes=0),
    pd.Timedelta(minutes=5),
    pd.Timedelta(minutes=10),
    pd.Timedelta(minutes=20),
    pd.Timedelta(minutes=30),
    pd.Timedelta(hours=4)
]
labels = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']
df['bins'] = pd.cut(df['duration'], bins, labels=labels)
Result:
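The printed output is not preserved here. As a hedged sketch of how the assignment could be checked, the following rebuilds the same four durations and bins them directly as timedeltas. It assumes a pandas version in which `pd.cut` accepts timedelta edges (the answer reports 0.23.4); building the values with `pd.to_timedelta` from strings is a shorthand not taken from the answer:

```python
import pandas as pd

# Same four durations as in the answer, written as HH:MM:SS strings.
df = pd.DataFrame({
    'userID': ['DSm7ysk', 'no51CdJ', 'foo', 'bar'],
    'duration': pd.to_timedelta(['03:08:49', '00:35:50', '00:01:13', '00:06:43']),
})

# Timedelta bin edges: 0, 5, 10, 20, 30 minutes and 4 hours.
bins = pd.to_timedelta(['00:00:00', '00:05:00', '00:10:00',
                        '00:20:00', '00:30:00', '04:00:00'])
labels = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

df['bins'] = pd.cut(df['duration'], bins, labels=labels)
print(df['bins'].tolist())
# ['30min+', '30min+', '0-5min', '5-10min']
```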