Using pandas Series.rolling with a DateOffset


【Title】: Using pandas Series.rolling with DateOffset 【Posted】: 2017-08-27 19:33:15 【Question】:

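This is a Python / pandas / data analysis question.

What I am trying to do is find the busiest 60-minute interval in a large set of Apache server logs. I have extracted the timestamps from the logs into a list.

time_received is a list with values like: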

[
1995-07-01T00:01:18-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:46-04:00,
1995-07-01T00:13:47-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:50-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:14:11-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:18-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:23-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:26-04:00,
1995-07-01T00:14:27-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:31-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:36-04:00,
]

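My goal is to walk along this list of timestamps and get the number of requests in the 60-minute interval starting at any one of them. Once I have a rolling window working, I think I can handle the rest.

In the pandas documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html) I found the following entry for the window parameter: "window : int or offset. Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size. If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0"

I am using pandas 0.19.2, and a variable-sized window based on the observations that fall within a time period sounds like exactly what I want. So I tried to implement it: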

import pandas as pd
from pandas.tseries.offsets import DateOffset
def busiest_timeframe(data,timeframe = 60):    
    time_window = DateOffset(minutes = 60)
    print (type(time_window))
    series = pd.Series(data)
    series.rolling(time_window).count()
    return series  

busiest_tf = busiest_timeframe(time_received)    

I get the following error: raise ValueError("window must be an integer")

ValueError: window must be an integer

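Is there some other offset object I should be using? Does this pandas feature not work? Am I misreading the documentation?

Thanks in advance for any help and suggestions!

【Comments】:

series.rolling iterates over a number of observations, not a time window, so the first argument must be an integer. You may be looking for a resampler rather than a window: series.resample('60min').count(). A resampler does not roll, though; it only splits your series into 60-minute groups. (DYZ)

The pandas docs say "If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period" and "This is new in 0.19.0".

Is your pandas at least 0.19.0?

I am using pandas 0.19.2; I checked pd.__version__.

【Answer 1】: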

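Unfortunately I don't know how to do this with series.rolling. It looks like you did not set the timestamps as the index, which is why it does not work. Even after doing that I still got errors, so here is an alternative (probably a really ugly one). If someone else has a better way, go with theirs.

Yes, it uses boolean indexing. Play around with the code (lots of print statements); you can adjust the >= and <= comparisons if needed.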

liste=[
"1995-07-01T00:01:18-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:46-04:00",
"1995-07-01T00:13:47-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:50-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:14:11-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:18-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:23-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:26-04:00",
"1995-07-01T00:14:27-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:31-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:36-04:00"
]
import pandas as pd

def busiest_timeframe(data, timeframe=1):
    # Parse the ISO 8601 strings, UTC offset included (maybe you don't need to_datetime here, I did).
    series = pd.to_datetime(pd.Series(data))
    df = series.to_frame(name="time")
    window = pd.Timedelta(seconds=timeframe)  # change seconds to minutes or whatever you want
    # For every timestamp x, count the rows that fall inside [x, x + window].
    df["count"] = [((df["time"] >= x) & (df["time"] <= x + window)).sum() for x in df["time"]]
    # Start of the first window that holds the most rows (.loc replaces the removed .ix).
    start = df.loc[df["count"].idxmax(), "time"]
    # Return every row inside that busiest window.
    return df[(df["time"] >= start) & (df["time"] <= start + window)]

print(busiest_timeframe(liste))
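
The boolean-indexing loop above compares every timestamp against the whole column, which gets slow on a large log. As a rough sketch of a faster alternative (not part of the original answer), the same forward-looking counts can be obtained by sorting the timestamps once and using binary search. The function name forward_window_counts and the sample list are made up for illustration, and converting everything to UTC is an assumption that keeps the comparison timezone-safe.

```python
import numpy as np
import pandas as pd

def forward_window_counts(timestamps, window="60min"):
    """For each timestamp t, count the events inside [t, t + window]."""
    # Parse to UTC and sort; .values gives a plain numpy datetime64 array.
    values = np.sort(pd.to_datetime(timestamps, utc=True).values)
    width = pd.Timedelta(window).to_timedelta64()
    # First position holding t (duplicates of t are included in the count).
    left = np.searchsorted(values, values, side="left")
    # One past the last position holding a value <= t + window.
    right = np.searchsorted(values, values + width, side="right")
    return pd.Series(right - left, index=pd.DatetimeIndex(values, tz="UTC"))

# Hypothetical sample; the last timestamp is invented to fall outside the first hour.
sample = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T01:30:00-04:00",
]
counts = forward_window_counts(sample)
print(counts)
print("busiest window starts at", counts.idxmax())
```

Both bounds are inclusive here, matching the >= and <= comparisons in the loop; switching the side arguments changes that.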

【Comments】:

【Answer 2】:

Try using an offset alias instead of a DateOffset:

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Example from the docs:

import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                  index = [pd.Timestamp('20130101 09:00:00'),
                           pd.Timestamp('20130101 09:00:02'),
                           pd.Timestamp('20130101 09:00:03'),
                           pd.Timestamp('20130101 09:00:05'),
                           pd.Timestamp('20130101 09:00:06')])

print(df.rolling('2s').count())

Output:

                       B
2013-01-01 09:00:00  1.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  1.0
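
Applied to the log timestamps from the question, the same idea might look like the sketch below. It assumes the timestamps are ISO 8601 strings, puts them in the index (the offset form of window is only valid for a datetimelike index), and uses a dummy value of 1 per request. The last timestamp in the shortened list is invented for illustration. Note that each window covers the 60 minutes ending at a row, not starting at it, so the busiest window is identified by its last request.

```python
import pandas as pd

# Shortened sample; the final entry is hypothetical and falls outside the first hour.
time_received = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T00:13:43-04:00",
    "1995-07-01T01:30:00-04:00",
]

# The timestamps become the index; every request carries the value 1.
index = pd.to_datetime(time_received, utc=True)
hits = pd.Series(1, index=index).sort_index()

# Offset alias "60min": a variable-sized window spanning the previous 60 minutes.
per_window = hits.rolling("60min").count()

busiest_end = per_window.idxmax()
print(per_window)
print("busiest window ends at", busiest_end, "with", int(per_window.max()), "requests")
```

Rows that share a timestamp are counted cumulatively in row order, so the last of a group of duplicates carries the full count for that instant.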

【Comments】:
