Pandas - Interpolate based on previous behavior
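【Title】: Pandas - Interpolate based on previous behavior
【Posted】: 2021-11-14 15:44:58
【Question】:
I have a pandas DataFrame with a datetime index containing hourly data, and the NaNs in it need to be interpolated. Sometimes only a single hour is missing and linear interpolation is good enough, but sometimes the gap spans several days. In that case I need the fill values to reflect how the series behaved, on average, over the previous week. Any ideas on how to do this?
Right now I am using df.interpolate(method="linear"), but I need to change it so that larger gaps (more than 3 consecutive hours) are filled from the average of the past 7 days (where data exists). For example, if 2017-07-02 04:00:00 is missing and is part of a larger gap, it should be filled with the mean of the values observed at 4 a.m. on each day between 2017-06-25 and 2017-07-02.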
Here is a sample data set:
D31 D32
time
2017-07-01 00:00:00 118.0 118.0
2017-07-01 01:00:00 126.0 126.0
2017-07-01 02:00:00 96.0 np.nan
2017-07-01 03:00:00 88.0 88.0
2017-07-01 04:00:00 76.0 76.0
2017-07-01 05:00:00 60.0 60.0
2017-07-01 06:00:00 59.0 59.0
2017-07-01 07:00:00 53.0 53.0
2017-07-01 08:00:00 54.0 54.0
2017-07-01 09:00:00 47.0 47.0
2017-07-01 10:00:00 48.0 48.0
2017-07-01 11:00:00 56.0 56.0
2017-07-01 12:00:00 65.0 65.0
2017-07-01 13:00:00 57.0 57.0
2017-07-01 14:00:00 46.0 46.0
2017-07-01 15:00:00 39.0 39.0
2017-07-01 16:00:00 24.0 24.0
2017-07-01 17:00:00 22.0 22.0
2017-07-01 18:00:00 np.nan 28.0
2017-07-01 19:00:00 np.nan 25.0
2017-07-01 20:00:00 38.0 38.0
2017-07-01 21:00:00 52.0 52.0
2017-07-01 22:00:00 123.0 123.0
2017-07-01 23:00:00 np.nan np.nan
2017-07-02 00:00:00 np.nan np.nan
2017-07-02 01:00:00 np.nan np.nan
2017-07-02 02:00:00 np.nan np.nan
2017-07-02 03:00:00 np.nan np.nan
2017-07-02 04:00:00 np.nan np.nan
2017-07-02 05:00:00 np.nan np.nan
2017-07-02 06:00:00 np.nan np.nan
2017-07-02 07:00:00 np.nan np.nan
2017-07-02 08:00:00 np.nan np.nan
2017-07-02 09:00:00 np.nan np.nan
2017-07-02 10:00:00 np.nan np.nan
2017-07-02 11:00:00 np.nan np.nan
2017-07-02 12:00:00 np.nan np.nan
2017-07-02 13:00:00 np.nan np.nan
2017-07-02 14:00:00 np.nan np.nan
2017-07-02 15:00:00 np.nan np.nan
2017-07-02 16:00:00 np.nan np.nan
2017-07-02 17:00:00 np.nan np.nan
2017-07-02 18:00:00 np.nan 28.0
2017-07-02 19:00:00 np.nan 25.0
2017-07-02 20:00:00 38.0 38.0
2017-07-02 21:00:00 52.0 52.0
2017-07-02 22:00:00 123.0 123.0
2017-07-02 23:00:00 130.0 131.0
2017-07-03 00:00:00 115.0 118.0
2017-07-03 01:00:00 126.0 128.0
2017-07-03 02:00:00 96.0 np.nan
2017-07-03 03:00:00 86.0 88.0
2017-07-03 04:00:00 77.0 75.0
2017-07-03 05:00:00 60.0 60.0
2017-07-03 06:00:00 61.0 59.0
2017-07-03 07:00:00 57.0 53.0
2017-07-03 08:00:00 55.0 52.0
2017-07-03 09:00:00 47.0 48.0
2017-07-03 10:00:00 42.0 43.0
2017-07-03 11:00:00 56.0 57.0
2017-07-03 12:00:00 68.0 62.0
2017-07-03 13:00:00 56.0 57.0
2017-07-03 14:00:00 47.0 42.0
2017-07-03 15:00:00 33.0 37.0
2017-07-03 16:00:00 27.0 25.0
2017-07-03 17:00:00 24.0 20.0
2017-07-03 18:00:00 np.nan 28.0
2017-07-03 19:00:00 42.0 42.0
2017-07-03 20:00:00 42.0 42.0
2017-07-03 21:00:00 33.0 33.0
2017-07-03 22:00:00 35.0 35.0
2017-07-03 23:00:00 59.0 59.0
Thanks!
Edit:
Based on W-M's answer, I was able to write the following code, which does what I want.
import numpy as np
import pandas as pd

def interpolate_obs(df):
    def long_nan_series(series):
        # select this series when all values are NaNs
        all_nans = series.isnull().all()
        # and the delta is too long for interpolation
        # note: taking the last value minus the first,
        # so this is the delta between the last NaN
        # value and the first NaN value - there's
        # an hour duration more until the next non-null value
        too_long = series.index[-1] - series.index[0] > pd.Timedelta("3 hours")
        return too_long & all_nans

    def get_average_value(series, mean_value, date):
        result = np.nan
        days = 0        # window half-width for this column: 3, 6, 9, 12, 15
        days_mean = -1  # window half-width for the all-column mean: 1, 3, 5, 7, 9
        while result != result:  # NaN is the only value not equal to itself
            days += 3
            ## If nothing is found within a month (15 days before, 15 days after) then stop
            if days > 15:
                return np.nan
            ## Values of this column within +/- `days` days of the missing timestamp
            timedelta = pd.Timedelta("%s days" % days)
            working_data = series.loc[date - timedelta:date + timedelta]
            ## Get only the ones that are the same hour
            working_data = working_data[working_data.index.hour == date.hour]
            result = working_data.mean()
            if result == result:
                return result
            ## Fall back to the mean across all columns within +/- `days_mean` days
            days_mean += 2
            timedelta = pd.Timedelta("%s days" % days_mean)
            working_data = mean_value.loc[date - timedelta:date + timedelta]
            ## Get only the ones that are the same hour
            working_data = working_data[working_data.index.hour == date.hour]
            result = working_data.mean()
            if result == result:
                return result

    mean_value = df.mean(axis=1)
    for col in df.columns:
        series = df[col]
        df_nan_group_keys = series.isnull().diff().ne(0).cumsum()
        # astype(bool) guards against transform() returning a non-boolean dtype
        series_long_nans = series.groupby(df_nan_group_keys).transform(long_nan_series).astype(bool)
        ## No long gaps in this column: plain linear interpolation is enough
        if not series_long_nans.any():
            df[col] = series.interpolate(method="linear")
            continue
        ## Small gaps
        series_small_gaps = series[~series_long_nans]
        series_small_interp = series_small_gaps.interpolate(method="linear")
        ## Long gaps. Groups them by gaps: a new group starts wherever
        ## the step from the previous long-gap row is not exactly one hour
        series_long_gaps = series[series_long_nans]
        time_dif = series_long_gaps.index.to_series().diff()
        time_dif = (time_dif != pd.Timedelta("1h")).cumsum()
        ## Retrieve each gap and find the new values
        both = pd.concat([series_long_gaps, time_dif], axis=1)
        both.columns = [col, "group"]
        series_long_interp = []
        for group, df_group in both.groupby("group"):
            series_long_interp.append(df_group.apply(lambda x: get_average_value(series, mean_value, x.name), axis=1))
        series_long_interp = pd.concat(series_long_interp)
        df[col] = pd.concat([series_small_interp, series_long_interp]).sort_index()
    return df
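【Comments】:
By the way, the given data is corrupted. See the stock-market example here for how to create a sample DataFrame that others can use: ***.com/a/30424537/463796
【Answer 1】:
Here is a solution for filling different blocks of NaNs depending on their length: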
# count up by one every time NaN-state flips
d31_nan_group_keys = df.D31.isnull().diff().ne(0).cumsum()

def long_nan_series(series):
    # select this series when all values are NaNs
    all_nans = series.isnull().all()
    # and the delta is too long for interpolation
    # note: taking the last value minus the first,
    # so this is the delta between the last NaN
    # value and the first NaN value - there's
    # an hour duration more until the next non-null value
    too_long = series.index[-1] - series.index[0] > pd.Timedelta("3 hours")
    return too_long & all_nans

# select NaN blocks too long for interpolation
# (astype(bool) guards against transform() returning a non-boolean dtype)
d31_long_nans = df.D31.groupby(d31_nan_group_keys).transform(long_nan_series).astype(bool)
# interpolation method
d31_interp = df.D31.interpolate(method="linear")
# TODO: other fill method depending on last week
d31_other = df.D31.ffill()
# mix them together
d31_res = d31_other.where(d31_long_nans, d31_interp)
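I leave it to you to compute d31_other so that it fills the NaNs another way (from last week's values), since that is a different question. If you get stuck trying to implement it, I suggest asking a new question.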
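As a rough illustration of the step left open for `d31_other`, the sketch below fills short gaps linearly and long gaps with the mean of the values observed at the same hour during the preceding 7 days. The helper name `fill_long_gaps_with_weekly_hour_mean`, its parameters `max_short_gap` and `lookback_days`, and the synthetic `demo` series are all assumptions made for this example; none of them come from the thread.

```python
import numpy as np
import pandas as pd


def fill_long_gaps_with_weekly_hour_mean(series, max_short_gap=3, lookback_days=7):
    """Hypothetical helper: linear interpolation for short gaps,
    same-hour mean of the previous `lookback_days` days for long ones."""
    is_nan = series.isnull()
    # label each run of consecutive NaN / non-NaN rows
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # number of rows in the run each row belongs to
    run_len = is_nan.groupby(run_id).transform("size")
    long_gap = is_nan & (run_len > max_short_gap)

    # start from linear interpolation, then overwrite the long gaps
    result = series.interpolate(method="linear")
    for ts in series.index[long_gap]:
        # only observed values from the original series enter the mean
        window = series.loc[ts - pd.Timedelta(days=lookback_days):ts]
        same_hour = window[window.index.hour == ts.hour]
        result.loc[ts] = same_hour.mean()  # stays NaN if nothing was observed
    return result


# synthetic hourly data (assumption): value = hour of day squared
idx = pd.date_range("2017-06-24", periods=9 * 24, freq=pd.Timedelta(hours=1))
demo = pd.Series(idx.hour.to_numpy(dtype=float) ** 2, index=idx)
# a 10-hour gap and a 1-hour gap
demo.loc[pd.Timestamp("2017-07-02 03:00"):pd.Timestamp("2017-07-02 12:00")] = np.nan
demo.loc[pd.Timestamp("2017-06-26 05:00")] = np.nan

filled = fill_long_gaps_with_weekly_hour_mean(demo)
```

The loop is deliberately simple. For large frames a vectorised variant may be preferable, and a different fallback may be wanted when the look-back window itself contains no data for that hour.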
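【Comments】:
Thanks! Your answer helped me lay the groundwork, and I was able to complete the rest of the code.
【Answer 2】:
df.interpolate(method="time") should solve your problem. Just make sure your index is a datetime index (df.index = pd.to_datetime(df.index)).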
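For context, here is a small sketch of what `method="time"` does compared with `method="linear"`, using a made-up three-point series (the values and timestamps are assumptions for this example). `method="linear"` treats rows as equally spaced, while `method="time"` weights by the elapsed time in the index. On a regular hourly index the two give the same result.

```python
import numpy as np
import pandas as pd

# irregularly spaced timestamps: 00:00, 01:00, 04:00
idx = pd.to_datetime(["2017-07-01 00:00", "2017-07-01 01:00", "2017-07-01 04:00"])
s = pd.Series([0.0, np.nan, 40.0], index=idx)

# ignores the index: the missing value becomes the midpoint, 20.0
by_position = s.interpolate(method="linear")
# uses the index: 1 hour out of 4 has elapsed, so the value is 10.0
by_time = s.interpolate(method="time")
```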