Pandas - 根据之前的行为进行插值

Posted

技术标签:

【中文标题】Pandas - 根据之前的行为进行插值【英文标题】:Pandas - Interpolate based on previous behavior 【发布时间】:2021-11-14 15:44:58 【问题描述】:

我有一个日期时间索引 pandas,其中包含需要在 nans 存在时进行插值的每小时数据。有时缺少 1 小时,线性插值就足够了,但有时可能需要几天,在这种情况下,我需要它来考虑它在上周的平均行为来填充这些值。关于如何做到这一点的任何想法?

现在我正在使用df.interpolate(method="linear"),但我需要修复它,以便在存在较大差距(连续超过 3 小时)时考虑过去 7 天(有数据)的平均值。因此,例如,如果2017-07-02 04:00:00 缺失,并且是较大差距的一部分,则应使用2017-06-25 和“2017-07-02”之间每天凌晨 4 点出现的平均值填充该值.

这是一个示例数据集

                      D31    D32                                                                                                                       
time                                                                                                                                                    
2017-07-01 00:00:00  118.0  118.0                                                                                                                       
2017-07-01 01:00:00  126.0  126.0                                                                                                                       
2017-07-01 02:00:00   96.0   np.nan                                                                                                                       
2017-07-01 03:00:00   88.0   88.0                                                                                                                       
2017-07-01 04:00:00   76.0   76.0                                                                                                                       
2017-07-01 05:00:00   60.0   60.0                                                                                                                       
2017-07-01 06:00:00   59.0   59.0                                                                                                                       2017-07-01 07:00:00   53.0   53.0                                                                                                                          
2017-07-01 08:00:00   54.0   54.0                                                                                                                       
2017-07-01 09:00:00   47.0   47.0                                                                                                                       
2017-07-01 10:00:00   48.0   48.0                                                                                                                       
2017-07-01 11:00:00   56.0   56.0                                                                                                                       
2017-07-01 12:00:00   65.0   65.0                                                                                                                       
2017-07-01 13:00:00   57.0   57.0                                                                                                                       
2017-07-01 14:00:00   46.0   46.0                                                                                                                       
2017-07-01 15:00:00   39.0   39.0                                                                                                                       
2017-07-01 16:00:00   24.0   24.0                                                                                                                       
2017-07-01 17:00:00   22.0   22.0                                                                                                                       
2017-07-01 18:00:00   np.nan   28.0                                                                                                                       
2017-07-01 19:00:00   np.nan   25.0                                                                                                                       
2017-07-01 20:00:00   38.0   38.0                                                                                                                       
2017-07-01 21:00:00   52.0   52.0                                                                                                                       
2017-07-01 22:00:00  123.0  123.0  
2017-07-01 23:00:00  np.nan  np.nan  
2017-07-02 00:00:00  np.nan  np.nan                                                                                                                        
2017-07-02 01:00:00  np.nan  np.nan                                                                                                                        
2017-07-02 02:00:00  np.nan  np.nan                                                                                                                        
2017-07-02 03:00:00  np.nan  np.nan                                                                                                                        
2017-07-02 04:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 05:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 06:00:00  np.nan  np.nan                                                                                                                       2017-07-02 07:00:00  np.nan  np.nan                                                                                                                           
2017-07-02 08:00:00  np.nan  np.nan                                                                                                                         
2017-07-02 09:00:00  np.nan  np.nan                                                                                                                        
2017-07-02 10:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 11:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 12:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 13:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 14:00:00  np.nan  np.nan                                                                                                                        
2017-07-02 15:00:00  np.nan  np.nan                                                                                                                       
2017-07-02 16:00:00  np.nan  np.nan                                                                                                                         
2017-07-02 17:00:00   np.nan  np.nan                                                                                                                         
2017-07-02 18:00:00   np.nan   28.0                                                                                                                       
2017-07-02 19:00:00   np.nan   25.0                                                                                                                       
2017-07-02 20:00:00   38.0   38.0                                                                                                                       
2017-07-02 21:00:00   52.0   52.0                                                                                                                       
2017-07-02 22:00:00  123.0  123.0      
2017-07-02 23:00:00  130.0  131.0  
2017-07-03 00:00:00  115.0  118.0                                                                                                                       
2017-07-03 01:00:00  126.0  128.0                                                                                                                       
2017-07-03 02:00:00   96.0   np.nan                                                                                                                       
2017-07-03 03:00:00   86.0   88.0                                                                                                                       
2017-07-03 04:00:00   77.0   75.0                                                                                                                       
2017-07-03 05:00:00   60.0   60.0                                                                                                                       
2017-07-03 06:00:00   61.0   59.0                                                                                                                       2017-07-03 07:00:00   57.0   53.0                                                                                                                          
2017-07-03 08:00:00   55.0   52.0                                                                                                                       
2017-07-03 09:00:00   47.0   48.0                                                                                                                       
2017-07-03 10:00:00   42.0   43.0                                                                                                                       
2017-07-03 11:00:00   56.0   57.0                                                                                                                       
2017-07-03 12:00:00   68.0   62.0                                                                                                                       
2017-07-03 13:00:00   56.0   57.0                                                                                                                       
2017-07-03 14:00:00   47.0   42.0                                                                                                                       
2017-07-03 15:00:00   33.0   37.0                                                                                                                       
2017-07-03 16:00:00   27.0   25.0                                                                                                                       
2017-07-03 17:00:00   24.0   20.0                                                                                                                       
2017-07-03 18:00:00   np.nan   28.0                                                       
2017-07-03 19:00:00   42.0   42.0                                                                                                                       
2017-07-03 20:00:00   42.0   42.0                                                                                                                       
2017-07-03 21:00:00   33.0   33.0                                                                                                                       
2017-07-03 22:00:00   35.0   35.0                                                                                                                       
2017-07-03 23:00:00   59.0   59.0 

谢谢!

编辑:

基于 W-M 响应,我能够执行以下代码,它适用于我想要的。

def interpolate_obs(df):
    def long_nan_series(series):
        # select this series when all values are NaNs
        all_nans = series.isnull().all()

        # and the delta is too long for interpolation
        # note: taking the last value minus the first,
        #       so this is the delta between the last NaN
        #       value and the first NaN value - there's
        #       an hour duration more until the next non-null value
        too_long = series.index[-1] - series.index[0] > pd.Timedelta("3 hours")
        return too_long & all_nans

    def get_average_value(series, mean_value, date):
        result=np.nan
        days=0       #(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)  
        days_mean=-1 #(1, 3, 5,  7,  9, 11, 13, 15, 17, 19)
        while result!=result:
            days+=3
            ## If nothing is found within a month (15 days before, 15 days after) then stop
            if days>15:
                return np.nan
            ## Get last week
            timedelta=pd.Timedelta("%s days"%days)
            working_data=series.loc[date-timedelta:date+timedelta]
            ## Get only the ones that are the same hour
            working_data=working_data[working_data.index.hour==date.hour]

            result=working_data.mean()
            if result==result:
                return result

            ## Get surrounding three days
            days_mean+=2
            timedelta=pd.Timedelta("%s days"%days_mean)
            working_data=mean_value.loc[date-timedelta:date+timedelta]
            ## Get only the ones that are the same hour
            working_data=working_data[workinreg_data.index.hour==date.hour]

            result=working_data.mean()
            if result==result:
                return result



    mean_value=df.mean(axis=1)
    for col in df.columns:
        series=df[col]
        df_nan_group_keys = series.isnull().diff().ne(0).cumsum()
        series_long_nans = series.groupby(df_nan_group_keys).transform(long_nan_series)
        
        ## Small gaps 
        series_small_gaps=series[~series_long_nans]  
        series_small_interp = series_small_gaps.interpolate(method="linear")

        ## Long gaps. Groups them by gaps
        series_long_gaps=series[series_long_nans]
        time_dif=series_long_gaps.index.to_series().diff()
        time_dif[time_dif>pd.Timedelta("1H")]=np.nan 
        time_dif=time_dif.replace(pd.Timedelta("1H"), 0).replace(np.nan, 1)  
        time_dif=time_dif.astype(int).cumsum()
        


        ## Retrieve each gap and find the new values
        both=pd.concat([series_long_gaps, time_dif], axis=1)
        both.columns=[col, "group"]
        series_long_interp=[]
        for group, df_group in both.groupby("group"):
            series_long_interp.append(df_group.apply(lambda x: get_average_value(series, mean_value, x.name), axis=1))
        series_long_interp=pd.concat(series_long_interp)


        df[col]=pd.concat([series_small_interp, series_long_interp]).sort_index() 

    return df


【问题讨论】:

顺便说一句,给定的数据已损坏。在此处查看股市示例,了解如何创建其他人可以使用的示例 DataFrame:***.com/a/30424537/463796 【参考方案1】:

这里有一个关于如何填充不同 NaN 块的解决方案,具体取决于它们的长度:

# count up by one every time NaN-state flips
d31_nan_group_keys = df.D31.isnull().diff().ne(0).cumsum()

def long_nan_series(series):
    # select this series when all values are NaNs
    all_nans = series.isnull().all()

    # and the delta is too long for interpolation
    # note: taking the last value minus the first,
    #       so this is the delta between the last NaN
    #       value and the first NaN value - there's
    #       an hour duration more until the next non-null value
    too_long = series.index[-1] - series.index[0] > pd.Timedelta("3 hours")
    return too_long & all_nans

# select NaN blocks too long for interpolation
d31_long_nans = df.D31.groupby(d31_nan_group_keys).transform(select_too_long)

# interpolation method
d31_interp = df.D31.interpolate(method="linear")

# TODO: other fill method depending on last week
d31_other = df.D31.fillna(method="ffill")

# mix them together
d31_res = d31_other.where(d31_long_nans, d31_interp)

我留给你计算d31_other 以另一种方式(从上周的值)填充 NaN,因为这是一个不同的问题。如果您在尝试实施时遇到困难,我建议您提出一个新问题。

【讨论】:

谢谢!您的回答帮助我奠定了基础,我能够完成其余的代码。【参考方案2】:

df.interpolate(method="time") 应该可以解决您的问题 只需确保您在日期时间中的索引 (df>index = pd.to_datetime(df>index)

【讨论】:

以上是关于Pandas - 根据之前的行为进行插值的主要内容,如果未能解决你的问题,请参考以下文章

使用带有浮点数的 pandas reindex:插值

pandas使用interpolate函数并设置method参数使用不同的插值算法对缺失值进行填充

Pandas - 在多个时间序列组内插值/插补缺失值

技巧 Pandas 数据填充

pandas入门

在 Python 中的某些 Dataframe 列上进行插补