Shifting a column in a multi-index DataFrame with missing dates

Posted 2023-03-12

Original title: Shifting column in multiindex dataframe with missing dates
Asked: 2020-09-04 02:26:13

Question:

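I want to shift a column in a multi-index DataFrame so that I can fit a regression model with a lagged independent variable. Because my time series has missing dates, I only want a value shifted when the previous calendar day is present in the data. The df looks like this: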

                cost
ID  day
1   31.01.2020  0
1   03.02.2020  0
1   04.02.2020  0.12
1   05.02.2020  0
1   06.02.2020  0
1   07.02.2020  0.08
1   10.02.2020  0
1   11.02.2020  0
1   12.02.2020  0.03
1   13.02.2020  0.1
1   14.02.2020  0

The desired output looks like this:

                cost   cost_lag
ID  day
1   31.01.2020  0      NaN
1   03.02.2020  0      NaN
1   04.02.2020  0.12   0
1   05.02.2020  0      0.12
1   06.02.2020  0      0
1   07.02.2020  0.08   0
1   10.02.2020  0      NaN
1   11.02.2020  0      0
1   12.02.2020  0.03   0
1   13.02.2020  0.1    0.03
1   14.02.2020  0      0.1

Based on this answer to a similar question, I tried the following:

df['cost_lag'] = df.groupby(['id'])['cost'].shift(1)[df.reset_index().day == df.reset_index().day.shift(1) + datetime.timedelta(days=1)]

But this raises an error that I don't understand:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
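For context, the mask in that attempt is built from df.reset_index(), so it carries a fresh 0..n-1 index, while the shifted Series still carries the (ID, day) MultiIndex. pandas cannot align the two, which is what the IndexingError reports. One possible way around it is to build the mask on the same index as df. The sketch below assumes the day level holds real datetimes; it rebuilds the frame from the sample data, and the helper names day, prev_day and is_consecutive are illustrative.

```python
import pandas as pd

# Sample data from the question, with the day level parsed as datetimes.
days = pd.to_datetime(
    ["31.01.2020", "03.02.2020", "04.02.2020", "05.02.2020", "06.02.2020",
     "07.02.2020", "10.02.2020", "11.02.2020", "12.02.2020", "13.02.2020",
     "14.02.2020"],
    format="%d.%m.%Y",
)
index = pd.MultiIndex.from_arrays([[1] * len(days), days], names=["ID", "day"])
df = pd.DataFrame(
    {"cost": [0, 0, 0.12, 0, 0, 0.08, 0, 0, 0.03, 0.1, 0]}, index=index
)

# The day of each row as a Series that shares df's MultiIndex,
# so every later comparison aligns with df row by row.
day = df.index.get_level_values("day").to_series(index=df.index)
prev_day = day.groupby(level="ID").shift(1)

# True only where the previous row of the same ID is exactly one day earlier.
is_consecutive = (day - prev_day) == pd.Timedelta(days=1)

# Shift within each ID, then blank out lags that cross a gap.
df["cost_lag"] = df.groupby(level="ID")["cost"].shift(1).where(is_consecutive)
print(df)
```

This variant never materialises the missing days; it only checks the distance to the previous row within each ID.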

I also tried to fill in the missing dates with the approach suggested here:

ams_spend_ranking_df = ams_spend_ranking_df.index.get_level_values(1).apply(lambda x: datetime.datetime(x, 1, 1))

This again raises an error that doesn't tell me much:

AttributeError: 'DatetimeIndex' object has no attribute 'apply'
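The AttributeError arises because pandas Index objects, including DatetimeIndex, provide map rather than apply. As a rough sketch of filling in missing days for a single series, one could reindex against a full pd.date_range. The three-row sample and the variable names here are illustrative and not taken from the original code.

```python
import pandas as pd

# Illustrative single-ID series with a weekend gap.
days = pd.DatetimeIndex(["2020-01-31", "2020-02-03", "2020-02-04"], name="day")
cost = pd.Series([0.0, 0.0, 0.12], index=days, name="cost")

# Index objects are transformed element-wise with .map, not .apply.
weekdays = days.map(lambda ts: ts.day_name())

# Reindexing against every calendar day inserts NaN rows for the gaps.
full_range = pd.date_range(days.min(), days.max(), freq="D", name="day")
cost_daily = cost.reindex(full_range)
print(cost_daily)
```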

Long story short: how can I shift the cost column by 1 day and get NaN wherever I have no data for the previous day?

Answer 1:

You can add all the missing datetimes with DataFrameGroupBy.resample and Resampler.asfreq:

df1 = df.reset_index(level=0).groupby(['ID'])['cost'].resample('D').asfreq()
print (df1)
ID  day       
1   2020-01-31    0.00
    2020-02-01     NaN
    2020-02-02     NaN
    2020-02-03    0.00
    2020-02-04    0.12
    2020-02-05    0.00
    2020-02-06    0.00
    2020-02-07    0.08
    2020-02-08     NaN
    2020-02-09     NaN
    2020-02-10    0.00
    2020-02-11    0.00
    2020-02-12    0.03
    2020-02-13    0.10
    2020-02-14    0.00
Name: cost, dtype: float64

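Then your approach with DataFrameGroupBy.shift works as needed, because assigning the shifted Series back to df aligns it on the (ID, day) index and drops the filler rows: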

df['cost_lag'] = df1.groupby('ID').shift(1)
print (df)
               cost  cost_lag
ID day                       
1  2020-01-31  0.00       NaN
   2020-02-03  0.00       NaN
   2020-02-04  0.12      0.00
   2020-02-05  0.00      0.12
   2020-02-06  0.00      0.00
   2020-02-07  0.08      0.00
   2020-02-10  0.00       NaN
   2020-02-11  0.00      0.00
   2020-02-12  0.03      0.00
   2020-02-13  0.10      0.03
   2020-02-14  0.00      0.10
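To make the two steps easier to try out, here is a minimal end-to-end sketch. It assumes the day level can be parsed as datetimes (resample needs a DatetimeIndex), and it uses a small made-up frame with two IDs, not the asker's real data, to show that the lag does not leak across IDs. The name daily is illustrative.

```python
import pandas as pd

# Hypothetical sample with two IDs; each has a gap in its dates.
raw = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2, 2],
    "day":  ["31.01.2020", "03.02.2020", "04.02.2020",
             "03.02.2020", "04.02.2020", "06.02.2020"],
    "cost": [0.0, 0.0, 0.12, 0.5, 0.2, 0.3],
})
raw["day"] = pd.to_datetime(raw["day"], format="%d.%m.%Y")
df = raw.set_index(["ID", "day"])

# Step 1: per ID, expand to one row per calendar day (NaN on missing days).
daily = df.reset_index(level="ID").groupby("ID")["cost"].resample("D").asfreq()

# Step 2: shift by one row (= one day) within each ID; assignment aligns
# on the (ID, day) index, so only the originally present days are kept.
df["cost_lag"] = daily.groupby(level="ID").shift(1)
print(df)
```

Rows whose previous calendar day was missing, and the first row of each ID, end up with NaN in cost_lag.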

Comments:

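Thanks @jezrael. Unfortunately, this leads to some strange behavior, as shown here:

               cost  cost_lag
ID day
3  2020-03-23  0,11  0
3  2020-03-24  0     0,11
3  2020-03-25  0     0
3  2020-03-26  0,14  0
3  2020-03-28  0,14  0,14
3  2020-03-29  0,15  0,14
3  2020-03-30  0     0,15
3  2020-04-01  0     0,13

@TiTo - Is the problem with the real data, or also with the sample?

@TiTo - What is your pandas version?

My pandas version is 0.24.2.

OK, I'll update my pandas version and try again. Thanks!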
