Fill NaN in second level of multi indexed pandas data frame

【Posted】: 2021-12-30 20:26:13 【Question】:

I have a multi-indexed pandas data frame with sensor data that looks like this:

                        high1       low1       high2        low2         offset
timestamp   channel                 
2021-01-01  A        966.6100   965.0300    967.7900    965.0300      27.307721
            B          1.4105     1.3900      1.4105      1.3900    2078.353670
2021-01-02  A        965.0300   966.4700    966.4800    965.0000      35.402437
            B          1.3900     1.3890      1.4028      1.3890     726.717821
2021-01-03  A        966.4600   966.0100    967.6800    965.4200      19.896296
            B             NaN        NaN         NaN         NaN            NaN
2021-01-04  A        966.6300   967.0000    967.0000    966.0300      12.958161
            B          1.4139     1.4140      1.4140      1.4139     692.804577
2021-01-05  A        967.0000   967.2000    967.2000    967.0000      10.345234
            B             NaN        NaN         NaN         NaN            NaN
2021-01-06  A        967.2000   967.0000    967.2500    967.0000       7.026761
            B          1.4140     1.4182      1.4182      1.4140     604.725766

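Now I would like to replace the NaN values with the previous data point in the same column for the same channel (A or B). I am aware of pandas.fillna(method='ffill'), but I do not understand how to access and set the matching sub-table.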

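df.xs('B', level='channel') does return the right data, but only as a copy, so correcting the data and assigning it back does not seem possible this way. df.loc[('01/01/2021',)] can be used to return a reference to the sub-table, but this does not work for the second index level, as in df.loc[(,'B')].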

Is there a pandas-like way to achieve this?

Thank you in advance!

【Answer 1】:

The short answer is that you are probably looking for

df.loc[(slice(None), 'B'), :] = df.loc[(slice(None), 'B'), :].fillna(method='ffill')

The long answer follows.

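In many cases where pandas returns a copy of the original data set, the result can be "written back" using the same indexer. Indexing a hierarchical index with df.loc is done using tuples, e.g.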

df.loc[(first_level_slice, second_level_slice, ...), :]
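As a minimal sketch of this write-back pattern, assuming a small made-up frame (the names `toy`, `ts`, `ch`, `value`, and `rows_b` are illustrative and do not come from the question), the same tuple indexer can be used on both sides of an assignment:

```python
import pandas as pd

# Hypothetical toy frame with a two-level index (all names are illustrative).
toy = pd.DataFrame(
    {"value": [1.0, 10.0, 2.0, 20.0]},
    index=pd.MultiIndex.from_product([["t1", "t2"], ["A", "B"]], names=["ts", "ch"]),
)

# Tuple indexer: every label on the first level, only 'B' on the second level.
rows_b = (slice(None), "B")

# Reading through the indexer and transforming yields a separate object ...
doubled = toy.loc[rows_b, :] * 2

# ... and assigning through the same indexer writes it back into toy.
toy.loc[rows_b, :] = doubled

print(toy)
```

Only the B rows change; the A rows are left untouched because the indexer never selects them.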

While the df.loc[:] shorthand notation cannot be used inside a tuple, every use of the colon : can be replaced with slice(None):

df.loc[(:, 'B'), :]            # bad: syntax error
df.loc[(slice(None), 'B'), :]  # good

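Of course, this requires knowing and remembering the slice(None) notation. The pd.IndexSlice helper provides an alternative notation that converts the shorthand into Python objects: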

>>> pd.IndexSlice[:, 'B'] == (slice(None), 'B')
True

The documentation also conveniently suggests introducing a shorter alias, which makes the short answer even shorter:

idx = pd.IndexSlice
df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')
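A side note on versions: as far as I know, recent pandas releases (2.1 and later) deprecate the `method` keyword of `fillna` in favour of the dedicated `DataFrame.ffill()` method. Below is a sketch of the same assignment written that way, using a shortened, made-up stand-in for the question's data (the frame `small` is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Hypothetical, shortened stand-in for the sensor data (one column, three days).
small = pd.DataFrame(
    {"high1": [966.61, 1.4105, 965.03, np.nan, 966.46, 1.39]},
    index=pd.MultiIndex.from_product(
        [["2021-01-01", "2021-01-02", "2021-01-03"], ["A", "B"]],
        names=["timestamp", "channel"],
    ),
)

# Same write-back as above, but with .ffill() instead of fillna(method='ffill').
small.loc[idx[:, "B"], :] = small.loc[idx[:, "B"], :].ffill()

print(small)
```

The missing B value on 2021-01-02 is filled from B on 2021-01-01, not from the A row directly above it, because only the B rows are selected before filling.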

To verify that it works, let's try it out:

In [1]: data = [
   ...: {"timestamp": "2021-01-01", "channel": "A", "high1": 966.6100, "low1": 965.0300, "high2": 967.7900, "low2": 965.0300, "offset": 27.307721},
   ...: {"timestamp": "2021-01-01", "channel": "B", "high1": 1.4105, "low1": 1.3900, "high2": 1.4105, "low2": 1.3900, "offset": 2078.353670},
   ...: {"timestamp": "2021-01-02", "channel": "A", "high1": 965.0300, "low1": 966.4700, "high2": 966.4800, "low2": 965.0000, "offset": 35.402437},
   ...: {"timestamp": "2021-01-02", "channel": "B", "high1": 1.3900, "low1": 1.3890, "high2": 1.4028, "low2": 1.3890, "offset": 726.717821},
   ...: {"timestamp": "2021-01-03", "channel": "A", "high1": 966.4600, "low1": 966.0100, "high2": 967.6800, "low2": 965.4200, "offset": 19.896296},
   ...: {"timestamp": "2021-01-03", "channel": "B"},
   ...: {"timestamp": "2021-01-04", "channel": "A", "high1": 966.6300, "low1": 967.0000, "high2": 967.0000, "low2": 966.0300, "offset": 12.958161},
   ...: {"timestamp": "2021-01-04", "channel": "B", "high1": 1.4139, "low1": 1.4140, "high2": 1.4140, "low2": 1.4139, "offset": 692.804577},
   ...: {"timestamp": "2021-01-05", "channel": "A", "high1": 967.0000, "low1": 967.2000, "high2": 967.2000, "low2": 967.0000, "offset": 10.345234},
   ...: {"timestamp": "2021-01-05", "channel": "B"},
   ...: {"timestamp": "2021-01-06", "channel": "A", "high1": 967.2000, "low1": 967.0000, "high2": 967.2500, "low2": 967.0000, "offset": 7.026761},
   ...: {"timestamp": "2021-01-06", "channel": "B", "high1": 1.4140, "low1": 1.4182, "high2": 1.4182, "low2": 1.4140, "offset": 604.725766},
   ...: ]

In [2]: import pandas as pd

In [3]: df = pd.DataFrame.from_records(data).set_index(keys=['timestamp', 'channel'])

In [4]: df
Out[4]: 
                       high1      low1     high2      low2       offset
timestamp  channel                                                     
2021-01-01 A        966.6100  965.0300  967.7900  965.0300    27.307721
           B          1.4105    1.3900    1.4105    1.3900  2078.353670
2021-01-02 A        965.0300  966.4700  966.4800  965.0000    35.402437
           B          1.3900    1.3890    1.4028    1.3890   726.717821
2021-01-03 A        966.4600  966.0100  967.6800  965.4200    19.896296
           B             NaN       NaN       NaN       NaN          NaN
2021-01-04 A        966.6300  967.0000  967.0000  966.0300    12.958161
           B          1.4139    1.4140    1.4140    1.4139   692.804577
2021-01-05 A        967.0000  967.2000  967.2000  967.0000    10.345234
           B             NaN       NaN       NaN       NaN          NaN
2021-01-06 A        967.2000  967.0000  967.2500  967.0000     7.026761
           B          1.4140    1.4182    1.4182    1.4140   604.725766

In [4]: df.loc[(slice(None), 'B'), :]
Out[4]: 
                     high1    low1   high2    low2       offset
timestamp  channel                                             
2021-01-01 B        1.4105  1.3900  1.4105  1.3900  2078.353670
2021-01-02 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-03 B           NaN     NaN     NaN     NaN          NaN
2021-01-04 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-05 B           NaN     NaN     NaN     NaN          NaN
2021-01-06 B        1.4140  1.4182  1.4182  1.4140   604.725766

In [5]: idx = pd.IndexSlice

In [6]: df.loc[idx[:, 'B'], :]
Out[6]: 
                     high1    low1   high2    low2       offset
timestamp  channel                                             
2021-01-01 B        1.4105  1.3900  1.4105  1.3900  2078.353670
2021-01-02 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-03 B           NaN     NaN     NaN     NaN          NaN
2021-01-04 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-05 B           NaN     NaN     NaN     NaN          NaN
2021-01-06 B        1.4140  1.4182  1.4182  1.4140   604.725766

In [7]: df.loc[idx[:, 'B'], :].fillna(method='ffill')
Out[7]: 
                     high1    low1   high2    low2       offset
timestamp  channel                                             
2021-01-01 B        1.4105  1.3900  1.4105  1.3900  2078.353670
2021-01-02 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-03 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-04 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-05 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-06 B        1.4140  1.4182  1.4182  1.4140   604.725766

In [8]: df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')

In [9]: df
Out[9]: 
                       high1      low1     high2      low2       offset
timestamp  channel                                                     
2021-01-01 A        966.6100  965.0300  967.7900  965.0300    27.307721
           B          1.4105    1.3900    1.4105    1.3900  2078.353670
2021-01-02 A        965.0300  966.4700  966.4800  965.0000    35.402437
           B          1.3900    1.3890    1.4028    1.3890   726.717821
2021-01-03 A        966.4600  966.0100  967.6800  965.4200    19.896296
           B          1.3900    1.3890    1.4028    1.3890   726.717821
2021-01-04 A        966.6300  967.0000  967.0000  966.0300    12.958161
           B          1.4139    1.4140    1.4140    1.4139   692.804577
2021-01-05 A        967.0000  967.2000  967.2000  967.0000    10.345234
           B          1.4139    1.4140    1.4140    1.4139   692.804577
2021-01-06 A        967.2000  967.0000  967.2500  967.0000     7.026761
           B          1.4140    1.4182    1.4182    1.4140   604.725766
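If every channel should be forward-filled rather than only B, one alternative (not part of the approach above, and only a sketch) is to group by the index level and forward-fill within each group. The frame `small` below is a made-up, shortened stand-in for the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: both channels have a gap on 2021-01-02.
small = pd.DataFrame(
    {"high1": [966.61, 1.4105, np.nan, np.nan, 966.46, 1.39]},
    index=pd.MultiIndex.from_product(
        [["2021-01-01", "2021-01-02", "2021-01-03"], ["A", "B"]],
        names=["timestamp", "channel"],
    ),
)

# Forward-fill within each channel: A's gap is filled from A, B's gap from B.
filled = small.groupby(level="channel").ffill()

print(filled)
```

A plain `small.ffill()` would not be equivalent here: it fills along the row order, so a gap in one channel would be filled from whichever row happens to sit directly above it, regardless of channel.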
