Fill NaN in second level of multi indexed pandas data frame
Posted: 2021-12-30 20:26:13
Question: I have a multi-indexed pandas DataFrame with sensor data that looks like this:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 A 966.6100 965.0300 967.7900 965.0300 27.307721
B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 A 965.0300 966.4700 966.4800 965.0000 35.402437
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 A 966.4600 966.0100 967.6800 965.4200 19.896296
B NaN NaN NaN NaN NaN
2021-01-04 A 966.6300 967.0000 967.0000 966.0300 12.958161
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 A 967.0000 967.2000 967.2000 967.0000 10.345234
B NaN NaN NaN NaN NaN
2021-01-06 A 967.2000 967.0000 967.2500 967.0000 7.026761
B 1.4140 1.4182 1.4182 1.4140 604.725766
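Now I would like to replace each NaN with the previous data point from the same column for the same index entry (A, B). I know about pandas' fillna(method='ffill'), but I do not understand the concept of accessing and setting the matching sub-table.
df.xs('B', level='channel') does return the right data, but as a copy, so correcting the data and assigning it back does not seem possible this way.
df.loc[('01/01/2021',)] can be used to return a reference to a sub-table, but this does not work for the second index level, as in df.loc[(,'B')].
Is there a pandas-like way to achieve this?
Thank you in advance!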
Answer 1: The short answer is that you are probably looking for
df.loc[(slice(None), 'B'), :] = df.loc[(slice(None), 'B'), :].fillna(method='ffill')
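The long answer follows.
In many cases where pandas returns a copy of the original data set, the result can be "written back" using the same indexer. Indexing a hierarchical index with df.loc is done with tuples, for example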
df.loc[(first_level_slice, second_level_slice, ...), :]
Although the colon shorthand from df.loc[:] cannot be used inside a tuple, every use of the colon : can be replaced by slice(None):
df.loc[(:, 'B'), :] # bad: syntax error
df.loc[(slice(None), 'B'), :] # good
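To make the tuple form concrete, here is a minimal sketch on a tiny made-up frame (the values and the three-row layout are assumptions for illustration, not the asker's full data). It selects every row whose second index level is 'B':

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same two index levels as in the question.
index = pd.MultiIndex.from_product(
    [["2021-01-01", "2021-01-02"], ["A", "B"]],
    names=["timestamp", "channel"],
)
df = pd.DataFrame({"high1": [966.61, 1.4105, 965.03, np.nan]}, index=index)

# slice(None) stands in for ":" on the first level; 'B' selects the second level.
sub = df.loc[(slice(None), "B"), :]
print(sub)
```

The selection keeps both index levels, so it can later be used as an assignment target with the same indexer.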
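Of course, this requires knowing and remembering the slice(None) notation. The pd.IndexSlice helper offers an alternative: it converts the shorthand notation into the equivalent Python objects: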
>>> pd.IndexSlice[:, 'B'] == (slice(None), 'B')
True
The documentation conveniently suggests introducing a shorter alias, which makes the short answer even shorter:
idx = pd.IndexSlice
df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')
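A note for newer pandas releases: as far as I know, the method= keyword of fillna has been deprecated since pandas 2.1, and DataFrame.ffill() is the suggested replacement. The sketch below applies the same write-back pattern with ffill() on a small made-up frame (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data: channel B is missing on 2021-01-02.
index = pd.MultiIndex.from_product(
    [["2021-01-01", "2021-01-02", "2021-01-03"], ["A", "B"]],
    names=["timestamp", "channel"],
)
df = pd.DataFrame(
    {"high1": [966.61, 1.4105, 965.03, np.nan, 966.46, 1.4139]},
    index=index,
)

idx = pd.IndexSlice
# Forward-fill only the B rows, then write the result back with the same indexer.
df.loc[idx[:, "B"], :] = df.loc[idx[:, "B"], :].ffill()
print(df)
```

Only the B rows are touched; the A rows keep their original values because they are not part of the selection.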
To verify that it works, let's try it:
In [1]: data = [
...: {"timestamp": "2021-01-01", "channel": "A", "high1": 966.6100, "low1": 965.0300, "high2": 967.7900, "low2": 965.0300, "offset": 27.307721},
...: {"timestamp": "2021-01-01", "channel": "B", "high1": 1.4105, "low1": 1.3900, "high2": 1.4105, "low2": 1.3900, "offset": 2078.353670},
...: {"timestamp": "2021-01-02", "channel": "A", "high1": 965.0300, "low1": 966.4700, "high2": 966.4800, "low2": 965.0000, "offset": 35.402437},
...: {"timestamp": "2021-01-02", "channel": "B", "high1": 1.3900, "low1": 1.3890, "high2": 1.4028, "low2": 1.3890, "offset": 726.717821},
...: {"timestamp": "2021-01-03", "channel": "A", "high1": 966.4600, "low1": 966.0100, "high2": 967.6800, "low2": 965.4200, "offset": 19.896296},
...: {"timestamp": "2021-01-03", "channel": "B"},
...: {"timestamp": "2021-01-04", "channel": "A", "high1": 966.6300, "low1": 967.0000, "high2": 967.0000, "low2": 966.0300, "offset": 12.958161},
...: {"timestamp": "2021-01-04", "channel": "B", "high1": 1.4139, "low1": 1.4140, "high2": 1.4140, "low2": 1.4139, "offset": 692.804577},
...: {"timestamp": "2021-01-05", "channel": "A", "high1": 967.0000, "low1": 967.2000, "high2": 967.2000, "low2": 967.0000, "offset": 10.345234},
...: {"timestamp": "2021-01-05", "channel": "B"},
...: {"timestamp": "2021-01-06", "channel": "A", "high1": 967.2000, "low1": 967.0000, "high2": 967.2500, "low2": 967.0000, "offset": 7.026761},
...: {"timestamp": "2021-01-06", "channel": "B", "high1": 1.4140, "low1": 1.4182, "high2": 1.4182, "low2": 1.4140, "offset": 604.725766},
...: ]
In [2]: import pandas as pd
In [3]: df = pd.DataFrame.from_records(data).set_index(keys=['timestamp', 'channel'])
In [4]: df
Out[4]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 A 966.6100 965.0300 967.7900 965.0300 27.307721
B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 A 965.0300 966.4700 966.4800 965.0000 35.402437
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 A 966.4600 966.0100 967.6800 965.4200 19.896296
B NaN NaN NaN NaN NaN
2021-01-04 A 966.6300 967.0000 967.0000 966.0300 12.958161
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 A 967.0000 967.2000 967.2000 967.0000 10.345234
B NaN NaN NaN NaN NaN
2021-01-06 A 967.2000 967.0000 967.2500 967.0000 7.026761
B 1.4140 1.4182 1.4182 1.4140 604.725766
In [4]: df.loc[(slice(None), 'B'), :]
Out[4]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 B NaN NaN NaN NaN NaN
2021-01-04 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 B NaN NaN NaN NaN NaN
2021-01-06 B 1.4140 1.4182 1.4182 1.4140 604.725766
In [5]: idx = pd.IndexSlice
In [6]: df.loc[idx[:, 'B'], :]
Out[6]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 B NaN NaN NaN NaN NaN
2021-01-04 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 B NaN NaN NaN NaN NaN
2021-01-06 B 1.4140 1.4182 1.4182 1.4140 604.725766
In [7]: df.loc[idx[:, 'B'], :].fillna(method='ffill')
Out[7]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-04 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-06 B 1.4140 1.4182 1.4182 1.4140 604.725766
In [8]: df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')
In [9]: df
Out[9]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 A 966.6100 965.0300 967.7900 965.0300 27.307721
B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 A 965.0300 966.4700 966.4800 965.0000 35.402437
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 A 966.4600 966.0100 967.6800 965.4200 19.896296
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-04 A 966.6300 967.0000 967.0000 966.0300 12.958161
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 A 967.0000 967.2000 967.2000 967.0000 10.345234
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-06 A 967.2000 967.0000 967.2500 967.0000 7.026761
B 1.4140 1.4182 1.4182 1.4140 604.725766
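If gaps can occur in more than one channel, a different technique from the one used in this answer may be worth considering: grouping by the channel level and forward-filling within each group. This is only a sketch on made-up data (the timestamps t1..t3 and the values are assumptions), and it returns a new frame rather than writing into a selection:

```python
import numpy as np
import pandas as pd

# Illustrative data: both channels have gaps.
index = pd.MultiIndex.from_product(
    [["t1", "t2", "t3"], ["A", "B"]],
    names=["timestamp", "channel"],
)
df = pd.DataFrame(
    {"high1": [1.0, 10.0, np.nan, np.nan, 3.0, np.nan]},
    index=index,
)

# Forward-fill within each channel, so A never inherits values from B.
filled = df.groupby(level="channel").ffill()
print(filled)
```

A plain df.ffill() would not be equivalent here, because it fills from the previous row regardless of channel, so an A row could pick up a B value.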