How to split a pandas time-series by NAN values

Posted 2023-03-11


【Title】: How to split a pandas time-series by NAN values 【Posted】: 2014-02-19 13:29:26 【Question】:

I have a pandas time series that looks like this:

2007-02-06 15:00:00    0.780
2007-02-06 16:00:00    0.125
2007-02-06 17:00:00    0.875
2007-02-06 18:00:00      NaN
2007-02-06 19:00:00    0.565
2007-02-06 20:00:00    0.875
2007-02-06 21:00:00    0.910
2007-02-06 22:00:00    0.780
2007-02-06 23:00:00      NaN
2007-02-07 00:00:00      NaN
2007-02-07 01:00:00    0.780
2007-02-07 02:00:00    0.580
2007-02-07 03:00:00    0.880
2007-02-07 04:00:00    0.791
2007-02-07 05:00:00      NaN

I want to split the pandas TimeSeries wherever one or more consecutive NaN values occur. The goal is to end up with separate events:

Event1:
2007-02-06 15:00:00    0.780
2007-02-06 16:00:00    0.125
2007-02-06 17:00:00    0.875

Event2:
2007-02-06 19:00:00    0.565
2007-02-06 20:00:00    0.875
2007-02-06 21:00:00    0.910
2007-02-06 22:00:00    0.780

I could loop over every row, but is there a smarter way?

【Question comments】:

【Answer 1】:

You can use numpy.split and then filter the resulting list. Here is an example, assuming the column holding the values is labelled "value":

events = np.split(df, np.where(np.isnan(df.value))[0])
# removing NaN entries
events = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]
# removing empty DataFrames
events = [ev for ev in events if not ev.empty]

You will get a list containing all the events that were separated by NaN values.
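For comparison, the same split can be sketched in pure pandas, without np.split. This is a minimal sketch and not the answerer's method: it assumes the data is a Series (rebuilt here from the question's sample; the names ts, labels, and events are illustrative) and labels each run of valid values with a running count of the NaNs seen so far.

```python
import numpy as np
import pandas as pd

# Rebuild the hourly sample from the question (assumed layout: one float per hour)
index = pd.date_range("2007-02-06 15:00", periods=15, freq=pd.Timedelta(hours=1))
values = [0.780, 0.125, 0.875, np.nan, 0.565, 0.875, 0.910, 0.780,
          np.nan, np.nan, 0.780, 0.580, 0.880, 0.791, np.nan]
ts = pd.Series(values, index=index)

is_nan = ts.isna()
# Every NaN increments the counter, so all valid rows between two
# NaN runs share the same label
labels = is_nan.cumsum()

# Drop the NaN rows, then collect one Series per label
events = [group for _, group in ts[~is_nan].groupby(labels[~is_nan])]
```

Each element of events is a Series that keeps its original timestamps; with the sample above, events[1] starts at 2007-02-06 19:00:00.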

【Comments】:

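This seems like it should be easier, but I couldn't find a way. FYI, this approach performs very poorly on large datasets (about 500k rows) with sparse non-null values.

@bloudermilk That's a good observation. Did you find another solution for such a large dataset?

@SaulloCastro Yes! I did manage to find an interesting approach that takes advantage of SparseIndex. I'll try to post my solution today.

@SaulloCastro See my answer below!

【Answer 2】: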

Note that this answer applies to pandas<0.25.0; if you are using 0.25.0 or later, see the answer by thesofakillers.

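I found an efficient solution for very large and sparse datasets. In my case, hundreds of thousands of rows contained only a dozen or so short segments of data between NaN values. I (ab)used the internals of pandas.SparseIndex, a feature that helps compress sparse datasets in memory.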

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)

# Three blocks of non-null data throughout timeseries
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
                       ..
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks easily and efficiently:

# Convert to sparse then query index to find block locations
sparse_ts = dense_ts.to_sparse()
block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)

# Map the sparse blocks back to the dense timeseries
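# NOTE: iloc's end bound is exclusive, so the "- 1" drops the last row of each
# block (visible in the output below); use start + length to keep every row
blocks = [dense_ts.iloc[start:(start + length - 1)] for (start, length) in block_locs]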

Voilà:

[2011-01-01 00:08:20    0.531793
 2011-01-01 00:08:21    0.484391
 2011-01-01 00:08:22    0.022686
 2011-01-01 00:08:23   -0.206495
 2011-01-01 00:08:24    1.472209
 2011-01-01 00:08:25   -1.261940
 2011-01-01 00:08:26   -0.696388
 2011-01-01 00:08:27   -0.219316
 2011-01-01 00:08:28   -0.474840
 Freq: S, dtype: float64, 2011-01-01 03:20:00   -0.147190
 2011-01-01 03:20:01    0.299565
 2011-01-01 03:20:02   -0.846878
 2011-01-01 03:20:03   -0.100975
 2011-01-01 03:20:04    1.288872
 2011-01-01 03:20:05   -0.092474
 2011-01-01 03:20:06   -0.214774
 2011-01-01 03:20:07   -0.540479
 2011-01-01 03:20:08   -0.661083
 2011-01-01 03:20:09    1.129878
 2011-01-01 03:20:10    0.791373
 2011-01-01 03:20:11    0.119564
 2011-01-01 03:20:12    0.345459
 2011-01-01 03:20:13   -0.272132
 Freq: S, dtype: float64, 2011-01-01 05:33:20    1.028268
 2011-01-01 05:33:21    1.476468
 2011-01-01 05:33:22    1.308881
 2011-01-01 05:33:23    1.458202
 2011-01-01 05:33:24   -0.874308
                              ..
 2011-01-01 05:34:02    0.941446
 2011-01-01 05:34:03   -0.996767
 2011-01-01 05:34:04    1.266660
 2011-01-01 05:34:05   -0.391560
 2011-01-01 05:34:06    1.498499
 2011-01-01 05:34:07   -0.636908
 2011-01-01 05:34:08    0.621681
 Freq: S, dtype: float64]
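The (start, length) pairs that the sparse index exposes can also be derived from a plain boolean mask, which avoids the deprecated to_sparse() call. The following is a minimal sketch of that alternative, not the answerer's method; find_blocks is a hypothetical helper name, and the data is a one-day version of the example above.

```python
import numpy as np
import pandas as pd


def find_blocks(series):
    """Return (start, length) pairs for each run of consecutive non-NaN values."""
    valid = series.notna().to_numpy()
    # Pad with False on both sides so runs touching either end are detected
    padded = np.concatenate(([False], valid, [False]))
    # Positions where the mask flips; they alternate run start, run stop (exclusive)
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = changes[::2], changes[1::2]
    return list(zip(starts.tolist(), (stops - starts).tolist()))


# One day at per-second resolution with three blocks of non-null data
index = pd.date_range("2011-01-01", periods=24 * 60 * 60, freq=pd.Timedelta(seconds=1))
dense_ts = pd.Series(np.nan, index=index, dtype=np.float64)
gen = np.random.default_rng(0)
dense_ts.iloc[500:510] = gen.standard_normal(10)
dense_ts.iloc[12000:12015] = gen.standard_normal(15)
dense_ts.iloc[20000:20050] = gen.standard_normal(50)

# iloc's end bound is exclusive, so start + length covers the whole run
blocks = [dense_ts.iloc[start:start + length] for start, length in find_blocks(dense_ts)]
```

Because the slice ends at start + length, each block keeps its final row, unlike the output shown above.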

【Comments】:

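Very nice!! I've used this.

For me this skips the last entry before the NaN; it probably just needs a small tweak. Remove the -1 in the blocks= line. Not sure what it's for.

to_sparse() is now deprecated (see pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html); it looks like SparseArray can be used going forward.

【Answer 3】: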

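For anyone looking for a non-deprecated (pandas>=0.25.0) version of bloudermilk's answer: after a bit of digging through the pandas sparse source code, I came up with the following. I tried to keep it as similar to their answer as possible so that you can compare: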

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')

# NaN data interspersed with 3 blocks of non-NaN data
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
2011-01-01 00:00:04   NaN
                       ..
2011-01-10 23:59:55   NaN
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks easily and efficiently:

# Convert to sparse then query index to find block locations
# different way of converting to sparse in pandas>=0.25.0
sparse_ts = dense_ts.astype(pd.SparseDtype('float'))
# we need to use .values.sp_index.to_block_index() in this version of pandas
block_locs = zip(
    sparse_ts.values.sp_index.to_block_index().blocs,
    sparse_ts.values.sp_index.to_block_index().blengths,
)
# Map the sparse blocks back to the dense timeseries
blocks = [
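    # NOTE: iloc's end bound is exclusive, so the "- 1" drops the last row of
    # each block (visible in the output below); use start + length to keep it
    dense_ts.iloc[start : (start + length - 1)]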
    for (start, length) in block_locs
]

Voilà:

> blocks
[2011-01-01 00:08:20    0.092338
 2011-01-01 00:08:21    1.196703
 2011-01-01 00:08:22    0.936586
 2011-01-01 00:08:23   -0.354768
 2011-01-01 00:08:24   -0.209642
 2011-01-01 00:08:25   -0.750103
 2011-01-01 00:08:26    1.344343
 2011-01-01 00:08:27    1.446148
 2011-01-01 00:08:28    1.174443
 Freq: S, dtype: float64,
 2011-01-01 03:20:00    1.327026
 2011-01-01 03:20:01   -0.431162
 2011-01-01 03:20:02   -0.461407
 2011-01-01 03:20:03   -1.330671
 2011-01-01 03:20:04   -0.892480
 2011-01-01 03:20:05   -0.323433
 2011-01-01 03:20:06    2.520965
 2011-01-01 03:20:07    0.140757
 2011-01-01 03:20:08   -1.688278
 2011-01-01 03:20:09    0.856346
 2011-01-01 03:20:10    0.013968
 2011-01-01 03:20:11    0.204514
 2011-01-01 03:20:12    0.287756
 2011-01-01 03:20:13   -0.727400
 Freq: S, dtype: float64,
 2011-01-01 05:33:20   -1.409744
 2011-01-01 05:33:21    0.338251
 2011-01-01 05:33:22    0.215555
 2011-01-01 05:33:23   -0.309874
 2011-01-01 05:33:24    0.753737
 2011-01-01 05:33:25   -0.349966
 2011-01-01 05:33:26    0.074758
 2011-01-01 05:33:27   -1.574485
 2011-01-01 05:33:28    0.595844
 2011-01-01 05:33:29   -0.670004
 2011-01-01 05:33:30    1.655479
 2011-01-01 05:33:31   -0.362853
 2011-01-01 05:33:32    0.167355
 2011-01-01 05:33:33    0.703780
 2011-01-01 05:33:34    2.633756
 2011-01-01 05:33:35    1.898891
 2011-01-01 05:33:36   -1.129365
 2011-01-01 05:33:37   -0.765057
 2011-01-01 05:33:38    0.279869
 2011-01-01 05:33:39    1.388705
 2011-01-01 05:33:40   -1.420761
 2011-01-01 05:33:41    0.455692
 2011-01-01 05:33:42    0.367106
 2011-01-01 05:33:43    0.856598
 2011-01-01 05:33:44    1.920748
 2011-01-01 05:33:45    0.648581
 2011-01-01 05:33:46   -0.606784
 2011-01-01 05:33:47   -0.246285
 2011-01-01 05:33:48   -0.040520
 2011-01-01 05:33:49    1.422764
 2011-01-01 05:33:50   -1.686568
 2011-01-01 05:33:51    1.282430
 2011-01-01 05:33:52    1.358482
 2011-01-01 05:33:53   -0.998765
 2011-01-01 05:33:54   -0.009527
 2011-01-01 05:33:55    0.647671
 2011-01-01 05:33:56   -1.098435
 2011-01-01 05:33:57   -0.638245
 2011-01-01 05:33:58   -1.820668
 2011-01-01 05:33:59    0.768250
 2011-01-01 05:34:00   -1.029975
 2011-01-01 05:34:01   -0.744205
 2011-01-01 05:34:02    1.627130
 2011-01-01 05:34:03    2.058689
 2011-01-01 05:34:04   -1.194971
 2011-01-01 05:34:05    1.293214
 2011-01-01 05:34:06    0.029523
 2011-01-01 05:34:07   -0.405785
 2011-01-01 05:34:08    0.837123
 Freq: S, dtype: float64]
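A variant of the snippet above that keeps the last row of every block might look like the sketch below. It relies on the same internal attributes the answer uses (sp_index, to_block_index(), blocs, blengths), which are not part of the documented public API and may change between pandas versions; the data is shortened to one day and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# One day at per-second resolution with three blocks of non-NaN data
index = pd.date_range("2011-01-01", periods=24 * 60 * 60, freq=pd.Timedelta(seconds=1))
dense_ts = pd.Series(np.nan, index=index, dtype=np.float64)
gen = np.random.default_rng(0)
dense_ts.iloc[500:510] = gen.standard_normal(10)
dense_ts.iloc[12000:12015] = gen.standard_normal(15)
dense_ts.iloc[20000:20050] = gen.standard_normal(50)

# Convert to a sparse dtype and build the block index only once
sparse_ts = dense_ts.astype(pd.SparseDtype("float"))
block_index = sparse_ts.values.sp_index.to_block_index()

# Slice with an exclusive end of start + length so no row is dropped
blocks = [
    dense_ts.iloc[int(start):int(start) + int(length)]
    for start, length in zip(block_index.blocs, block_index.blengths)
]
```

With this slice the three blocks contain 10, 15, and 50 rows, matching the ranges that were filled in.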

【Comments】:
