How to split a pandas time-series by NaN values
【Posted】: 2014-02-19 13:29:26 【Question】: I have a pandas time series that looks like this:
2007-02-06 15:00:00 0.780
2007-02-06 16:00:00 0.125
2007-02-06 17:00:00 0.875
2007-02-06 18:00:00 NaN
2007-02-06 19:00:00 0.565
2007-02-06 20:00:00 0.875
2007-02-06 21:00:00 0.910
2007-02-06 22:00:00 0.780
2007-02-06 23:00:00 NaN
2007-02-07 00:00:00 NaN
2007-02-07 01:00:00 0.780
2007-02-07 02:00:00 0.580
2007-02-07 03:00:00 0.880
2007-02-07 04:00:00 0.791
2007-02-07 05:00:00 NaN
I want to split the pandas TimeSeries wherever one or more consecutive NaN values occur. The goal is to end up with separate events:
Event1:
2007-02-06 15:00:00 0.780
2007-02-06 16:00:00 0.125
2007-02-06 17:00:00 0.875
Event2:
2007-02-06 19:00:00 0.565
2007-02-06 20:00:00 0.875
2007-02-06 21:00:00 0.910
2007-02-06 22:00:00 0.780
I could loop over every row, but is there a smarter way?
【Comments】:
【Answer 1】: You can use numpy.split and then filter the resulting list. Here is an example, assuming the column holding the values is labeled "value":
events = np.split(df, np.where(np.isnan(df.value))[0])
# removing NaN entries
events = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]
# removing empty DataFrames
events = [ev for ev in events if not ev.empty]
You will get a list containing all the events that were separated by NaN values.
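The same split can also be sketched with pandas alone, without going through numpy.split. This is not part of the original answer; it is a minimal sketch that rebuilds the question's sample data as a Series (the names `values`, `group_id` and `events` are illustrative) and relies on the fact that a running count of NaN values stays constant within each run of valid data.

```python
import numpy as np
import pandas as pd

# Sample data from the question: hourly readings with NaN gaps
values = [0.780, 0.125, 0.875, np.nan, 0.565, 0.875, 0.910, 0.780,
          np.nan, np.nan, 0.780, 0.580, 0.880, 0.791, np.nan]
idx = pd.date_range("2007-02-06 15:00", periods=len(values),
                    freq=pd.Timedelta(hours=1))
s = pd.Series(values, index=idx)

is_nan = s.isna()
# Every NaN increments the counter, so all valid values between
# two gaps share the same group id
group_id = is_nan.cumsum()

# Drop the NaN rows, then collect one Series per group id
events = [event for _, event in s[~is_nan].groupby(group_id[~is_nan])]

for number, event in enumerate(events, start=1):
    print(f"Event{number}:")
    print(event)
```

Because the NaN rows are dropped before grouping, runs of several consecutive NaN values produce no empty events, so no second filtering pass is needed.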
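【Comments】:
This seems like it should be easier, but I couldn't find a way.
FYI, this approach performs very poorly on large datasets (~500k rows) with sparse non-null values.
@bloudermilk That's a good observation. Did you find another solution for such large datasets?
@SaulloCastro Yes! I did manage to find an interesting approach that takes advantage of SparseIndex. I'll try to post my solution today.
@SaulloCastro See my answer below!
【Answer 2】: Note that this answer applies to pandas<0.25.0; if you are using 0.25.0 or later, see this answer by thesofakillers.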
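I found an efficient solution for very large and sparse datasets. In my case, hundreds of thousands of rows contained only a dozen or so short segments of data between NaN values. I (ab)used the internals of pandas.SparseIndex, a feature that helps compress sparse datasets in memory.
Given some data: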
import pandas as pd
import numpy as np
# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
# Three blocks of non-null data throughout timeseries
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)
which looks like:
2011-01-01 00:00:00 NaN
2011-01-01 00:00:01 NaN
2011-01-01 00:00:02 NaN
2011-01-01 00:00:03 NaN
..
2011-01-10 23:59:56 NaN
2011-01-10 23:59:57 NaN
2011-01-10 23:59:58 NaN
2011-01-10 23:59:59 NaN
Freq: S, Length: 864000, dtype: float64
We can find the blocks easily and efficiently:
# Convert to sparse then query index to find block locations
sparse_ts = dense_ts.to_sparse()
block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)
# Map the sparse blocks back to the dense timeseries
# (iloc's stop is exclusive, so the `- 1` leaves out the last sample of each block)
blocks = [dense_ts.iloc[start:(start + length - 1)] for (start, length) in block_locs]
Voilà:
[2011-01-01 00:08:20 0.531793
2011-01-01 00:08:21 0.484391
2011-01-01 00:08:22 0.022686
2011-01-01 00:08:23 -0.206495
2011-01-01 00:08:24 1.472209
2011-01-01 00:08:25 -1.261940
2011-01-01 00:08:26 -0.696388
2011-01-01 00:08:27 -0.219316
2011-01-01 00:08:28 -0.474840
Freq: S, dtype: float64, 2011-01-01 03:20:00 -0.147190
2011-01-01 03:20:01 0.299565
2011-01-01 03:20:02 -0.846878
2011-01-01 03:20:03 -0.100975
2011-01-01 03:20:04 1.288872
2011-01-01 03:20:05 -0.092474
2011-01-01 03:20:06 -0.214774
2011-01-01 03:20:07 -0.540479
2011-01-01 03:20:08 -0.661083
2011-01-01 03:20:09 1.129878
2011-01-01 03:20:10 0.791373
2011-01-01 03:20:11 0.119564
2011-01-01 03:20:12 0.345459
2011-01-01 03:20:13 -0.272132
Freq: S, dtype: float64, 2011-01-01 05:33:20 1.028268
2011-01-01 05:33:21 1.476468
2011-01-01 05:33:22 1.308881
2011-01-01 05:33:23 1.458202
2011-01-01 05:33:24 -0.874308
..
2011-01-01 05:34:02 0.941446
2011-01-01 05:34:03 -0.996767
2011-01-01 05:34:04 1.266660
2011-01-01 05:34:05 -0.391560
2011-01-01 05:34:06 1.498499
2011-01-01 05:34:07 -0.636908
2011-01-01 05:34:08 0.621681
Freq: S, dtype: float64]
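The first block in the listing above has 9 rows even though 10 values were assigned, because `iloc` already excludes its stop position and the extra `- 1` drops one more sample. As a hedged alternative that does not depend on sparse internals at all, the block boundaries can be sketched with NumPy only. This is not the answer's method; the names `valid`, `edges`, `starts` and `ends` are illustrative, and the series is shortened to one day to keep the example light.

```python
import numpy as np
import pandas as pd

# One day at per-second resolution, mostly NaN, with three blocks of data
rng = pd.date_range("2011-01-01", periods=24 * 60 * 60,
                    freq=pd.Timedelta(seconds=1))
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts.iloc[500:510] = np.random.randn(10)
dense_ts.iloc[12000:12015] = np.random.randn(15)
dense_ts.iloc[20000:20050] = np.random.randn(50)

valid = dense_ts.notna().to_numpy()
# Pad with False on both sides so blocks touching either end are detected
padded = np.concatenate(([False], valid, [False]))
edges = np.diff(padded.astype(np.int8))

starts = np.flatnonzero(edges == 1)   # first position of each block
ends = np.flatnonzero(edges == -1)    # one past the last position (exclusive)

# iloc's stop is exclusive, so start:end covers the whole block
blocks = [dense_ts.iloc[start:end] for start, end in zip(starts, ends)]
print([len(block) for block in blocks])
```

This scans the boolean mask once, so its cost grows with the length of the series rather than with the number of NaN values.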
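【Comments】:
Very nice!! I used this.
For me, this skips the last entry before the NaN; it probably just needs a small tweak. Remove the -1 in the blocks= line. Not sure what it is for.
to_sparse() is now deprecated (see pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html); it looks like SparseArray can be used going forward.
【Answer 3】: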
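For anyone looking for a non-deprecated (pandas>=0.25.0) version of bloudermilk's answer: after some digging through the pandas sparse source code, I came up with the following. I tried to keep it as similar to their answer as possible so that you can compare.
Given some data: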
import pandas as pd
import numpy as np
# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
# NaN data interspersed with 3 blocks of non-NaN data
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)
which looks like:
2011-01-01 00:00:00 NaN
2011-01-01 00:00:01 NaN
2011-01-01 00:00:02 NaN
2011-01-01 00:00:03 NaN
2011-01-01 00:00:04 NaN
..
2011-01-10 23:59:55 NaN
2011-01-10 23:59:56 NaN
2011-01-10 23:59:57 NaN
2011-01-10 23:59:58 NaN
2011-01-10 23:59:59 NaN
Freq: S, Length: 864000, dtype: float64
We can find the blocks easily and efficiently:
# Convert to sparse then query index to find block locations
# different way of converting to sparse in pandas>=0.25.0
sparse_ts = dense_ts.astype(pd.SparseDtype('float'))
# we need to use .values.sp_index.to_block_index() in this version of pandas
block_locs = zip(
sparse_ts.values.sp_index.to_block_index().blocs,
sparse_ts.values.sp_index.to_block_index().blengths,
)
# Map the sparse blocks back to the dense timeseries
# (iloc's stop is exclusive, so the `- 1` leaves out the last sample of each block)
blocks = [
dense_ts.iloc[start : (start + length - 1)]
for (start, length) in block_locs
]
Voilà:
> blocks
[2011-01-01 00:08:20 0.092338
2011-01-01 00:08:21 1.196703
2011-01-01 00:08:22 0.936586
2011-01-01 00:08:23 -0.354768
2011-01-01 00:08:24 -0.209642
2011-01-01 00:08:25 -0.750103
2011-01-01 00:08:26 1.344343
2011-01-01 00:08:27 1.446148
2011-01-01 00:08:28 1.174443
Freq: S, dtype: float64,
2011-01-01 03:20:00 1.327026
2011-01-01 03:20:01 -0.431162
2011-01-01 03:20:02 -0.461407
2011-01-01 03:20:03 -1.330671
2011-01-01 03:20:04 -0.892480
2011-01-01 03:20:05 -0.323433
2011-01-01 03:20:06 2.520965
2011-01-01 03:20:07 0.140757
2011-01-01 03:20:08 -1.688278
2011-01-01 03:20:09 0.856346
2011-01-01 03:20:10 0.013968
2011-01-01 03:20:11 0.204514
2011-01-01 03:20:12 0.287756
2011-01-01 03:20:13 -0.727400
Freq: S, dtype: float64,
2011-01-01 05:33:20 -1.409744
2011-01-01 05:33:21 0.338251
2011-01-01 05:33:22 0.215555
2011-01-01 05:33:23 -0.309874
2011-01-01 05:33:24 0.753737
2011-01-01 05:33:25 -0.349966
2011-01-01 05:33:26 0.074758
2011-01-01 05:33:27 -1.574485
2011-01-01 05:33:28 0.595844
2011-01-01 05:33:29 -0.670004
2011-01-01 05:33:30 1.655479
2011-01-01 05:33:31 -0.362853
2011-01-01 05:33:32 0.167355
2011-01-01 05:33:33 0.703780
2011-01-01 05:33:34 2.633756
2011-01-01 05:33:35 1.898891
2011-01-01 05:33:36 -1.129365
2011-01-01 05:33:37 -0.765057
2011-01-01 05:33:38 0.279869
2011-01-01 05:33:39 1.388705
2011-01-01 05:33:40 -1.420761
2011-01-01 05:33:41 0.455692
2011-01-01 05:33:42 0.367106
2011-01-01 05:33:43 0.856598
2011-01-01 05:33:44 1.920748
2011-01-01 05:33:45 0.648581
2011-01-01 05:33:46 -0.606784
2011-01-01 05:33:47 -0.246285
2011-01-01 05:33:48 -0.040520
2011-01-01 05:33:49 1.422764
2011-01-01 05:33:50 -1.686568
2011-01-01 05:33:51 1.282430
2011-01-01 05:33:52 1.358482
2011-01-01 05:33:53 -0.998765
2011-01-01 05:33:54 -0.009527
2011-01-01 05:33:55 0.647671
2011-01-01 05:33:56 -1.098435
2011-01-01 05:33:57 -0.638245
2011-01-01 05:33:58 -1.820668
2011-01-01 05:33:59 0.768250
2011-01-01 05:34:00 -1.029975
2011-01-01 05:34:01 -0.744205
2011-01-01 05:34:02 1.627130
2011-01-01 05:34:03 2.058689
2011-01-01 05:34:04 -1.194971
2011-01-01 05:34:05 1.293214
2011-01-01 05:34:06 0.029523
2011-01-01 05:34:07 -0.405785
2011-01-01 05:34:08 0.837123
Freq: S, dtype: float64]
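To see what `blocs` and `blengths` actually hold, a tiny series is easier to inspect than 864,000 rows. The following is a minimal sketch, assuming the same internal attributes used in the answer (`sp_index`, `to_block_index()`, `blocs`, `blengths`); they are pandas internals rather than documented public API, so they may change between versions. The names `s`, `block_index`, `starts`, `lengths` and `runs` are illustrative.

```python
import numpy as np
import pandas as pd

# Tiny series: two runs of valid data separated by NaN
s = pd.Series([np.nan, 1.0, 2.0, np.nan, np.nan, 3.0])

sparse_values = s.astype(pd.SparseDtype("float")).array
# Convert to a block index once and reuse it for both attributes
block_index = sparse_values.sp_index.to_block_index()

starts = block_index.blocs.tolist()      # position where each run begins
lengths = block_index.blengths.tolist()  # number of valid samples in each run

# iloc's stop is exclusive, so start + length covers the whole run
runs = [s.iloc[start:start + length] for start, length in zip(starts, lengths)]

print(starts, lengths)
for run in runs:
    print(run.tolist())
```

Calling `to_block_index()` once avoids building the block index twice, and slicing up to `start + length` keeps the final sample of every run.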
【Comments】: