Filtering pandas or numpy arrays for continuous series with minimum window length

Posted 2023-03-11

Asked: 2016-04-09 13:24:07

Question:

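I would like to filter a numpy array (or pandas DataFrame) so that only continuous series of the same value with a length of at least window_size are kept, and everything else is set to 0.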

For example:

[1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]

should become, with a window size of 4,

[0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1]

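I tried using rolling_apply and scipy.ndimage.filters.generic_filter, but because of the nature of rolling kernel functions I don't think this is the right approach (and I am stuck with it at the moment).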

I'll insert my attempt here anyway:

import numpy as np
import pandas as pd
from scipy import ndimage

df = pd.DataFrame({'x': np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])})
df_alt = df.copy()

def filter_df(df, colname, window_size):
    rolling_func = lambda z: float(z.sum() >= window_size)
    # pd.rolling_apply was removed from pandas; Series.rolling(...).apply(...)
    # is the current spelling of the same rolling operation.
    df[colname] = df[colname].rolling(window_size,
                                      min_periods=window_size // 2,
                                      center=True).apply(rolling_func, raw=True)

def filter_alt(df, colname, window_size):
    rolling_func = lambda z: float(z.sum() >= window_size)
    return ndimage.generic_filter(df[colname].values,
                                  rolling_func,
                                  size=window_size,
                                  origin=0)

window_size = 4
filter_df(df, 'x', window_size)
print(df)
filter_alt(df_alt, 'x', window_size)

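Comments:

How do you want to handle series of identical values that are longer than the window size? Are the values always the same, or can they differ within one array?

I want to keep those as a series of 1s as well. Like: [1,1,1,1,1] -> [1,1,1,1,1]

Answer 1: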

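This is basically a morphological opening from image processing (an erosion followed by a dilation), applied to the 1D case. Such an operation can be implemented with convolution. NumPy supports 1D convolution, so we are in luck! To solve our case, it would look something like this -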

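import numpy as np

def conv_app(A, WSZ):
    K = np.ones(WSZ,dtype=int)  # box kernel of window length
    L = WSZ-1
    # Inner convolution marks windows that are entirely 1s; the outer one
    # spreads each mark back over the window it came from.
    return (np.convolve(np.convolve(A,K)>=WSZ,K)[L:-L]>0).astype(int)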

Sample run -

In [581]: A
Out[581]: array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

In [582]: conv_app(A,4)
Out[582]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

In [583]: A = np.append(1,A) # Prepend 1 and see what happens!

In [584]: A
Out[584]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

In [585]: conv_app(A,4)
Out[585]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
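
To make the two passes easier to follow, here is a minimal sketch that returns the intermediate arrays instead of chaining everything in one expression. The name conv_steps is hypothetical (it is not part of the answer above), and the sketch assumes the input contains only 0s and 1s.

```python
import numpy as np

def conv_steps(A, WSZ):
    """Return the intermediate arrays of the two-pass convolution."""
    A = np.asarray(A)
    K = np.ones(WSZ, dtype=int)
    # Pass 1: a full convolution with a box kernel gives the sum over every
    # window; the sum reaches WSZ only where a window consists entirely of 1s.
    window_sums = np.convolve(A, K)
    full_windows = (window_sums >= WSZ).astype(int)
    # Pass 2: convolving the marks with the same kernel spreads each mark
    # back over the WSZ positions of the window it came from.
    spread = np.convolve(full_windows, K)
    # The two full convolutions lengthen the array by 2*(WSZ-1) samples in
    # total; trimming WSZ-1 from each end realigns the result with A.
    L = WSZ - 1
    result = (spread[L:len(spread) - L] > 0).astype(int)
    return window_sums, full_windows, result

A = np.array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
sums, marks, out = conv_steps(A, 4)
print(sums)   # window sums, length len(A) + WSZ - 1
print(marks)  # 1 at the last index of every all-ones window
print(out)    # filtered array, same length as A
```

Slicing with an explicit end index also keeps the WSZ = 1 case from collapsing to an empty array, which is what a [L:-L] slice would produce when L is 0.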

Runtime test -

This section benchmarks a couple of the other approaches listed for solving the posted problem. Their definitions are as follows -

import itertools
import numpy as np
import pandas as pd

def groupby_app(A,WSZ): # @lambo477's solution
    groups = itertools.groupby(A)
    result = []
    for group in groups:
        group_items = [item for item in group[1]]
        group_length = len(group_items)
        if group_length >= WSZ:
            result.extend([item for item in group_items])
        else:
            result.extend([0]*group_length)
    return result

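def stride_tricks_app(arr, window): # @ajcr's solution
    # pd.rolling_min was removed from pandas; Series.rolling(window).min()
    # is the current equivalent. The copy guarantees a writeable float array.
    x = pd.Series(arr).rolling(window).min().to_numpy(copy=True)
    x[:window-1] = 0
    s = x.strides[0]  # item stride in bytes (8 for float64)
    y = np.lib.stride_tricks.as_strided(x, (len(x)-window+1, window), (s, s))
    y[y[:, -1] == 1] = 1
    return x.astype(int)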

Timings -

In [541]: A = np.random.randint(0,2,(100000))

In [542]: WSZ = 4

In [543]: %timeit groupby_app(A,WSZ)
10 loops, best of 3: 74.5 ms per loop

In [544]: %timeit stride_tricks_app(A,WSZ)
100 loops, best of 3: 3.35 ms per loop

In [545]: %timeit conv_app(A,WSZ)
100 loops, best of 3: 2.82 ms per loop
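
Timings only mean something if the approaches return the same result, so a quick cross-check on random input is worth running first. The following is a rough sketch rather than part of the original benchmark: the seed, the array sizes and the number of trials are arbitrary choices, and groupby_app is written in a slightly shortened form.

```python
import itertools
import numpy as np

def conv_app(A, WSZ):
    K = np.ones(WSZ, dtype=int)
    L = WSZ - 1
    return (np.convolve(np.convolve(A, K) >= WSZ, K)[L:-L] > 0).astype(int)

def groupby_app(A, WSZ):
    result = []
    for _, group in itertools.groupby(A):
        items = list(group)
        # Keep runs of at least WSZ items, zero out shorter ones.
        result.extend(items if len(items) >= WSZ else [0] * len(items))
    return result

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
for _ in range(200):
    A = rng.integers(0, 2, size=int(rng.integers(8, 60)))
    WSZ = int(rng.integers(2, 6))
    expected = np.array(groupby_app(A, WSZ), dtype=int)
    assert np.array_equal(conv_app(A, WSZ), expected)
print("conv_app and groupby_app agree on all random trials")
```

The check is restricted to arrays of 0s and 1s with WSZ of at least 2, because that is the input range conv_app is written for.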

Comments:

I should have known you'd find a fast way! I briefly considered convolution but didn't think of applying it twice. Nice solution.

Answer 2:

You could use itertools.groupby as follows:

import itertools
import numpy as np

my_array = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
window_size = 4

groups = itertools.groupby(my_array)

result = []
for group in groups:
    group_items = [item for item in group[1]]
    group_length = len(group_items)
    if group_length >= window_size:
        result.extend([item for item in group_items])
    else:
        result.extend([0]*group_length)

print(result)

Output

[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

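Comments:

This is also a nice solution, and I believe a bit faster than the one I provided.

Thanks lambo, I will test your solution as well and see which one works better.

itertools groupby is amazingly fast on my fairly large dataframe. I tested it against johnchase's solution over 10000 iterations, and it performed better in terms of both memory and runtime, so I will consider this the better solution. John's looks quite nice but lacks performance. Thanks anyway!

Answer 3: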

Here's one approach using pd.rolling_min and stride tricks:

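import numpy as np
import pandas as pd

def func(arr, window):
    # pd.rolling_min was removed from pandas; Series.rolling(window).min()
    # is the current equivalent. The copy guarantees a writeable float array.
    x = pd.Series(arr).rolling(window).min().to_numpy(copy=True)
    x[:window-1] = 0
    s = x.strides[0]  # item stride in bytes (8 for float64)
    y = np.lib.stride_tricks.as_strided(x, (len(x)-window+1, window), (s, s))
    y[y[:, -1] == 1] = 1
    return x.astype(int)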

Then we have:

>>> x = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
>>> func(x, 4)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
>>> y = np.array([1,1,1,0,0,1,1,1,1,1,0,1,0,0,0,1,1,1,0,1,1,1,1]) # five 1s
>>> func(y, 4)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

On large arrays this approach is quite fast (on my system groupby is around 20x slower):

>>> x = np.random.randint(0, 2, size=1000000)
>>> %timeit func(x, 4)
10 loops, best of 3: 24.4 ms per loop
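
The same windowing idea can probably be written without pandas and without hand-computed strides. The sketch below assumes NumPy 1.20 or later, where np.lib.stride_tricks.sliding_window_view is available; the name sliding_window_app is made up for illustration, and the input is assumed to hold only 0s and 1s.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_window_app(arr, window):
    arr = np.asarray(arr)
    out = np.zeros(arr.shape[0], dtype=int)
    if arr.shape[0] < window:
        return out  # no window fits, so nothing can be kept
    # Row k is a view of arr[k:k+window]; a row of all 1s is a full window.
    full = (sliding_window_view(arr, window) == 1).all(axis=1)
    # A writeable windowed view of the output: setting row k to 1 writes
    # directly into out[k:k+window].
    out_windows = sliding_window_view(out, window, writeable=True)
    out_windows[full] = 1
    return out

x = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
print(sliding_window_app(x, 4))
```

Writing through an overlapping view is only safe here because every write stores the same constant; the NumPy documentation advises caution with writeable window views in general.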

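Comments:

Yes ajcr, I tried it and it was the slowest. The itertools solution is by far the fastest. On my dataset it takes about 13.9 ms per loop, John's takes about 153 ms per loop, and yours takes about 3 s. Thanks.

@pho: I replaced rolling_apply with rolling_min after I posted (I had pasted the wrong function, sorry). It should be noticeably faster on any larger dataset.

I also had a bug in my modified groupby function, which explains why it was so fast. I will test again tomorrow. Thanks ajcr.

@pho Added runtime tests here. The groupby approach is indeed a lot slower than the strides approach.

Answer 4: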

There may be a better solution, but I think this should work:

In [90]: x = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,1,2,1,4,4,4,4,4,0,1,1,1,1])

I added some other numbers in there in case you need the code to account for that;

In [93]: y = np.split(x, np.where(np.diff(x) != 0)[0]+1)
         z = [list(e) if len(e) >= 4 else [0]*len(e) for e in y]
         result = np.array([item for sublist in z for item in sublist])

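The first line here splits the original array into runs of consecutive equal values, the second line replaces any run with fewer than 4 items with 0s, and the last line flattens the split list.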

In [96]: result
Out[96]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 1, 1, 1, 1])

The first line of the solution also draws heavily on a previous SO answer.
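
The split, test and flatten steps described in this answer can also be sketched without a Python-level loop, by working on run lengths directly. This is one possible variant rather than the answer's own code: the function name rle_filter and the window_size parameter are introduced here for illustration.

```python
import numpy as np

def rle_filter(x, window_size):
    x = np.asarray(x)
    if x.size == 0:
        return x.copy()
    # Indices where a new run begins (the value differs from its predecessor).
    boundaries = np.flatnonzero(np.diff(x) != 0) + 1
    starts = np.concatenate(([0], boundaries))
    # Length of each run: distance from its start to the next start (or the end).
    lengths = np.diff(np.concatenate((starts, [x.size])))
    # Expand the per-run decision back to one flag per element.
    keep = np.repeat(lengths >= window_size, lengths)
    return np.where(keep, x, 0)

x = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,1,2,1,4,4,4,4,4,0,1,1,1,1])
print(rle_filter(x, 4))
```

Because the decision is based on run length rather than on the values themselves, runs of values other than 1 (the five 4s here) are kept as well.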

Comments:

Thanks for your reply. This looks elegant; I will try it on my dataset and see how it performs.

Answer 5:

A more compact variation of the itertools.groupby solution:

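import itertools

window_size = 4
groups = [list(g) for k, g in itertools.groupby(my_array)]  # my_array as in the earlier groupby answer
# Compare the run length, not its sum, so that values other than 1 also work.
filtered_array = [g if len(g) >= window_size else [0]*len(g) for g in groups]
print([int(i) for sub in filtered_array for i in sub])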

[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
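
Since the question also mentions a pandas DataFrame, the same run-grouping idea can be sketched with pandas' own groupby. This is an illustrative variant, not something posted in the thread: the helper name filter_runs and the column name 'filtered' are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def filter_runs(series, window_size):
    # A new run starts wherever a value differs from its predecessor;
    # the cumulative sum of those change points labels each run.
    run_id = (series != series.shift()).cumsum()
    # transform('size') broadcasts each run's length back onto its rows.
    run_length = series.groupby(run_id).transform('size')
    # Keep values in runs that are long enough, replace the rest with 0.
    return series.where(run_length >= window_size, 0)

df = pd.DataFrame({'x': [1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]})
df['filtered'] = filter_runs(df['x'], 4)
print(df['filtered'].tolist())
```

This keeps everything inside pandas, which may be convenient when the data already lives in a DataFrame; whether it is faster than the NumPy approaches above would need to be measured.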
