Pandas - Find and index rows that match row sequence pattern

Posted 2023-03-12

【Title】Pandas - Find and index rows that match row sequence pattern 【Posted】2018-07-20 12:23:21 【Question】:

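I would like to find a pattern in a categorical variable in a dataframe, running down the rows. I can see how to use Series.shift() to look up/down and use boolean logic to find the pattern; however, I want to do this with a grouping variable, and also flag all rows that are part of the pattern, not just the starting row.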

Code:

import pandas as pd
from numpy.random import choice, randn
import string

# df constructor
n_rows = 1000
# note: newer pandas versions prefer the lowercase alias freq='h'
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)

This returns: (output table not preserved)

I can find the start of the pattern (without grouping, though) with this:

# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0 

# flag the row at the start of the pattern
df['pat_flag'] = \
df['row_pat'].eq(p0) & \
df['row_pat'].shift(-1).eq(p1) & \
df['row_pat'].shift(-2).eq(p2) & \
df['row_pat'].shift(-3).eq(p3)

df.head(10)

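What I can't figure out is how to do this only within each 'group_var' and, instead of returning True for the start of the pattern, return True for all rows that are part of the pattern.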

Appreciate any tips on how to solve this!

Thanks...

【Comments】:

【Answer 1】:

I think you have two options: a simpler, slower solution or a faster, more complicated one.

- Use Rolling.apply and test the pattern
- Replace 0s with NaNs using mask
- Use bfill with limit (same as fillna with method='bfill') to repeat the 1s
- Then fillna the NaNs to 0
- Finally cast to bool with astype

import numpy as np

pat = np.asarray([1, 2, 2, 0])
N = len(pat)


df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
                          .apply(lambda x: (x==pat).all())
                          .mask(lambda x: x == 0) 
                          .bfill(limit=N-1)
                          .fillna(0)
                          .astype(bool)
             )
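To make the chain above easier to follow, here is a minimal sketch that runs each step separately on a seven-element Series. The names `s`, `hits`, `only_hits` and `spread` are illustrative only, and `raw=True` with an explicit `float(...)` is an assumption made here so that the function receives and returns plain numeric values.

```python
import numpy as np
import pandas as pd

pat = np.asarray([1, 2, 2, 0])
N = len(pat)
s = pd.Series([3, 1, 2, 2, 0, 1, 3])  # the pattern sits at positions 1..4

# 1.0 on the LAST row of each matching window, 0.0 on other full windows,
# NaN for the first N-1 rows (window not yet full)
hits = s.rolling(window=N, min_periods=N).apply(
    lambda x: float((x == pat).all()), raw=True)

# turn the 0.0 entries into NaN so that only the hits remain
only_hits = hits.mask(hits == 0)

# copy each hit backwards over the N-1 rows that precede it
spread = only_hits.bfill(limit=N - 1)

# remaining NaNs mean "not part of a match"
flags = spread.fillna(0).astype(bool)
print(flags.tolist())
```

The `limit=N-1` argument is what keeps the backfill from running past the first row of the matched window.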

If performance is important, use strides; the linked solution was modified:

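- Use the rolling window approach
- Compare with the pattern and return True for the windows where all values match
- Get the indices of the first occurrences with np.mgrid and indexing
- Create all indices with a list comprehension
- Compare with numpy.in1d and create the new column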

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
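As a hedged alternative to the hand-written `rolling_window` helper, NumPy 1.20 and later ship `sliding_window_view`, which builds the same windowed view. The sketch below is a swapped-in variant, not the answer's code; it assumes a small hard-coded array `arr` and replaces the list comprehension plus `np.in1d` with broadcasting.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

pat = np.asarray([1, 2, 2, 0])
N = len(pat)
arr = np.asarray([3, 1, 2, 2, 0, 1, 2, 2, 0])  # two matches, starting at 1 and 5

# read-only view of shape (len(arr) - N + 1, N); no data is copied
windows = sliding_window_view(arr, N)

# index of the first element of every matching window
starts = np.flatnonzero((windows == pat).all(axis=1))

# each start index plus the offsets 0..N-1 gives every covered position
covered = (starts[:, None] + np.arange(N)).ravel()

flags = np.zeros(len(arr), dtype=bool)
flags[covered] = True
print(starts.tolist(), flags.tolist())
```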

Another solution, thanks @divakar:

from scipy.ndimage import binary_dilation

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)

m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
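The dilation step spreads every start flag forward over the next N-1 positions. If SciPy is not available, the same spreading can be sketched with `np.convolve`; this is a swapped-in technique for illustration, not what the answer uses, and the array `arr` below is made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

pat = np.asarray([1, 2, 2, 0])
N = len(pat)
arr = np.asarray([3, 1, 2, 2, 0, 1, 2, 2, 0])

# m[i] is True when a match STARTS at position i (length len(arr) - N + 1)
m = (sliding_window_view(arr, N) == pat).all(axis=1)

# full convolution with N ones counts, for every position, how many
# matches started within the previous N-1 positions or at the position itself
counts = np.convolve(m.astype(int), np.ones(N, dtype=int))

flags = counts > 0  # same length as arr
print(flags.tolist())
```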

Timings:

import numpy as np
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage import binary_dilation
import string

np.random.seed(456)

# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c


arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)

m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)

print (df.iloc[460:480])

                date_time group_var  row_pat    values    rm0    rm1    rm2
12045 2019-06-25 21:00:00         A        3 -0.081152  False  False  False
12094 2019-06-27 22:00:00         A        1 -0.818167  False  False  False
12125 2019-06-29 05:00:00         A        0 -0.051088  False  False  False
12143 2019-06-29 23:00:00         A        0 -0.937589  False  False  False
12145 2019-06-30 01:00:00         A        3  0.298460  False  False  False
12158 2019-06-30 14:00:00         A        1  0.647161  False  False  False
12164 2019-06-30 20:00:00         A        3 -0.735538  False  False  False
12210 2019-07-02 18:00:00         A        1 -0.881740  False  False  False
12341 2019-07-08 05:00:00         A        3  0.525652  False  False  False
12343 2019-07-08 07:00:00         A        1  0.311598  False  False  False
12358 2019-07-08 22:00:00         A        1 -0.710150   True   True   True
12360 2019-07-09 00:00:00         A        2 -0.752216   True   True   True
12400 2019-07-10 16:00:00         A        2 -0.205122   True   True   True
12404 2019-07-10 20:00:00         A        0  1.342591   True   True   True
12413 2019-07-11 05:00:00         A        1  1.707748  False  False  False
12506 2019-07-15 02:00:00         A        2  0.319227  False  False  False
12527 2019-07-15 23:00:00         A        3  2.130917  False  False  False
12600 2019-07-19 00:00:00         A        1 -1.314070  False  False  False
12604 2019-07-19 04:00:00         A        0  0.869059  False  False  False
12613 2019-07-19 13:00:00         A        2  1.342101  False  False  False

In [225]: %%timeit
     ...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
     ...:                           .apply(lambda x: (x==pat).all())
     ...:                           .mask(lambda x: x == 0) 
     ...:                           .bfill(limit=N-1)
     ...:                           .fillna(0)
     ...:                           .astype(bool)
     ...:              )
     ...: 
1 loop, best of 3: 356 ms per loop

In [226]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: c = np.mgrid[0:len(b)][b]
     ...: d = [i  for x in c for i in range(x, x+N)]
     ...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
     ...: 
100 loops, best of 3: 7.63 ms per loop

In [227]: %%timeit
     ...: arr = df['row_pat'].values
     ...: b = np.all(rolling_window(arr, N) == pat, axis=1)
     ...: 
     ...: m = (rolling_window(arr, len(pat)) == pat).all(1)
     ...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
     ...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
     ...: 
100 loops, best of 3: 7.25 ms per loop

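【Comments】:

Awarding the bounty to @jezrael, as it correctly sets the flag for all members of the pattern, not just the start. It also includes 3 methods, with timings for each. As I may have 1 million rows in my case, the alternative methods will be useful. Thanks again to everyone who participated and submitted a reply!

@RandallGoodwin - Thank you. It was a very interesting question, glad to help!

Every time I need help with code/syntax, I refer to your old answers. They really help me understand pandas well, and I use your answers as a benchmark when providing solutions to other users :)

@Pygirl - Me too ;) There are a lot of good answers, but they are not easy to find ;)

【Answer 2】: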

You can use the pd.rolling() method and then simply compare the array it returns with an array containing the pattern you are trying to match.

pattern = np.asarray([1.0, 2.0, 2.0, 0.0])
n_obs = len(pattern)
df['rolling_match'] = (df['row_pat']
                       .rolling(window=n_obs , min_periods=n_obs)
                       .apply(lambda x: (x==pattern).all())
                       .astype(bool)             # All as bools
                       .shift(-1 * (n_obs - 1))  # Shift back
                       .fillna(False)            # convert NaNs to False
                       )

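Specifying min_periods is important here to ensure that you only find exact matches (so the equality check does not fail when the shapes are misaligned). The apply function does a pairwise check between the two arrays, and then we use .all() to ensure that all of them match. We convert to bool and then call shift on the result to make it a "forward-looking" indicator rather than one that only fires after the fact.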
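One detail worth checking when adapting this chain: NaN is truthy, so casting the rolling result to bool before the NaNs are filled turns the first `n_obs - 1` rows into True. The sketch below fills first and casts last, which is an assumed reordering rather than the answer's code; the Series `s` and the names `end_flag` and `start_flag` are illustrative.

```python
import numpy as np
import pandas as pd

pattern = np.asarray([1.0, 2.0, 2.0, 0.0])
n_obs = len(pattern)
s = pd.Series([1, 2, 2, 0, 3, 1])  # the pattern starts at position 0

# NaN cast to bool gives True, which is why the order of the steps matters
nan_as_bool = pd.Series([np.nan]).astype(bool)

# 1.0 on the last row of a matching window, NaN while the window is not full
end_flag = (s.rolling(window=n_obs, min_periods=n_obs)
             .apply(lambda x: float((x == pattern).all()), raw=True))

# move the flag from the last row of the window to its first row,
# then fill the NaNs and only afterwards cast to bool
start_flag = end_flag.shift(-(n_obs - 1)).fillna(0).astype(bool)
print(start_flag.tolist())
```

Like the answer, this flags only the first row of each match.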

Help on the rolling function is available here - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

【Comments】:

【Answer 3】:

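This works. It works like this: a) For each group, it takes a window of size 4 and scans the column until it finds the combination (1,2,2,0) in exact order. Once the sequence is found, it fills the corresponding index values of the new column 'pat_flag' with 1. b) If the combination is not found, it fills the column with 0.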

pattern = [1,2,2,0]
def get_pattern(df):

    df = df.reset_index(drop=True)
    df['pat_flag'] = 0

    for index, row in df.iterrows():

        # collect the window of len(pattern) values starting at this row;
        # the lists are rebuilt on every iteration so windows never mix
        get_indexes = []
        temp = []
        for j in range(index, index + len(pattern)):

            if j == df.shape[0]:
                break
            else:
                get_indexes.append(j)
                temp.append(df.loc[j, 'row_pat'])

        # check if sequence is matched and flag every row of the window
        if temp == pattern:
            df.loc[get_indexes, 'pat_flag'] = 1

    return df

# apply function to the groups
df = df.groupby('group_var').apply(get_pattern)

## snippet of output 

        date_time       group_var   row_pat     values  pat_flag
41  2018-03-13 21:00:00      C         3       0.731114     0
42  2018-03-14 05:00:00      C         0       1.350164     0
43  2018-03-14 11:00:00      C         1      -0.429754     1
44  2018-03-14 12:00:00      C         2       1.238879     1
45  2018-03-15 17:00:00      C         2      -0.739192     1
46  2018-03-18 06:00:00      C         0       0.806509     1
47  2018-03-20 06:00:00      C         1       0.065105     0
48  2018-03-20 08:00:00      C         1       0.004336     0
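The per-group scan can also be sketched without iterrows by handing each group's values to a small helper through groupby(...).transform. The helper name `flag_pattern` and the frame `toy` are hypothetical, chosen only for this illustration; the toy data is built so that the pattern also appears to straddle the A/B boundary, which the grouping is expected to reject.

```python
import numpy as np
import pandas as pd

def flag_pattern(values, pattern):
    """Return a boolean array marking every element that is part of a match."""
    values = np.asarray(values)
    pattern = np.asarray(pattern)
    n = len(pattern)
    flags = np.zeros(len(values), dtype=bool)
    # slide a window of len(pattern) over the values of one group
    for start in range(len(values) - n + 1):
        if np.array_equal(values[start:start + n], pattern):
            flags[start:start + n] = True
    return flags

toy = pd.DataFrame({
    "group_var": list("AAAAABBBBB"),
    "row_pat":   [1, 2, 2, 0, 1, 2, 2, 0, 3, 1],
})

# transform calls the helper once per group and aligns the result to toy
toy["pat_flag"] = (
    toy.groupby("group_var")["row_pat"]
       .transform(lambda s: flag_pattern(s.to_numpy(), [1, 2, 2, 0]))
       .astype(bool)
)
print(toy)
```

Rows 4 to 7 read 1, 2, 2, 0 in the flat column but belong to two different groups, so only rows 0 to 3 are flagged.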

【Comments】:

【Answer 4】:

Extending Emmet02's answer: use the rolling function for all groups and set the match column to True for all indices of matching patterns:

pattern = np.asarray([1,2,2,0])

# Create a match column in the main dataframe
# (DataFrame.assign returns a new frame and has no inplace option)
df = df.assign(match=False)

for group_var, group in df.groupby("group_var"):

    # Per group do rolling window matching, the last 
    # values of matching patterns in array 'match'
    # will be True
    match = (
        group['row_pat']
        .rolling(window=len(pattern), min_periods=len(pattern))
        .apply(lambda x: (x==pattern).all())
    )

    # Get indices of matches in current group
    idx = np.arange(len(group))[match == True]

    # Include all indices of matching pattern, 
    # counting back from last index in pattern
    idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))

    # Update matches
    match.values[idx] = True
    df.loc[group.index, 'match'] = match

df[df.match==True]

Edit: without the for loop

# Do rolling matching in group clause
match = (
    df.groupby("group_var")
    .rolling(len(pattern))
    .row_pat.apply(lambda x: (x==pattern).all())
)

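# Convert NaNs to False and keep a plain boolean array
# (relies on df already being sorted by group_var, as done above)
match = match.fillna(0).astype(bool).values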

# Get indices of matches in current group
idx = np.arange(len(df))[match]
# Include all indices of matching pattern
idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))

# Mark all indices that are selected by "idx" in match-column
df = df.assign(match=df.index.isin(df.index[idx]))
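The repeat/tile line is the least obvious step, so here is a small worked example with made-up positions. It assumes two matches whose last rows sit at positions 5 and 9.

```python
import numpy as np

pattern = np.asarray([1, 2, 2, 0])
idx = np.asarray([5, 9])  # position of the LAST row of each match

# repeat: [5 5 5 5 9 9 9 9]; tile: [0 1 2 3 0 1 2 3]
# subtracting counts back from the last row to the first row of each match
expanded = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
print(expanded.tolist())
```

Each last-row position is expanded into the four positions its window covers, listed from the end of the window backwards.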

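【Comments】:

Thanks for all the comprehensive responses! I will test over the weekend and then award the bounty based on a combination of speed and programming simplicity (my first time using a bounty... it seems to have drawn some attention! :).

【Answer 5】:

You can do this by defining a custom aggregate function, then using it in a groupby statement, and finally merging it back to the original dataframe. Something like this:

Aggregate function: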

def pattern_detect(column):
    # define any other pattern to detect here
    p0, p1, p2, p3 = 1, 2, 2, 0
    # True on each row where the pattern starts
    matches = (column.eq(p0) &
               column.shift(-1).eq(p1) &
               column.shift(-2).eq(p2) &
               column.shift(-3).eq(p3))
    # True if the pattern occurs anywhere in this column
    return matches.any()

Next, use it with groupby:

grp = df.groupby('group_var')['row_pat'].agg([pattern_detect])

Now merge it back to the original dataframe:

df = df.merge(grp, left_on='group_var',right_index=True, how='left')
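The merge above attaches one flag per group, so every row of a group that contains the pattern gets the same value. A row-level variant of the same shift idea can be sketched by shifting inside each group; this is an assumed extension, not part of the answer, and `toy`, `start` and `flag` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "group_var": list("AAAAAABBB"),
    "row_pat":   [1, 2, 2, 0, 1, 2, 2, 0, 3],
})
pattern = [1, 2, 2, 0]
grouped = toy.groupby("group_var")["row_pat"]

# start-of-pattern flag: shifting within each group keeps a window
# from running across a group boundary
start = toy["row_pat"].eq(pattern[0])
for offset in range(1, len(pattern)):
    start &= grouped.shift(-offset).eq(pattern[offset])

# spread each start flag forward over the rows the pattern covers;
# integers are used so that fill_value=0 keeps a clean dtype
start_int = start.astype(int)
flag = start.copy()
for offset in range(1, len(pattern)):
    shifted = start_int.groupby(toy["group_var"]).shift(offset, fill_value=0)
    flag |= shifted.astype(bool)

toy["pat_flag"] = flag
print(toy)
```

In the flat column, rows 4 to 7 also read 1, 2, 2, 0, but they span groups A and B, so the grouped shift leaves them unflagged.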

【Comments】:
