How to improve pattern matching on a list in Python

Posted 2023-02-25

【Title】How to improve the pattern matching on a list in Python 【Posted】2021-03-25 22:27:58 【Question】:

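I have a very large list, which may contain thousands to millions of entries. I slide a window of fixed size over the list. I need to count the matching elements in the window, then repeat the process after sliding the window forward by 1 position. Here is a simple example list: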

L = [1 2 1 3 4 5 1 2 1 2 2 2 3 ]

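Assuming the window has length 3, it captures:

[1 2 1], which contains one matching pair (1 & 1)
move the window forward by 1 position => [2 1 3], no matching elements
move the window forward by 1 position => [1 3 4], no matching elements
move the window forward by 1 position => [3 4 5], no matching elements
move the window forward by 1 position => [4 5 1], no matching elements
move the window forward by 1 position => [5 1 2], no matching elements
move the window forward by 1 position => [1 2 1], 1 matching pair (1 & 1)
move the window forward by 1 position => [2 1 2], 1 matching pair (2 & 2)
move the window forward by 1 position => [1 2 2], 1 matching pair (2 & 2)
move the window forward by 1 position => [2 2 2], 3 matching pairs ([2 2 -], [2 - 2], [- 2 2])
move the window forward by 1 position => [2 2 3], 1 matching pair (2 & 2)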

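So there are 1 + 1 + 1 + 1 + 3 + 1 = 8 matching pairs in total. I came across the idea of using itertools to generate all combinations of the elements in the window, and wrote this code to find all matching pairs: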

import itertools
L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3
totalMatch = 0
for n in range(len(L)-winlen+1):
    window = [L[n+i] for i in range(winlen)]
    A = list(itertools.combinations(window, 2))
    match = [a==b for a, b in A]
    totalMatch += sum(match)

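It works for a short list, but when the list and the window get large this code is too slow. I used C++ for years and decided to switch to Python; I would appreciate any tips on making the code more efficient.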

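【Comments】:

It is for data analysis, finding matching patterns in a large amount of data. Each data point is a record of some state of a system over time. The goal is to find matching states within a certain time span (the window).

You're just matching numbers, right?

Yes, they are all numbers (positive integers).

@JanChristophTerasa Ah... that gives me some direction. I'm not familiar with how to do that, but I'll do some research first to understand the idea ;)

@learning2learn, it asks for any possible matching pair inside the window. So for [2,2,2] the pairs are [2, 2, *], [2, *, 2] and [*, 2, 2], 3 possibilities in total. For [2,2,2,5] there are still 3 matching pairs: [2,2,*,*], [2,*,2,*], [*,2,2,5]. For [2,2,2,2] they are [2,2,*,*], [2,*,2,*], [2,*,*,2], [*,2,2,*], [*,2,*,2], [*,*,2,2], 6 matching pairs in total.

【Answer 1】: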

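How about tracking the data in the window more efficiently? This is O(|L|) instead of your O(|L|*winlen^2). It keeps the element counts of the window in ctr and the number of matches in the window in match. For example, when a new value enters the window and there are already two instances of that value in the window, you gain two new matches. Similarly, a value falling out of the window takes with it as many matches as it has other instances remaining in the window.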

from collections import Counter

L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3

totalMatch = match = 0
ctr = Counter()
for i, x in enumerate(L):
    
    # Remove old element falling out of window
    if i >= winlen:
        ctr[L[i-winlen]] -= 1
        match -= ctr[L[i-winlen]]

    # Add new element to window
    match += ctr[x]
    ctr[x] += 1

    # Update the total (for complete windows)
    if i >= winlen - 1:
        totalMatch += match

print(totalMatch)
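
To see the incremental update at work, here is a minimal sketch that applies the same bookkeeping but also reports the match count of every complete window. The helper name trace_windows is hypothetical and not part of the answer; it only wraps the loop above in a generator.

```python
from collections import Counter

def trace_windows(L, winlen):
    """Yield (window, match) for every complete window, updating incrementally."""
    ctr = Counter()
    match = 0
    for i, x in enumerate(L):
        if i >= winlen:
            old = L[i - winlen]
            ctr[old] -= 1
            match -= ctr[old]  # the leaving value loses one pair per remaining copy
        match += ctr[x]        # the new value pairs with each copy already present
        ctr[x] += 1
        if i >= winlen - 1:
            yield L[i - winlen + 1 : i + 1], match

L = [1, 2, 1, 3, 4, 5, 1, 2, 1, 2, 2, 2, 3]
per_window = list(trace_windows(L, 3))
for window, match in per_window:
    print(window, match)
total = sum(m for _, m in per_window)
print(total)  # 8
```

The per-window counts come out as 1, 0, 0, 0, 0, 0, 1, 1, 1, 3, 1, which agrees with the walkthrough in the question.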

Benchmark results with L and winlen multiplied by 20:

 38.75 ms  original
  0.18 ms  Manuel

 38.73 ms  original
  0.19 ms  Manuel

 38.87 ms  original
  0.18 ms  Manuel

Benchmark code (also includes test code covering all lists of the numbers 1 to 3 with lengths 0 to 9):

from timeit import repeat
import itertools
from itertools import product
from collections import Counter

def original(L, winlen):
    totalMatch = 0
    for n in range(len(L)-winlen+1):
        window = [L[n+i] for i in range(winlen)]
        A = list(itertools.combinations(window, 2))
        match = [a==b for a, b in A]
        totalMatch += sum(match)
    return totalMatch

def Manuel(L, winlen):
    totalMatch = match = 0
    ctr = Counter()
    for i, x in enumerate(L):
        if i >= winlen:
            ctr[L[i-winlen]] -= 1
            match -= ctr[L[i-winlen]]
        match += ctr[x]
        ctr[x] += 1
        if i >= winlen - 1:
            totalMatch += match
    return totalMatch

def test():
    for n in range(10):
        print('testing', n)
        for T in product([1, 2, 3], repeat=n):
            L = list(T)
            winlen = 3
            expect = original(L, winlen)
            result = Manuel(L, winlen)
            assert result == expect, (L, expect, result)

def bench():
    L = [1,2,1,3,4,5,1,2,1,2,2,2,3] * 20
    winlen = 3 * 20
    for _ in range(3):
        for func in original, Manuel:
            t = min(repeat(lambda: func(L, winlen), number=1))
            print('%6.2f ms ' % (t * 1e3), func.__name__)
        print()

test()
bench()

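【Comments】:

This isn't much faster, if at all.

@jbflow What makes you think that? It clearly is.

My mistake, I left the print in there while timing it. On this input it is about twice as fast.

@jbflow It's more than twice as fast. You can't have timed it for the case where "the list and window get large".

@MarkM Should be fixed now, added test code. Thanks.

【Answer 2】: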

You can use np.bincount in a for loop, determine the number of combinations for each number/bin, and add that to the total.

import numpy as np

L = [1, 2, 1, 3, 4, 5, 1, 2, 1, 2, 2, 2, 3]
winlen = 3

L = np.array(L) # convert to array to speed up indexing

total = 0
for i in range(len(L) - winlen + 1):
    bc = np.bincount(L[i:i+winlen]) # bincount on the window
    bc = bc[bc>1] # get rid of all single and empty values
    bc = bc * (bc-1) // 2 # gauss addition, number of combinations of each number
    total += np.sum(bc) # sum up combinations, and add to total

print(total)
# 8
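
The bc * (bc-1) // 2 step relies on the fact that a value occurring k times in a window contributes k*(k-1)/2 matching pairs. As a small sanity check of that formula, the sketch below compares it with brute-force enumeration on single windows. It uses collections.Counter as a pure-Python stand-in for np.bincount (a substitution made here, not the answer's code), and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pairs_by_formula(window):
    # A value that occurs k times contributes k*(k-1)//2 matching pairs
    return sum(k * (k - 1) // 2 for k in Counter(window).values())

def pairs_by_enumeration(window):
    # Brute force: compare every pair of positions in the window
    return sum(a == b for a, b in combinations(window, 2))

for window in ([2, 2, 2], [2, 2, 2, 5], [2, 2, 2, 2], [1, 2, 1]):
    print(window, pairs_by_formula(window), pairs_by_enumeration(window))
```

Both functions give 3, 3, 6 and 1 for these windows, matching the counts discussed in the comments on the question.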

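【Comments】:

This is interesting. I'm reading up on all the commands used to understand how it works.

Note that this is about 10 times slower than @Manuel's solution for large input sequences.

Do we have the same complexity? (I don't know numpy well (a link to the bincount documentation would help :-))

bincount is essentially like Counter, but without the dictionary overhead. It counts how often each integer occurs in a sequence, so it has to iterate over the sequence and keep counts. This should be O(n * m), where n is the sequence length and m is the window length. It is consistently much slower than your solution, but faster than the original one.

Ah, I see that L[i:i+winlen] already makes its complexity worse. So it's unclear what you mean by "10 times slower", since that only makes sense when the complexity classes are the same.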
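
To put rough numbers on the complexity classes mentioned in these comments, the sketch below counts elementary steps instead of measuring time. The function names are hypothetical, and the cost model is a deliberate simplification (an assumption): it ignores constant factors, for example that numpy does its per-window work in C.

```python
def pair_comparisons_original(n, m):
    # n - m + 1 windows, each comparing all m*(m-1)//2 pairs: O(n * m^2)
    return max(n - m + 1, 0) * (m * (m - 1) // 2)

def reads_per_window_recount(n, m):
    # n - m + 1 windows, each sliced and counted in full: O(n * m)
    return max(n - m + 1, 0) * m

def updates_incremental(n, m):
    # every element enters once; all but the last m also leave once: O(n)
    return n + max(n - m, 0)

n, m = 13 * 20, 3 * 20  # list and window sizes used in the benchmark of Answer 1
print(pair_comparisons_original(n, m))  # 355770
print(reads_per_window_recount(n, m))   # 12060
print(updates_incremental(n, m))        # 460
```

Under this model the gap between the per-window recount and the incremental update grows with the window length, which is why a fixed "10 times slower" figure only describes one particular input size.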
