Efficient way to find longest duplicate string for Python (From Programming Pearls)

Posted 2023-02-19


[Title]: Efficient way to find longest duplicate string for Python (From Programming Pearls)
[Posted]: 2012-11-13 15:29:57
[Question]:

From Section 15.2 of Programming Pearls.

The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

When I implement it in Python using a suffix array:

example = open("iliad10.txt").read()
def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i

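I found that the idx.sort step is very slow. I think it is slow because Python has to pass the substrings by value instead of by pointer (as the C code above does).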
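
For readers on Python 3, where list.sort no longer accepts a cmp argument, the same naive approach might be sketched as below. This is only an illustration, not the asker's code: the name longest_duplicate_naive is made up here, and every key s[i:] is still a full copy of the suffix, so it is suitable for small inputs only.

```python
def comlen(p, q):
    """Length of the common prefix of p and q."""
    i = 0
    for x, y in zip(p, q):
        if x != y:
            break
        i += 1
    return i


def longest_duplicate_naive(s):
    # Hypothetical helper name. Each key s[i:] is a fresh copy of the
    # suffix, so time and memory grow quadratically with len(s).
    idx = sorted(range(len(s)), key=lambda i: s[i:])
    best_len, best_pos = 0, 0
    # Neighbours in sorted suffix order share the longest common prefixes.
    for a, b in zip(idx, idx[1:]):
        n = comlen(s[a:], s[b:])
        if n > best_len:
            best_len, best_pos = n, a
    return s[best_pos:best_pos + best_len]


print(longest_duplicate_naive("banana"))  # ana
```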

The test file can be downloaded from here

The C code needs only 0.3 seconds to finish.

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s

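But the Python code never finishes on my computer (I waited 10 minutes and then killed it).

Does anyone have an idea how to make the code efficient (for example, under 10 seconds)?

[Comments]:

How long does the C code take? How long does your code take?

@tjameson The C code takes 0.3 seconds. I don't know how long my code takes because it never finishes (at least 10 minutes).

The C code is slow because it fails to keep track of the "longest match so far" while sorting, and has to check everything a second time. The Python is slow for the same reason, plus because it operates on strings rather than pointers to strings, plus because it's Python.

example[a:] copies a string each time (O(N)). So your sort is O(N*N*logN). For the Iliad that is ~10**12 operations, which is slow.

Since Programming Swines, err, sorry, Pearls, relies heavily on various forms of undefined, unspecified and implementation-defined behavior, you cannot easily translate code from it to another language that does not have the same kind of unspecified behavior.

[Answer 1]: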

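My solution is based on suffix arrays. The suffix array is constructed by prefix doubling, and the longest common prefix (LCP) array is built from it. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can easily be modified, e.g. to search for the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicate strings are longer than 10000 characters.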

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can easily be modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of length _step are first pre-sorted. The results are
    then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_step + 20 if _step > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp

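I prefer this solution over the more complicated O(n log n) one because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the necessary linear-time operations in the method from that article, which should be O(n) only under the very special assumptions of random strings and a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.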
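
The prefix-doubling idea can also be shown in a much smaller form. The sketch below is not the answer's implementation: it re-sorts the whole array with list.sort on every round instead of skipping already-resolved groups, and the name suffix_array_doubling is chosen here for illustration. It assumes a str input.

```python
def suffix_array_doubling(text):
    """Suffix array by prefix doubling: order by the first k, 2k, 4k... chars."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # rank by first character
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Rank of the first k chars, then rank of the next k chars;
            # -1 stands for "past the end" and sorts before any real rank.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)            # Timsort does the heavy lifting
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            # Equal keys keep the same rank; a different key starts a new one.
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            return sa
        k *= 2


print(suffix_array_doubling("banana"))  # [5, 3, 1, 0, 4, 2]
```

Each round doubles the number of characters that the ranks account for, so at most about log2(n) rounds are needed; with a comparison sort per round that gives the O(n (log n)^2) bound.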

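The code in the other answer based on grow_chains is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical of classical literature, but they are frequent, e.g., in collections of "independent" school homework. The program should not freeze on such input.

I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.

[Discussion]:

The link above to the example with tests is broken. Could you update it?

I fixed the links to my code and to the original C by pasting my copies.

[Answer 2]: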

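The main problem seems to be that Python slices by copying: https://***.com/a/5722068/538551

You have to use a memoryview to get a reference instead of a copy. When I did this, the program hung after the idx.sort call (which was very fast).

I'm sure that with a little work you can get the rest working.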
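
The difference between a slice that copies and a slice that only references the buffer can be checked directly. The snippet below is a small Python 3 sketch (there memoryview needs a bytes-like object, not a str); the sample text is taken from the output shown in the question.

```python
data = b"On this the rest of the Achaeans"
mv = memoryview(data)

suffix_copy = data[8:]   # bytes slice: allocates a new bytes object
suffix_view = mv[8:]     # memoryview slice: a view onto the same buffer

# The view is still backed by the original object, nothing was copied.
assert suffix_view.obj is data
# The contents are the same as those of the copied slice.
assert suffix_view.tobytes() == suffix_copy
print(len(suffix_view))
```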

EDIT:

~~The change above does not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:~~

#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
    return 0;
}

And compare the result with this Python:

test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))

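The C code prints -3 on my machine, while the Python version prints -1. It looks like the example C code is abusing the return value of strcmp (it is used in qsort, after all). I couldn't find any documentation on when strcmp returns something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed many values outside that range (3, -31, 5 were the first three values).

To make sure that -3 wasn't some error code, if we swap test1 and test2 we get 3.

EDIT:

The above is interesting trivia, but it is not actually correct in terms of affecting either chunk of code. I realized this just as I shut my laptop and left the wifi zone... I really should double-check everything before I hit Save.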
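
Why the magnitude is harmless: the C standard only promises that strcmp returns a value less than, equal to, or greater than zero, and a sort only looks at that sign. The Python 3 sketch below illustrates this with functools.cmp_to_key; strcmp_like and sign_only are names invented for the example, and the difference of code points is just one plausible way for a strcmp to produce values such as -3.

```python
from functools import cmp_to_key


def strcmp_like(a, b):
    """Return negative, zero or positive, like C strcmp (not just -1/0/1)."""
    for x, y in zip(a, b):
        if x != y:
            return ord(x) - ord(y)
    return len(a) - len(b)


def sign_only(a, b):
    """Same ordering, but squashed to -1, 0 or 1 like Python 2's cmp."""
    d = strcmp_like(a, b)
    return (d > 0) - (d < 0)


words = ["rovided", "ovided", "Archive.", "Archive", "by"]
by_magnitude = sorted(words, key=cmp_to_key(strcmp_like))
by_sign = sorted(words, key=cmp_to_key(sign_only))
assert by_magnitude == by_sign == sorted(words)

print(strcmp_like("ovided", "rovided"), sign_only("ovided", "rovided"))  # -3 -1
```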

FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):

print(cmp(memoryview(test1), memoryview(test2)))

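I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.

[Discussion]:

Thanks, tjameson! But even with memoryview, you still need to pass a string to cmp, right? Doesn't that still pass by value?

This doesn't work, since cmp cannot be used on memoryview objects.

Bentley's code does not abuse strcmp. It just uses it to compare strings in qsort, which in turn relies on nothing but the sign of the return value.

@larsmans - As mentioned in my comment, I realized this about 5 minutes after posting, right about when I stopped staring at the code... Revising the answer.

memoryview comparison doesn't work. See the example in my answer

[Answer 3]: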

A translation of the algorithm into Python:

from itertools import imap, izip, starmap, tee
from os.path   import commonprefix

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)

buffer() allows getting a substring without copying:

def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]

iliad.mb.txt takes 5 seconds on my machine.

In principle, it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array together with an lcp array.
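
As an illustration of the lcp-array part, the sketch below computes the LCP array from an existing suffix array in linear time, using the technique usually attributed to Kasai et al. It is not code from this answer: the function names are chosen here, and the suffix array in the demo is built with a naive sort, so only the LCP step itself is linear.

```python
def lcp_from_suffix_array(text, sa):
    """lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i]."""
    n = len(text)
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i               # inverse permutation of sa
    lcp = [0] * n
    h = 0
    for pos in range(n):            # visit suffixes in text order
        if rank[pos] > 0:
            prev = sa[rank[pos] - 1]
            while pos + h < n and prev + h < n and text[pos + h] == text[prev + h]:
                h += 1
            lcp[rank[pos]] = h
            if h > 0:
                h -= 1              # the next suffix shares at least h - 1 chars
        else:
            h = 0
    return lcp


def longest_duplicate_from_lcp(text, sa):
    lcp = lcp_from_suffix_array(text, sa)
    if not lcp:
        return ""
    i = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[i]:sa[i] + lcp[i]]


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive, demo only
print(lcp_from_suffix_array(text, sa))     # [0, 1, 3, 0, 0, 2]
print(longest_duplicate_from_lcp(text, sa))  # ana
```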

Note: the *_memoryview() versions are deprecated in favor of the *_buffer() version.

A more memory-efficient version (compared to longest_duplicate_small()):

def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()

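iliad.mb.txt takes 17 seconds on my machine. The result is:

On this the rest of the Achaeans with one voice were for respecting the priest and taking the ransom that he offered; but not so Agamemnon, who spoke fiercely to him and sent him roughly away.

I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces a wrong result in Python 2: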

>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
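
On Python 3 the same workaround might look like the sketch below. It is an adaptation, not the answer's original code: sorted no longer takes cmp, so the comparator is wrapped with functools.cmp_to_key, and the small b"banana" input is only for demonstration.

```python
from functools import cmp_to_key


def cmp_memoryview(a, b):
    """Lexicographic three-way comparison of two byte views."""
    for x, y in zip(a, b):          # iterating a bytes view yields ints
        if x != y:
            return -1 if x < y else 1
    return (len(a) > len(b)) - (len(a) < len(b))


data = b"banana"
mv = memoryview(data)

try:
    mv[0:] < mv[1:]
    ordering_supported = True
except TypeError:                   # raised when '<' is not supported
    ordering_supported = False

# Sort the suffix views without copying the underlying bytes.
suffixes = sorted((mv[i:] for i in range(len(mv))),
                  key=cmp_to_key(cmp_memoryview))
print(ordering_supported, [s.tobytes() for s in suffixes])
```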

Related questions:

Find the longest repeating string and the number of times it repeats in a given string

finding long repeated substrings in a massive string

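[Discussion]:

Since your code requires Python 3.+ and I currently do not have access to that version, could you please also provide the running time of my version of the code in your environment?

@lenik: The code works on Python 2.7. What makes you think it is for Python 3?

Could you please stop arguing about irrelevant things and just provide the running time?

@lenik: In case you can't run both Python 2.7 and 3, here's the running time: 12 seconds.

Side note: the reason it produces incorrect results on Python 2 (and an exception on Py3) is that memoryview only defines the equivalent of __eq__ and __ne__, not the rest of the rich comparison operators. On Py2, this means it falls back to the comparison of last resort (which ends up comparing the memory addresses of the objects, completely useless), while Python 3 tells you that the comparison is not supported. There is a bug open to fix this, but it has seen no action in the past five years.

[Answer 4]:

This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm: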

#!/usr/bin/env python

ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()

    chains[s].append(a)

def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break

            # add one more character
            s = string + ex[p + offset]

            if s not in new_chains :
                new_chains[s] = list()

            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)

    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]

    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]

    chains = grow_chains(chains)

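The basic idea is to create a list of substrings and the positions where they occur, which eliminates the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member is grown by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...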
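
The grow-and-filter loop can be condensed into a function that keeps the last non-empty set of repeated substrings. This is a Python 3 sketch written for illustration, not the answer's code: the name longest_repeats_by_growing is made up, it starts from single characters rather than pairs, and it has the same time and memory behaviour on long repeats as the script above.

```python
from collections import defaultdict


def longest_repeats_by_growing(text):
    """Return {substring: positions} for the longest repeated substrings."""
    chains = defaultdict(list)
    for pos, ch in enumerate(text):
        chains[ch].append(pos)
    best = {}
    while True:
        # Filter: drop substrings that occur only once.
        repeated = {s: p for s, p in chains.items() if len(p) > 1}
        if not repeated:
            return best             # the previous round held the longest ones
        best = repeated
        # Grow: extend every surviving occurrence by one character.
        chains = defaultdict(list)
        for s, positions in repeated.items():
            end = len(s)
            for p in positions:
                if p + end < len(text):
                    chains[s + text[p + end]].append(p)


print(longest_repeats_by_growing("ABCxABCyDEFzDEF"))
# {'ABC': [0, 4], 'DEF': [8, 12]}
```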

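[Discussion]:

It doesn't return anything if several repeated substrings have the same maximal length. Example: ex = 'ABCxABCyDEFzDEF'

@hynekcer The last set is always empty (that is the loop's stop condition), but the one before it contains ['ABC', 'DEF']. I don't see why this is wrong? My code has obvious limitations: it prints only the first 3 chains, and if there are more you have to modify the code or something. Pretty printing was never my goal.

I expected the results to end up in the chains variable, but they are lost. Debug prints are not important for the algorithm.

@hynekcer Debug prints help to understand how it works. If you only need the answer, save the filtered result in a temporary variable and, when it is empty, print whatever you have in chains. This should work fine for any number of substrings of any length.

The bigger problem is that your algorithm can require more than N * N / 4 bytes of memory, where N is the length of the input string. Example: ex = ' '.join('%03s' % i for i in range(500)). I print sum(len(string) for string in chains) and see that the maximum is 1001000. The required time is proportional to N * N * N.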
