Filter a Set for Matching String Permutations

Posted 2023-03-28


【Title】: Filter a Set for Matching String Permutations 【Posted】: 2017-12-05 02:13:34 【Question】:

I am trying to use itertools.permutations() to generate all permutations of a string and return only those that are members of a set of words.

import itertools

def permutations_in_dict(string, words): 
    '''
    Parameters
    ----------
    string : str
    words : set

    Returns
    -------
    list : list of str    

    Example
    -------
    >>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
    ['act', 'cat']
    '''

My current solution works fine in the terminal, but somehow it fails the test case...

return list(set([''.join(p) for p in itertools.permutations(string)]) & words)

Any help would be greatly appreciated.

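【Discussion】:

What exactly is the test case? If you are comparing the result to ['act', 'cat'], maybe you need to ignore the ordering and build a set.

Indeed, my output was ['cat','act'], which does not match ['act','cat']. The order of a set is arbitrary, right? So how can I ignore or match it? @JacquesKvam

How is it failing? Time might be a problem: creating all permutations of string explodes quickly with the length of the string. Or, if order matters, you may need to apply sorted(...) to the result.

I just posted a comparative analysis of the various approaches. It turns out that for small len(string), @Meruemu's approach of searching the permutations of the target string with set intersection is the fastest. For a somewhat larger len(string), sort-and-compare is best. Because of hashing overhead, the Counter/multiset solution is suboptimal in all normal cases. However, if all the input strings are very large, the Counter/multiset approach eventually beats sort-and-compare.

Sorting the result is one way to satisfy the test case. However, if that is required, the description should say so. On the other hand, if the order of the results is arbitrary, you could change the test case to apply set to the result. Either way, you should ask the client for clarification.

【Answer 1】: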

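Problem category

The problem you are solving is best described as testing for anagram matches.

Solution using sort

The traditional solution is to sort the target string, sort the candidate string, and test for equality.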

>>> def permutations_in_dict(string, words):
        target = sorted(string)
        return sorted(word for word in words if sorted(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
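
If the same set of words is searched for many different target strings, the sorted-letter signature used above could also serve as a dictionary key, so that each query becomes a single lookup. The following is only a sketch of that idea and is not part of the original answer; it assumes Python 3, and the helper names build_anagram_index and lookup are hypothetical.

```python
from collections import defaultdict

def build_anagram_index(words):
    """Group words by their sorted-letter signature (hypothetical helper)."""
    index = defaultdict(list)
    for word in words:
        index[''.join(sorted(word))].append(word)
    return index

def lookup(index, string):
    """Return the sorted anagram matches for `string` from a prebuilt index."""
    # .get() avoids inserting an empty entry for signatures that are absent.
    return sorted(index.get(''.join(sorted(string)), []))

index = build_anagram_index({'cat', 'rat', 'dog', 'act'})
print(lookup(index, 'act'))   # ['act', 'cat']
print(lookup(index, 'god'))   # ['dog']
```

Building the index costs one sort per word up front, so it probably only pays off when there are several queries against the same words.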

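Solution using a multiset

Another approach is to use collections.Counter() for a multiset equality test. This is algorithmically superior to the sort solution (O(n) versus O(n log n)), but it tends to lose unless the strings are large (because of the cost of hashing all the characters).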

>>> from collections import Counter
>>> def permutations_in_dict(string, words):
        target = Counter(string)
        return sorted(word for word in words if Counter(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
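
To see why a multiset (rather than a plain set) is the right comparison here, a small sketch assuming Python 3 may help: two strings can share the same distinct letters while differing in how often each letter occurs.

```python
from collections import Counter

# Same distinct letters, different multiplicities: not anagrams.
print(set('aab') == set('abb'))          # True  (a set forgets the counts)
print(Counter('aab') == Counter('abb'))  # False (a multiset keeps them)

# True anagrams have identical letter counts.
print(Counter('act') == Counter('cat'))  # True
```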

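Solution using a perfect hash

A unique anagram signature, or perfect hash, can be constructed by multiplying together the prime numbers that correspond to each possible character in a string.

The commutative property of multiplication guarantees that the hash value is invariant under any permutation of a single string. The uniqueness of the hash value is guaranteed by the fundamental theorem of arithmetic (also known as the unique prime factorization theorem).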

>>> from functools import reduce  # reduce() is a builtin only on Python 2
>>> from operator import mul
>>> primes = [2, 3, 5, 7, 11]
>>> primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
>>> anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))
>>> def permutations_in_dict(string, words):
        target = anagram_hash(string)
        return sorted(word for word in words if anagram_hash(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
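
As a worked illustration of the prime-product idea, here is a simplified sketch. It is not the answer's code: it assumes Python 3.8+ (for math.prod), handles lowercase ASCII letters only, and maps a to 2, b to 3, c to 5, and so on. The names first_primes, PRIME_FOR and anagram_signature are made up for this example.

```python
from math import prod  # Python 3.8+
from string import ascii_lowercase

def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # candidate is prime if no smaller prime divides it evenly.
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Assumption: lowercase a-z only, so 26 primes are enough.
PRIME_FOR = dict(zip(ascii_lowercase, first_primes(26)))

def anagram_signature(s):
    """Multiply the primes assigned to each character of s."""
    return prod(PRIME_FOR[c] for c in s)

print(anagram_signature('act'))  # 2 * 5 * 71 = 710
print(anagram_signature('cat'))  # 710, same letters in a different order
print(anagram_signature('rat'))  # 2 * 61 * 71 = 8662
```

Because Python integers have arbitrary precision, the product never overflows, but it does grow with the length of the string, which is one plausible reason this approach loses to sorting in the timings.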

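Solution using permutations

When the string is small, it is reasonable to use itertools.permutations() to search over the permutations of the target string (generating the permutations of a string of length n produces n factorial candidates).

The good news is that when n is small and the number of words is large, this approach runs very fast (because a set membership test is O(1)):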

>>> from itertools import permutations
>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(word for word in words if word in perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

As the OP surmised, the pure Python search loop can be sped up to C speed by using set.intersection():

>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(words & perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

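Best solution

Which solution is best depends on the length of string and the size of words. Timings will show which is best for a particular problem.

Here are some comparative timings for the various approaches using two different string sizes: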

Timings with string_size=5 and words_size=1000000
-------------------------------------------------
0.01406    match_sort
0.06827    match_multiset
0.02167    match_perfect_hash
0.00224    match_permutations
0.00013    match_permutations_set

Timings with string_size=20 and words_size=1000000
--------------------------------------------------
2.19771    match_sort
8.38644    match_multiset
4.22723    match_perfect_hash
<takes "forever"> match_permutations
<takes "forever"> match_permutations_set

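The results show that for small strings, the fastest approach is to search the permutations of the target string using set intersection.

For larger strings, the fastest approach is the traditional sort-and-compare solution.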
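
The "forever" entries for the permutation-based methods follow directly from factorial growth. A short sketch, assuming Python 3, shows the candidate counts for the two string sizes in the timings, and also that repeated letters produce duplicate permutations that the set() call collapses.

```python
from itertools import permutations
from math import factorial

# Number of candidate permutations for the two string sizes timed above.
print(factorial(5))    # 120
print(factorial(20))   # 2432902008176640000

# Repeated letters yield duplicate tuples, which set() removes.
print(len(list(permutations('aab'))))               # 6
print(len(set(map(''.join, permutations('aab')))))  # 3
```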

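Hope you found this little algorithmic study as interesting as I did. The take-aways are:

- Sets, itertools, and collections make short work of problems like this.
- Big-O running times matter (n factorial blows up for large n).
- Constant overhead matters (sorting beats multisets because of hashing overhead).
- Discrete mathematics is a treasure trove of ideas.
- It is hard to know what is best until you do the analysis and run timings :-)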

Timing setup

FWIW, here is the test setup I used to run the comparative timings:

from collections import Counter
from itertools import permutations
from string import letters
from random import choice
from operator import mul
from time import time

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def match_multiset(string, words):
    target = Counter(string)
    return sorted(word for word in words if Counter(word) == target)

primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))

def match_perfect_hash(string, words):
    target = anagram_hash(string)
    return sorted(word for word in words if anagram_hash(word) == target)

def match_permutations(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(word for word in words if word in perms)

def match_permutations_set(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(words & perms)

string_size = 5
words_size = 1000000

population = letters[: string_size+2]
words = set()
for i in range(words_size):
    word = ''.join([choice(population) for i in range(string_size)])
    words.add(word)
string = word                # Arbitrarily use the last word as the target

print 'Timings with string_size=%d and words_size=%d' % (string_size, words_size)
for func in (match_sort, match_multiset, match_perfect_hash, match_permutations, match_permutations_set):
    start = time()
    func(string, words)
    end = time()
    print '%-10.5f %s' % (end - start, func.__name__)
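
The setup above is Python 2 code (print statements, string.letters, builtin reduce). For readers on Python 3, a possible adaptation is sketched below. This is an assumption-laden port rather than the script that produced the timings: it swaps in string.ascii_letters, functools.reduce and time.perf_counter, uses a smaller words_size so it finishes quickly, and stores each function's result in a results dict (not in the original) so the outputs can be compared.

```python
from collections import Counter
from functools import reduce
from itertools import permutations
from operator import mul
from random import choice
from string import ascii_letters   # Python 3 replacement for string.letters
from time import perf_counter      # higher-resolution timer than time()

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def match_multiset(string, words):
    target = Counter(string)
    return sorted(word for word in words if Counter(word) == target)

# Same prime table as the original: one prime per 8-bit character code.
primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]

def anagram_hash(s):
    return reduce(mul, (primes[ord(c)] for c in s), 1)

def match_perfect_hash(string, words):
    target = anagram_hash(string)
    return sorted(word for word in words if anagram_hash(word) == target)

def match_permutations(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(word for word in words if word in perms)

def match_permutations_set(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(words & perms)

string_size = 5
words_size = 100000   # assumption: reduced from 1000000 to keep the run short

population = ascii_letters[:string_size + 2]
words = set()
for _ in range(words_size):
    word = ''.join(choice(population) for _ in range(string_size))
    words.add(word)
string = word         # arbitrarily use the last word as the target

results = {}          # added for this sketch: keep outputs for comparison
print('Timings with string_size=%d and words_size=%d' % (string_size, words_size))
for func in (match_sort, match_multiset, match_perfect_hash,
             match_permutations, match_permutations_set):
    start = perf_counter()
    results[func.__name__] = func(string, words)
    elapsed = perf_counter() - start
    print('%-10.5f %s' % (elapsed, func.__name__))
```

Absolute numbers will differ from the table above, since both the interpreter and the input size differ; only the relative ordering is meaningful.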

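【Discussion】:

Agreed, in this case sorted beats Counter().

This is a great answer. I wish all answers were this good. A little disappointed by the Python 2 print/formatting, though... (just kidding!)

【Answer 2】:

You can simply use collections.Counter() to compare the words against string, without creating all the permutations (whose number explodes with the length of the string):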

from collections import Counter

def permutations_in_dict(string, words):
    c = Counter(string)
    return [w for w in words if c == Counter(w)]

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['cat', 'act']

Note: sets are unordered, so if you need a specific order you may need to sort the result, e.g. return sorted(...)
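
Since the failing test in the question came down to ordering, here is a minimal sketch (assuming Python 3) of two ways a caller or test could compare the result without depending on set iteration order. The variable name result is introduced only for this example.

```python
from collections import Counter

def permutations_in_dict(string, words):
    c = Counter(string)
    return [w for w in words if c == Counter(w)]

result = permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})

# Option 1: compare as sets, ignoring order entirely.
print(set(result) == {'act', 'cat'})     # True

# Option 2: sort first, then compare against an alphabetically ordered list.
print(sorted(result) == ['act', 'cat'])  # True
```

Which option is appropriate depends on whether the expected output is specified as ordered, which the question leaves open.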

【Discussion】:

【Answer 3】:

Apparently you want the output sorted alphabetically, so this should do it:

return sorted(set(''.join(p) for p in itertools.permutations(string)) & words)

【Discussion】:

【Answer 4】:

Try this solution

list(map("".join, itertools.permutations('act')))
['act', 'atc', 'cat', 'cta', 'tac', 'tca']

We can call this listA

listA = list(map("".join, itertools.permutations('act')))

Your list is listB

listB = ['cat', 'rat', 'dog', 'act']

Then use set intersection

list(set(listA) & set(listB))
['cat', 'act']

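【Discussion】:

Why perform the extra step of converting everything to a list? You might as well use the set literal {'cat', 'rat', 'dog', 'act'} and omit list(...) for listA.

【Answer 5】:

Why bother with permutations at all? This is a much simpler problem if you treat each word as a dictionary of letters. I'm sure there is a comprehension that would do this better, but: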

    letters = dict()
    for i in word:
      letters[i] = letters.get(i, 0) + 1

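Do this for the word, then do it for each word in the set, making sure the value for each key is greater than or equal to the value for that key in the word. If it is, add it to your output.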
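
One way the letter-dictionary idea could be turned into a complete function is sketched below, assuming Python 3. Note one deliberate change from the description above: for an exact permutation match the letter counts need to be equal, so this sketch compares the two dictionaries with == instead of checking greater-than-or-equal, which would also accept longer words such as 'tact'. The helper name letter_counts is hypothetical.

```python
def letter_counts(word):
    """Count occurrences of each letter using a plain dict."""
    letters = dict()
    for ch in word:
        letters[ch] = letters.get(ch, 0) + 1
    return letters

def permutations_in_dict(string, words):
    target = letter_counts(string)
    # Equality (not >=) so that only exact rearrangements match.
    return sorted(w for w in words if letter_counts(w) == target)

print(permutations_in_dict('act', {'cat', 'rat', 'dog', 'act', 'tact'}))
# ['act', 'cat']
```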

Bonus: if your word list is very long, this should be easy to parallelize.

【Discussion】:
