Optimizing finding matching substring between the two lists by regex in Python

Posted: 2019-08-08 18:04:24

Question:

This is my method of finding substrings in a list of "phrases": I search through a list of "words" and return the matching substrings found in each element of the phrase list.

import re

def is_phrase_in(phrase, text):
    # whole-word, case-insensitive match
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)

# (desired and actual) output
[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]

Since the list of "words" (list_to_search) has ~1,700 items and the list of "phrases" (list_to_be_searched) has ~26,561, the code takes more than 30 minutes to finish. I don't think my code above is written in a Pythonic way or uses efficient data structures. :(

Can anyone suggest how to optimize or speed this up?

Thanks!

Actually, the example I wrote above is not quite right. What if 'list_to_search' has elements of two or more words?

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)
# (desired and actual) output
[['hello my'],
 ['name', 'is'],
 ['name', 'is'],
 [],
 ['name', 'is', 'is your name', 'your'],
 ['name', 'is']]

Timing the first method:

%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
#43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Second method (nested list comprehension and re.findall):

%%timeit
[[j for j in list_to_search if j in re.findall(r"\b{}\b".format(j), i)] for i in list_to_be_searched]
#40.3 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The timing does improve, but is there a faster way? Or is this task inherently slow, given what it does?

Comments:

Answer 1:

You can use a nested list comprehension:

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

[[j for j in list_to_search if j in i.split()] for i in list_to_be_searched]

[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]

Comments:

Maybe convert list_to_search to a set first, and use re.findall with \b instead of split.
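
One way to read that suggestion, as a rough sketch (it only covers single-word search terms, like the answer above; the variable names are reused from the question):

import re

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

# O(1) membership tests instead of scanning a list
search_set = set(list_to_search)

# \w+ pulls out word-boundary-delimited tokens, so punctuation does not stick
# to them the way it can with str.split(); matches come back in phrase order
# rather than in list_to_search order
result = [[w for w in re.findall(r"\w+", phrase.lower()) if w in search_set]
          for phrase in list_to_be_searched]

print(result)
# [['my'], ['name', 'is'], ['is', 'name'], ['you'], ['is', 'your', 'name'], ['my', 'name', 'is']]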

Answer 2:

While the most straightforward and clearest approach is a list comprehension, I wanted to see whether a regex could do better.

Using a regex against each item in list_to_be_searched does not seem to give any performance gain. But joining list_to_be_searched into one big blob of text and matching it against a regex pattern built from list_to_search is slightly faster:

In [1]: import re
   ...:
   ...: list_to_search = ['my', 'name', 'is', 'you', 'your']
   ...: list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
   ...:
   ...: def simple_method(to_search, to_be_searched):
   ...:   return [[j for j in to_search if j in i.split()] for i in to_be_searched]
   ...:
   ...: def regex_method(to_search, to_be_searched):
   ...:   word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:   blob = '\n'.join(to_be_searched)
   ...:   phrases = word.findall(blob)
   ...:   return [phrase.split(' ') for phrase in ' '.join(phrases).split('\n ')]
   ...:
   ...: def alternate_regex_method(to_search, to_be_searched):
   ...:   word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:   phrases = []
   ...:   for item in to_be_searched:
   ...:     phrases.append(word.findall(item))
   ...:   return phrases
   ...:

In [2]: %timeit -n 100 simple_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.1 µs per loop

In [3]: %timeit -n 100 regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 18.6 µs per loop

In [4]: %timeit -n 100 alternate_regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.4 µs per loop

To see how this performs on large inputs, I used the 1,000 most common English words, one word at a time, as list_to_search, and the full text of David Copperfield from Project Gutenberg, one line at a time, as list_to_be_searched:

In [5]: book = open('/tmp/copperfield.txt', 'r+')

In [6]: list_to_be_searched = [line for line in book]

In [7]: len(list_to_be_searched)
Out[7]: 38589

In [8]: words = open('/tmp/words.txt', 'r+')

In [9]: list_to_search = [word for word in words]

In [10]: len(list_to_search)
Out[10]: 1000

Here are the results:

In [15]: %timeit -n 10 simple_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 31.9 s per loop

In [16]: %timeit -n 10 regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.28 s per loop

In [17]: %timeit -n 10 alternate_regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.43 s per loop

So if you are keen on performance, use either of the regex methods. Hope that helps! :)

Comments:

Thanks for the detailed answer! This is really helpful. But can regex_method also capture multi-word phrases?
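
For what it's worth, the alternation idea can be extended to multi-word entries by escaping each phrase and trying longer phrases before their prefixes. A rough sketch against the question's second example (the name multiword_regex_method is introduced here for illustration and is not part of the answer):

import re

def multiword_regex_method(to_search, to_be_searched):
    # longest alternatives first, so 'is your name' wins over 'is';
    # re.escape keeps any regex metacharacters in the phrases literal
    ordered = sorted(to_search, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, ordered)) + r')\b')
    return [pattern.findall(item) for item in to_be_searched]

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

print(multiword_regex_method(list_to_search, list_to_be_searched))
# [['hello my'], ['name', 'is'], ['is', 'name'], [], ['is your name'], ['name', 'is']]
# note: a single pass returns non-overlapping matches in text order, so
# 'what is your name' yields only 'is your name', not every overlapping term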
