Read dict values as regex, return matches

Posted 2023-02-23


【Title】: Read dict values as regex, return matches 【Posted】: 2018-04-08 04:59:25 【Question】:

Solution provided below. Thanks, @ekhumoro! I have a Python dictionary whose values are lists of terms:

myDict = {
    'ID_1': [r'(dog|cat[a-z+]|horse)', r'(car[a-z]+|house|apple\w)', r'(bird|tree|panda)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w|lion)'],
    'ID_3': [r'(wagon|tiger|cat\w*)'],
    'ID_4': [r'(dog)'],
}

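I want to read each list item in the values as a separate regular expression. Whenever one of them matches something in a text, the matched text should be returned as a key in a separate dictionary, with the original keys (the IDs) as its value. So if these terms were read as regexes and searched against this string: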

"dog panda cat cats pandas car carts"

The general approach I have in mind is something like this (pseudocode):

for key, value in myDict.items():
    for item in value:
        if re.compile(item) matches something in the text:
            newDict[match] = [list of keys]

The expected output is:

newDict = {
    'car': ['ID_1'],
    'carts': ['ID_1'],
    'dog': ['ID_1', 'ID_4'],
    'panda': ['ID_1', 'ID_2'],
    'pandas': ['ID_1', 'ID_2'],
    'cat': ['ID_1', 'ID_3'],
    'cats': ['ID_1', 'ID_3'],
}

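A matched text should be returned as a key in newDict only if it actually matched something in the body of text. So in the output, "carts" is listed because a regex in the ID_1 values matched it, and that ID is therefore listed in the output dictionary.

Solution: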

import re
from collections import defaultdict

text = """
the eye of the tiger
a doggies in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
the cationic cataclysm
the pandamonious panda pandas
      """

myDict = {
    'ID_1': [r'(dog\w+|cat\w+|horse)', r'(car|house|apples)',
             r'(bird|tree|panda\w+)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w+|lion)'],
    'ID_3': [r'(wagon|tiger|cat)'],
    'ID_4': [r'(dog)'],
}

# Map each matched string to the list of IDs whose patterns matched it.
newDict = defaultdict(list)

for key, values in myDict.items():
    for pattern in values:
        # finditer yields every non-overlapping match of the pattern in text
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)

for item in newDict.items():
    print(item)
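
One thing to note about the loop in this solution: `re.finditer` yields every occurrence, so a pattern that hits the same text more than once appends its ID more than once. With the sample text, the `ID_3` pattern finds `cat` three times (in `cat`, `cationic` and `cataclysm`). Below is a minimal sketch of one way to avoid the repeats, assuming each ID should appear at most once per matched string. The helper name `collect_matches` and the shortened sample data are made up for illustration and are not part of the original solution.

```python
import re
from collections import defaultdict


def collect_matches(patterns_by_id, text):
    """Map each matched string to the sorted list of IDs that matched it.

    Hypothetical helper (not from the original post): IDs are collected
    in sets so that repeated hits do not produce duplicate entries.
    """
    found = defaultdict(set)
    for key, patterns in patterns_by_id.items():
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                found[match.group(0)].add(key)
    # Convert the sets to sorted lists for stable, printable output.
    return {matched: sorted(ids) for matched, ids in found.items()}


# Shortened sample data, assumed for this example only.
sample_text = "the cat in the hat\nthe cationic cataclysm\na doggies in the manger"
sample_patterns = {
    'ID_1': [r'(dog\w+|cat\w+|horse)'],
    'ID_3': [r'(wagon|tiger|cat)'],
    'ID_4': [r'(dog)'],
}

result = collect_matches(sample_patterns, sample_text)
for item in result.items():
    print(item)
```

Here `cat` ends up with a single `ID_3` entry even though the pattern matched it three times.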

【Comments on the question】:

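Can you provide an example of the expected output?

@scharette newDict is the output I'm hoping to achieve. To give more context: the values of myDict contain lists of regular expressions. They are run against a body of text and, in the end, only the matches of those regexes should be returned. Sorry for the confusion and for not giving more information in the question, and thanks to everyone who has already answered. Unfortunately, this can't be done with simple string formatting; it has to be done by running the terms as regular expressions.

Why is there no car or apple in the newDict output?

@AndyHayden I've added more information to the question.

【Answer 1】: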

Here is a simple script that seems to do what you want:

import re
from collections import defaultdict

text = """
the eye of the tiger
a dog in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
"""

myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}

newDict = defaultdict(list)

for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)

for item in newDict.items():
    print(item)

Output:

('dog', ['ID_1', 'ID_4'])
('cat', ['ID_1', 'ID_3'])
('horse', ['ID_1', 'ID_2'])
('bird', ['ID_1'])
('tiger', ['ID_3'])

【Comments】:

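This works perfectly. Thank you very much for the quick reply. I modified it a bit to get all instances of a pattern in the text, rather than only the first, by inserting the following: "if match is not None: for g in match: screen = re.search(pattern, g) newDict[screen.group(0)].append(key)".

@J_Micks. I did wonder about that, but it wasn't clear from your question. I have revised my answer so that it gets all matches for each pattern.

@ekhumoro: Out of curiosity: could this be done with a dict comprehension?

@Jan. Not really. Multiple patterns can match the same thing, so the output dict needs to be updated continually as new matches are found. A dictcomp would overwrite any previous matches. I suppose it could be done by using side effects on a separate dict, but I'd say that doesn't really count as a dictcomp.

【Answer 2】: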

One approach is to convert the regexes into plain lists, for example with string manipulation:

In [11]: {id_: "|".join(ls).replace("(", "").replace(")", "").split("|") for id_, ls in myDict.items()}
Out[11]:
{'ID_1': ['dog',
  'cat',
  'horse',
  'car',
  'house',
  'apples',
  'bird',
  'tree',
  'panda'],
 'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
 'ID_3': ['wagon', 'tiger', 'cat'],
 'ID_4': ['dog']}

You can turn this into a DataFrame:

In [12]: import pandas as pd
    ...: from collections import Counter

In [13]: pd.DataFrame({id_: Counter("|".join(ls).replace("(", "").replace(")", "").split("|")) for id_, ls in myDict.items()}).fillna(0).astype(int)
Out[13]:
          ID_1  ID_2  ID_3  ID_4
apples       1     0     0     0
bird         1     0     0     0
building     0     1     0     0
car          1     0     0     0
cat          1     0     1     0
computer     0     1     0     0
dog          1     0     0     1
horse        1     1     0     0
house        1     0     0     0
lion         0     1     0     0
panda        1     1     0     0
tiger        0     0     1     0
tree         1     0     0     0
wagon        0     0     1     0

【Comments】:

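Hey Andy, the items in the lists need to be searched for in some body of text, and the IDs they were originally linked to should be returned only if they end up matching something in that text. Very sorry I didn't provide this important information sooner, and thank you very much for taking the time to reply!

@J_Micks Please update your question with example regexes. Why is there a list of regexes (it only needs to match one in the list)? The question isn't particularly clear.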
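
On the question of why each ID holds a list of regexes when any one of them matching is enough: a list can be collapsed into a single alternation per ID, which leaves one compiled pattern to run per ID. This is only a sketch under that assumption. The names `combined` and `matches_by_text` and the shortened sample data are illustrative, and wrapping each pattern in a non-capturing group `(?:...)` is a choice made here to keep the alternatives separate, not something taken from the thread.

```python
import re
from collections import defaultdict

# Shortened sample data, assumed for this example only.
text = "the eye of the tiger\na dog in the manger\na bird in the hand"
myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}

# One compiled regex per ID: the list items become alternatives.
combined = {
    key: re.compile("|".join(f"(?:{pattern})" for pattern in patterns))
    for key, patterns in myDict.items()
}

# Same accumulation as before, but with a single scan of the text per ID.
matches_by_text = defaultdict(list)
for key, regex in combined.items():
    for match in regex.finditer(text):
        matches_by_text[match.group(0)].append(key)

print(dict(matches_by_text))
```

A combined pattern reports non-overlapping matches only. If two list items could match at the same position, only the first alternative is reported, unlike running the patterns one by one.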
