How to replace a pattern using regex in Python?

Posted 2023-02-23

【Title】How to replace a pattern using regex in Python? 【Published】2017-08-18 16:08:58 【Question】

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so that it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

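Every name has to be changed to the class it belongs to. I noticed that in the dataset each new class in the list is marked by '###'. So I can split the dataset into blocks at '###' and count the occurrences of '###', then use a regex to find the names and replace them with the '###' count.

My code looks like this: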

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

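This doesn't seem to work: nothing gets replaced.

【Comments】

Possible duplicate of "Python string.replace regular expression".
If you are actually using curly braces, that is not valid Python syntax. Are you programming in Word or something?
What do you mean? Oh no, sorry, I copied my code into here from a text file. Silly me.
Hi, the answers on the post you suggested were not particularly helpful to me.
Close your file!

【Answer 1】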

When I run the code you provided, I get the following traceback:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

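The error occurs because type(match) evaluates to a list. When I inspected that list in PDB, it was empty. That happens because match is assigned in the first for loop but used in the second: once the first loop finishes, match holds only the result for the last line, which in the sample data is ### and contains no Name= field. So let's combine the two loops: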

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))
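To see why match came back empty when the two loops were separate, here is a minimal sketch. The three-item blocks list is a hypothetical stand-in for the file contents, not the asker's actual data loading.

```python
import re

pattern = r'Name=(.*?)[;/]'
# Hypothetical stand-in for the lines read from the file
blocks = ['Male    Name=Tony;', 'Female  Name=Nina;', '###']

for line in blocks:
    match = re.findall(pattern, line)

# After the loop, match holds only the result for the final line ('###'),
# which has no Name= field, so the list is empty
print(match)
```

Any code that runs after the loop therefore sees just that one last value, whatever the earlier lines matched.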

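Now match has content, but there is still a problem: re.findall returns a list of strings, while str.replace(...) expects a string as its first argument.

You could cheat and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))), but that assumes you will find a regex match on every line that is not ###. A more resilient approach is to check whether there is a match before calling str.replace().

The final code looks like this: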
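As a small illustration of the type mismatch described above, the following sketch runs the question's pattern on a single sample line. The variable names matches and replaced are made up for this example.

```python
import re

pattern = r'Name=(.*?)[;/]'
line = 'Male    Name=Tony;'

# re.findall returns a list of the captured groups, not a single string
matches = re.findall(pattern, line)
print(matches)

# A line without a Name= field yields an empty list
print(re.findall(pattern, '###'))

# str.replace needs a str, so index into the list only after checking it
if matches:
    replaced = line.replace(matches[0], 'Class_1')
else:
    replaced = line
print(replaced)
```

Checking the list before indexing avoids an IndexError on lines such as ###.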

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)

Two more things:

triple_hash_count, hash_count

line.replace(match, prefix + str(triple_hash_count))

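【Comments】

This solution also replaces the '.1' and so on.
Your answer is correct, but the OP's regex needs tweaking to handle lines with a trailing '.1', '.2', etc.
@PaulBack I noticed you made a change in your post. But I would recommend pattern = r'Name=([^\.\d;]*)' so that it does not swallow the period between the name and the uniqueness counter.
Nice catch. I made the change.

【Answer 2】

The root of the problem is the use of a second loop (along with poorly named variables). This will work.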

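import re

# Read all lines, stripping surrounding whitespace; the with block closes the file
with open('/file', 'r') as f:
    blocks = [b.strip() for b in f]

# Capture only the name, stopping at a period, digit or semicolon so ".1" is kept
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        # Skip lines without a usable Name= value instead of raising IndexError
        if match and match[0]:
            line = line.replace(match[0], prefix + str(triple_hash_count))
        print(line)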
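The same loop can be tried without a file. The sketch below is a variation, not the answerer's exact code: it uses a few sample lines from the question, collects the output in a list, and replaces only the captured span via match.span(1) rather than calling str.replace. The names lines, result and hash_count are illustrative.

```python
import re

# Sample lines taken from the question; a real run would read them from a file
lines = ['Male    Name=Tony;', 'Female  Name=Alice.1;', '###', 'Female  Name=Alex.2;']

pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
hash_count = 1
result = []

for line in lines:
    if line == '###':
        hash_count += 1
        result.append(line)
        continue
    match = re.search(pattern, line)
    if match:
        # Replace only the captured name span, so a suffix such as ".1" survives
        start, end = match.span(1)
        line = line[:start] + prefix + str(hash_count) + line[end:]
    result.append(line)

print('\n'.join(result))
```

Slicing by span touches only the text after Name=, which sidesteps the case where the name also appears elsewhere on the line.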

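【Comments】

The \d and the * in the regex [^\.\d;]* are not needed! This: r'=(.*?)[\.;]' covers everything required.

【Answer 3】

Although you already have an answer, you can do this in a couple of lines with regular expressions (it could even be a one-liner, but that is not very readable):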

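import re

# `string` is assumed to already hold the full text of the dataset
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

# {} receives the class number; \g<1> re-inserts the optional ".N" suffix
# (an unmatched group expands to an empty string on Python 3.5+)
new_string = '###'.join([namerx.sub(r"Name=Class_{}\g<1>;".format(idx + 1), part)
                for idx, part in enumerate(hashrx.split(string))])
print(new_string)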

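What it does:

First, it splits the string on lines that consist only of ###. The MULTILINE flag makes ^ and $ match at the start and end of every line.
Then, in each resulting part, it replaces every Name entry, numbering the part with the index supplied by enumerate().
Finally, it joins the parts back together with ###.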
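The split, substitute and join steps can be checked on a small in-memory sample. This sketch is a variation on the answer, assuming a short hypothetical text in place of the real file contents. It uses a replacement function instead of a template string so that a missing ".N" suffix is handled explicitly; the helper rename is made up for this example.

```python
import re

# Hypothetical sample text standing in for the contents of the data file
text = """Male    Name=Tony;
Female  Name=Alice.1;
###
Female  Name=Alex.2;
###
"""

hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

def rename(part, class_no):
    # Keep the optional ".N" suffix (group 1) and replace only the name itself
    return namerx.sub(
        lambda m: 'Name=Class_{}{};'.format(class_no, m.group(1) or ''), part)

parts = hashrx.split(text)
new_text = '###'.join(rename(part, idx + 1) for idx, part in enumerate(parts))
print(new_text)
```

Because the parts are joined with the same ### separator they were split on, the line structure of the input is preserved.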

As a one-liner (not recommended, though):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\g<1>;".format(idx + 1), part)
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

Demo

A demo says more than a thousand words.
