Replace all consecutive repeated letters, ignoring specific words

Posted 2023-02-23


【Title】: Replace all consecutive repeated letters ignoring specific words
【Asked】: 2020-12-22 16:47:15
【Question】:

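I have seen many suggestions for removing consecutive repeated letters from a sentence in Python with re (regex) or .join, but I want to make an exception for certain words.

For example:

I want this sentence: sentence = 'hello, join this meeting heere using thiis lllink'

to become: 'hello, join this meeting here using this link'

given that I have this list of words to keep, which should be skipped by the repeated-letter check: keepWord = ['Hello','meeting']

The two scripts I found useful are:

Using .join: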

import itertools

sentence = ''.join(c[0] for c in itertools.groupby(sentence))

Using regex:

import re

sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
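To make the problem concrete, here is a small sketch that runs both one-liners on the example sentence. The variable names by_groupby and by_regex are only illustrative. It shows that either approach, applied to the whole sentence, also collapses the letters in the words that should be kept.

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# Collapse every run of identical characters with itertools.groupby.
by_groupby = ''.join(key for key, _ in itertools.groupby(sentence))

# Same idea with a backreference: a character followed by one or more copies of itself.
by_regex = re.sub(r'(.)\1+', r'\1', sentence)

print(by_groupby)  # helo, join this meting here using this link
print(by_regex)    # helo, join this meting here using this link
```

Both results turn 'hello' into 'helo' and 'meeting' into 'meting', which is why an exception list is needed.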

I have a solution, but I think there is a more compact and efficient one. My current solution is:

import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']

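new_words = []

for word in sentence.split():
    # Compare without surrounding punctuation so that 'hello,' still matches 'hello'.
    if word.strip('.,;:!?') not in keepWord:
        # Collapse each run of identical characters to a single character.
        word = ''.join(c[0] for c in itertools.groupby(word))
    new_words.append(word)

new_sentence = ' '.join(new_words)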

Any suggestions?

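【Comments】:

What do you expect if Hellllo appears?

Well, my proposal does not handle that case; it could be solved by ignoring the first occurrence of the letter in the else branch.

【Answer 1】: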

You can match the whole words from the keepWord list and replace sequences of two or more identical letters only in other contexts:

import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link

See the Python demo.

The regex looks like

\b(?:hello|meeting)\b|([^\W\d_])\1+

See the regex demo. If Group 1 matched, its value is returned; otherwise the whole match (a word to keep) is put back.

Pattern details

\b(?:hello|meeting)\b - hello or meeting enclosed in word boundaries
| - or
([^\W\d_]) - Group 1: any Unicode letter
\1+ - one or more backreferences to the Group 1 value
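The pattern can be wrapped in a small helper, sketched below. The function name collapse_repeats and its ignore_case parameter are hypothetical and not part of the answer. The sketch adds two assumptions of its own: re.escape is applied in case a keep word contains regex metacharacters, and case-insensitive matching is switched on through the flags argument of re.sub. It also assumes keep_words is non-empty.

```python
import re

def collapse_repeats(text, keep_words, ignore_case=True):
    """Collapse runs of a repeated letter, leaving words in keep_words untouched.

    Hypothetical helper; assumes keep_words contains at least one word.
    """
    # re.escape guards against keep words that contain regex metacharacters.
    keep = '|'.join(re.escape(w) for w in keep_words)
    pattern = rf"\b(?:{keep})\b|([^\W\d_])\1+"
    flags = re.IGNORECASE if ignore_case else 0
    # Group 1 is set only for a repeated-letter run; otherwise a keep word matched.
    return re.sub(pattern, lambda m: m.group(1) or m.group(), text, flags=flags)

print(collapse_repeats('Hello, join this meeting heere using thiis lllink',
                       ['hello', 'meeting']))
# Hello, join this meeting here using this link
```

With ignore_case=False the capitalised 'Hello' no longer matches the keep word 'hello', so it is collapsed to 'Helo'.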

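【Comments】:

Great, this gives exactly the expected output. Thanks

@Aisha If you need a case-insensitive search, add (?i) at the start of the regex pattern. Alternatively, pass the flags argument to re.sub: re.sub(..., ..., sentence, flags=re.I)

【Answer 2】: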

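It is not particularly compact, but here is a fairly simple example using regular expressions: the function subst replaces repeated characters with a single character, and re.sub calls it for every word it finds.

Your example keepWord list (where it is first mentioned) has Hello in title case, while the text has hello in lower case, so it is assumed here that you want a case-insensitive comparison against the list. It therefore works equally well whether your sentence contains Hello or hello.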

import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello','meeting']

keepWord_s = set(word.lower() for word in keepWord)

def subst(match):
    word = match.group(0)
    return word if word.lower() in keepWord_s else re.sub(r'(.)\1+', r'\1', word)

print(re.sub(r'\b.+?\b', subst, sentence))

This gives:

hello, join this meeting here using this link
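One thing to keep in mind with the pattern \b.+?\b is that it also passes the whitespace or punctuation between words, as well as digits, to subst, so a double space or a number such as 100 would be collapsed too. If that is not wanted, a possible variant (a sketch under that assumption, not part of the answer) is to match only runs of letters with [^\W\d_]+. The sample text and the name keep_words below are made up for illustration.

```python
import re

keep_words = {'hello', 'meeting'}  # lower-cased words to leave untouched

def subst(match):
    word = match.group(0)
    if word.lower() in keep_words:
        return word
    # Collapse each run of a repeated character to one character.
    return re.sub(r'(.)\1+', r'\1', word)

text = 'hello,  join meeting 100 heere!!!'
# [^\W\d_]+ matches runs of letters only, so spaces, digits and punctuation
# are never passed to subst.
result = re.sub(r'[^\W\d_]+', subst, text)
print(result)  # hello,  join meeting 100 here!!!
```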
