How do I replace asterix * with html tags using regex

Posted 2023-02-23

【Title】: How do I replace asterix * with html tags using regex 【Posted】: 2021-09-08 05:41:17 【Question】:

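I'm trying to use a regex to replace each * in a string with an <em> or </em> tag.

For example, My *name* is John should output My <em>name</em> is John

However, if two asterisks are adjacent (**), I don't want them replaced with <em></em>.

I have the following code. The problem is that when I run it, it replaces ** with <em> and </em>. I want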

Hello *there* are two aster**es next to each other

to output

Hello <em>there</em> are two aster**es next to each other

but I get

Hello <em>there</em> are two aster<em></em>es next to each other

My code:

import re


def emphasis(string):
    regex = re.compile(r'(\s?)\*(.*?)\*(\s?)')
    return re.sub(regex, replace_function, string)


def replace_function(input):
    match = input.group()
    if match:
        return re.sub(r'(\s?)\*(.*?)\*(\s?)', r'\1<em>\2</em>\3', match)

My test:

def test_if_two_asterix_are_next_to_each_other(self):
        title = "Hello *there* are two aster**es next to each other"
        expected = "Hello <em>there</em> are two aster**es next to each other"
        actual = _emphasis(title)
        self.assertEqual(actual,expected)

The test fails, and I get:

Hello <em>there</em> are two aster<em></em>es next to each other

【Comments】:

What should the output be for *Hel lo * there* ar**e two *ast***er*es ne*xt to **each ot**her***********?

【Answer 1】:

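A markdown library is probably the most suitable solution here.

As far as the regex goes, though, the problem is that the leading and trailing delimiters are the same character. When you try to match one or more characters other than that character, you may pick up the trailing * of a previous unsuccessful match and match up to the leading * of the next match.

So the simplest regex solution is to match two or more consecutive * characters first, and in all other contexts match a *, then any zero or more characters other than *, then a *. Capture what is between the two asterisks and wrap it with the tags you want in a callable passed as the replacement argument: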

import re

pattern = r"\*{2,}|\*([^*]*)\*"
text = "Hello *there* are two aster**es next to each other"
print(re.sub(pattern, lambda x: f'<em>{x.group(1)}</em>' if x.group(1) else x.group(), text))
# => Hello <em>there</em> are two aster**es next to each other

See the Python demo.
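To make the difference between the two approaches concrete, here is a minimal sketch that puts the question's pattern and the alternation pattern side by side. The names `EMPHASIS_RE`, `NAIVE_RE`, `emphasis` and `naive_emphasis` are only illustrative and do not come from the original posts.

```python
import re

# Alternation pattern from this answer: a run of two or more asterisks is
# matched first (and left alone); otherwise *...* with no asterisk inside
# is matched and its content captured in group 1.
EMPHASIS_RE = re.compile(r"\*{2,}|\*([^*]*)\*")

# The pattern from the question, kept here only for comparison.
NAIVE_RE = re.compile(r"(\s?)\*(.*?)\*(\s?)")


def emphasis(text):
    """Wrap *single-asterisk* spans in <em> tags, leaving ** runs untouched."""
    def wrap(match):
        inner = match.group(1)
        # group(1) is None when the "two or more asterisks" branch matched.
        return match.group() if inner is None else f"<em>{inner}</em>"
    return EMPHASIS_RE.sub(wrap, text)


def naive_emphasis(text):
    """The question's approach: ** becomes an empty <em></em> pair."""
    return NAIVE_RE.sub(r"\1<em>\2</em>\3", text)


sample = "Hello *there* are two aster**es next to each other"
print(naive_emphasis(sample))
# Hello <em>there</em> are two aster<em></em>es next to each other
print(emphasis(sample))
# Hello <em>there</em> are two aster**es next to each other
```

Because the `\*{2,}` branch comes first in the alternation, the regex engine consumes runs of asterisks before the emphasis branch gets a chance to treat them as an empty span.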

【Comments】:

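Nice answer! More generally, you can match two or more of the preceding token with {min,}. That may be more useful here: \*{2,}|\*([^*]*)\*

Thanks - this was very helpful and well explained. I really appreciate it. The only problem I still have is that now, if I have e.g. the string: String with *valid* and invalid* *emphasis I get the result: String with <em>valid</em> and invalid<em> </em>emphasis instead of what I want: String with <em>valid</em> and invalid* *emphasis I think it has to do with the space between the two *. Do you know how I can fix this? Many thanks for your help!

@LaraThompson You need to explain the matching rules in the question. How do you define a valid *? How do we find the invalid *s? Or do you want to skip two *s that have one or more whitespace characters between them? Try pattern = r"\*{2,}|\*\s+\*|\*([^*]*)\*" or even pattern = r"\*(?:\s*\*)+|\*([^*]*)\*"

Sorry, good point, I probably wasn't clear enough. A valid * is: - one that occurs at the start/end of a word, with a matching * at the start/end of a word, e.g. This *is* valid - An invalid * is one with no matching *, e.g. This invalid, or several * in a row, e.g. This re*y invalid. Hope this helps - sorry, I'm new to regex!

@LaraThompson So, word boundaries? \*{2,}|\B\*\B|\b\*\b|\*\b([^*]*)\b\*?

【Answer 2】: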

If I understand your question, you're trying to convert Markdown to HTML.

The simplest way to get the desired result is:

Run in a terminal

$ pip install markdown

In your Python program:

$ python3
Python 3.9.5
>>> import markdown
>>> title = "Hello *there* are two aster**es next to each other"
>>> expected = markdown.markdown(title)
>>> expected
'<p>Hello <em>there</em> are two aster**es next to each other</p>'

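Sometimes it is hard to build a regex that covers every corner case. You found one corner case yourself (the ** in your question). There are many others that none of us even know exist, and someone has already built tools to handle them.

That said, if you want to learn more about RegEx, there are plenty of resources online.

Good luck!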

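P.S. You may notice that the markdown function wraps the output in <p></p>, which may be undesirable. Hint: I don't think there is a way to suppress this automatically (see this). In a sense, the solution to your problem brings another problem. Still, removing <p> and </p> from the output is arguably much easier than finding the right regex for the problem you describe. Hint 2: you can remove the first 3 and the last 4 characters from the output with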

expected = markdown.markdown(title)[3:-4:]
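Blind slicing assumes the output always starts with <p> and ends with </p>. A slightly more defensive variant might check for the wrapper first. This is only a sketch: `strip_paragraph` is a hypothetical helper, not part of the markdown library, and the literal string below stands in for the result of markdown.markdown(title).

```python
def strip_paragraph(html):
    """Remove one surrounding <p>...</p> pair if present; otherwise return html unchanged."""
    if html.startswith("<p>") and html.endswith("</p>"):
        return html[len("<p>"):-len("</p>")]
    return html


# Stand-in for the output of markdown.markdown(title).
rendered = "<p>Hello <em>there</em> are two aster**es next to each other</p>"
print(strip_paragraph(rendered))
# Hello <em>there</em> are two aster**es next to each other

# Input without the paragraph wrapper is returned as-is.
print(strip_paragraph("<h1>Title</h1>"))
# <h1>Title</h1>
```

This only handles a single paragraph. Output containing several <p> blocks would need different treatment.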

【Comments】:
