How to fix this regex for mentions and hashtags?

Posted 2023-04-13

Asked: 2018-08-24 18:41:35

Question:

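I built a working regex for mentions and hashtags using an online regex tool. It already matches the text I want, but I still need to solve the following matching problems.

Match only substrings that are preceded and followed by whitespace. A substring at the very start or end of the string (whether hashtag or mention) is also valid and should be accepted.

The match returned by the regex should contain only the part without whitespace (the whitespace is part of the rule, not part of the substring).

The regex I am using is: (([@]{1}|[#]{1})[A-Za-z0-9]+)

Some example strings with valid and invalid matches: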

"@hello friend" - @hello must be matched as a mention.
"@ hello friend" - here there should be no matches.
"hey@hello @hello" - here only the last @hello must be matched as a mention.
"@hello! hi @hello #hi ##hello" - here only the second @hello and #hi must be matched as a mention and hashtag respectively.

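Another example, shown in the image, where only "@word" should be a valid mention:

Update 16:35 (GMT-4) 3/15/18

I found a solution to the problem by using the tool in PCRE mode (server) with a negative lookbehind and a negative lookahead: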

(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])

Here are the matches:

But now I have a doubt: do negative lookahead and negative lookbehind work in C# regexes? In JavaScript, for example, the pattern does not work; the tool underlines it in red.
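The lookaround pattern from the update can be checked against the example strings listed earlier. The following is a minimal sketch in Python, used here only because it is easy to run; the question itself targets C#, and .NET's regex engine also supports both negative lookahead and negative lookbehind. The helper name `find_tags` is an assumption made for illustration, and the pattern is written with the `{1}` quantifiers spelled out.

```python
import re

# Pattern from the question's update:
# (?<![^\s]) : the tag must not be preceded by a non-whitespace character
# (?![^\s])  : the tag must not be followed by a non-whitespace character
# Both assertions are zero-width, so whitespace never becomes part of the match.
TAG_PATTERN = re.compile(r"(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])")


def find_tags(text):
    """Return (start_index, tag) pairs for every mention or hashtag in text."""
    return [(m.start(), m.group(0)) for m in TAG_PATTERN.finditer(text)]


examples = [
    "@hello friend",
    "@ hello friend",
    "hey@hello @hello",
    "@hello! hi @hello #hi ##hello",
]
for text in examples:
    print(repr(text), "->", find_tags(text))
```

Because the assertions consume nothing, the reported start index is the position of the `@` or `#` itself, which is what is needed to highlight the tags without shifting indices afterwards.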

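Discussion:

Different regex engines support different features, so when you use an online tester you must make sure it runs the engine of the language in which you will use the regex. This also means that some regexes cannot be reused across languages.

Answer 1:

Try this pattern: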

(?:^|\s+)(?:(?<mention>@)|(?<hash>#))(?<item>\w+)(?=\s+)

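Here is the breakdown:

(?: - creates a non-capturing group

^|\s+ - matches the start of the string or whitespace

(?: - creates a non-capturing group

(?<mention>@)|(?<hash>#) - creates groups that match @ or #, named mention and hash respectively

(?<item>\w+) - matches one or more word characters (letters, digits, underscore); naming the group makes it easy to pull the item out of the match

(?=\s+) - creates a positive lookahead that requires whitespace to follow

Fiddle: Live Demo

You then need to use the host language to trim the returned matches and remove any leading/trailing whitespace.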
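As an alternative to trimming, the named groups can be read directly, since the `item` group never contains the consumed whitespace. The sketch below is in Python rather than .NET, so the named groups are spelled `(?P<name>...)` instead of `(?<name>...)`; the function name `parse_tags` and the shape of its return value are assumptions made for illustration.

```python
import re

# Answer 1's pattern, adapted to Python's (?P<name>...) named-group syntax.
PATTERN = re.compile(
    r"(?:^|\s+)(?:(?P<mention>@)|(?P<hash>#))(?P<item>\w+)(?=\s+)"
)


def parse_tags(text):
    """Return (kind, item, start) triples; start is the index of the @ or #."""
    results = []
    for m in PATTERN.finditer(text):
        kind = "mention" if m.group("mention") else "hash"
        # m.start("item") is the first character after the @ or #, so
        # subtracting 1 skips any leading whitespace the match consumed.
        results.append((kind, m.group("item"), m.start("item") - 1))
    return results


print(parse_tags("@hello hi #tag end"))
```

Note that `(?=\s+)` requires whitespace after the tag, so with this pattern a tag at the very end of the string (for example `"hi @end"`) is not matched.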

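Update: Since you mentioned that you are using C#, I thought I would offer a .NET solution that solves your problem without RegEx. I have not benchmarked it, but I expect it to be faster than using RegEx as well.

Personally, my flavor of .NET is Visual Basic, so I am giving you a VB.NET solution. You can easily run it through a converter, because I never use anything that cannot be used in C#: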

Private Function FindTags(ByVal lead As Char, ByVal source As String) As String()
    Dim matches As List(Of String) = New List(Of String)
    Dim current_index As Integer = 0

    'Loop through all but the last character in the source
    For index As Integer = 0 To source.Length - 2
        'Reset the current index
        current_index = index

        'Check that the current character is the lead ("@" or "#"), that it sits at the start of the String or follows whitespace,
        'and that the next character is a letter or digit (otherwise input such as "@!" or "@ " at the end of the String would be matched)
        If source(index) = lead AndAlso (index = 0 OrElse Char.IsWhiteSpace(source, index - 1)) AndAlso Char.IsLetterOrDigit(source, index + 1) Then
            'Loop until the next character is no longer a letter or digit
            Do
                current_index += 1
            Loop While current_index + 1 < source.Length AndAlso Char.IsLetterOrDigit(source, current_index + 1)

            'Check if we're at the end of the line or the next character is whitespace
            If current_index = source.Length - 1 OrElse Char.IsWhiteSpace(source, current_index + 1) Then
                'Add the match to the collection
                matches.Add(source.Substring(index, current_index + 1 - index))
            End If
        End If
    Next

    Return matches.ToArray()
End Function

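Fiddle: Live Demo

Discussion:

It almost works. The problem is that it still matches when you type some characters directly before the hashtag or mention. The solution should not do that.

@Andrespengineer - I have updated the pattern; see whether it works for you now.

Now the leading whitespace of the substring is included in the match. I updated my question with a possible answer that apparently works. The problem with stripping whitespace after finding the matches is that I would have to perform (N * M) operations, where N is the number of times the input changes and M is the length of the string. I need to return the indices of the matches through Linq so that I can draw them in C#; if I alter the matches found, the indices change as well, and I would need extra work to shift the list of indices. Thanks, you helped me a lot.

@Andrespengineer - Would you consider doing all of the matching in C#? It would be much easier, and I can provide you with a solution.

@Andrespengineer - Take another look at my latest update.

Answer 2:

You can wrap your existing regex in whitespace or start/end-of-line anchors.

^ - start

$ - end

\s - whitespace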

(^|\s+)(([@]{1}|[#]{1})[A-Za-z0-9]+)(\s+|$)
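One consequence of consuming the surrounding whitespace is worth illustrating: the space that ends one match cannot also start the next, so adjacent tags can be skipped. The sketch below, in Python and on a made-up input string, compares this pattern with a zero-width lookaround variant of the same core expression; the names `CONSUMING` and `LOOKAROUND` are assumptions made for illustration.

```python
import re

# Answer 2's pattern: the surrounding whitespace is consumed by the match.
CONSUMING = re.compile(r"(^|\s+)(([@]{1}|[#]{1})[A-Za-z0-9]+)(\s+|$)")

# Same core expression guarded by zero-width lookarounds instead.
LOOKAROUND = re.compile(r"(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])")

text = "@alice @bob #tag"  # hypothetical input with adjacent tags

consumed = [m.group(0) for m in CONSUMING.finditer(text)]
zero_width = [m.group(0) for m in LOOKAROUND.finditer(text)]

print(consumed)    # whitespace is included, and "@bob" is skipped
print(zero_width)  # every tag is found, with no whitespace attached
```

With the consuming pattern, the space after `@alice` is used up by the first match, so `@bob` has no leading whitespace left to satisfy `(^|\s+)`.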

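Discussion:

You probably want those to be a lookahead and a lookbehind so that the whitespace is not part of the match.

@juharr You are right, the whitespace should not be part of the match.

@juharr Thanks for your comment, it helped me a lot. See the update to the question, which resolves my last doubt.

Answer 3:

This regex can do the job for you.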

[@#][A-Za-z0-9]+\s|\s[@#][A-Za-z0-9]+

The | operator produces a logical "or", so you have 2 different expressions to match.

[@#][A-Za-z0-9]+\s

and

\s[@#][A-Za-z0-9]+

where

\s - space
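To see how this alternation behaves on the question's own examples, here is a small Python sketch (Python is used only for convenience; the name `ANSWER3` is an assumption made for illustration). Each branch checks whitespace on one side only, so a tag glued to preceding text can still match through the first branch.

```python
import re

# Answer 3's pattern: tag followed by whitespace, OR whitespace followed by tag.
ANSWER3 = re.compile(r"[@#][A-Za-z0-9]+\s|\s[@#][A-Za-z0-9]+")

glued = ANSWER3.findall("hey@hello @hello")
mixed = ANSWER3.findall("@hello! hi @hello #hi ##hello")

print(glued)  # the "@hello" attached to "hey" is matched, trailing space included
print(mixed)  # the right tags are found, but each carries its leading space
```

In the first string the unwanted `@hello` (attached to `hey`) is matched together with its trailing space, which then prevents the valid second `@hello` from matching at all.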

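Discussion:

It almost works. The problem is that it still matches when you type some characters or whitespace before the hashtag or mention. The solution should not do that. See the update to my question.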
