Using capturing groups with reluctant, greedy, and possessive quantifiers

Posted 2023-02-19

[Title]: Using capturing groups with reluctant, greedy, and possessive quantifiers [Posted]: 2014-06-21 21:37:07 [Question]:

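I am practicing Java regular expressions with Oracle's tutorial. To better understand greedy, reluctant, and possessive quantifiers, I created some examples. My question is how these quantifiers work when they are applied to a capturing group. I don't understand what happens when a quantifier is used this way; the reluctant quantifier, for example, looks as if it does nothing at all. I have also searched the internet a lot and only found expressions like (.*?). Is there a reason why people usually use quantifiers with that syntax rather than something like "(.foo)??"?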

Here is the reluctant example:

Enter your regex: (.foo)??
Enter input string to search: xfooxxxxxxfoo
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
I found the text "" starting at index 3 and ending at index 3.
I found the text "" starting at index 4 and ending at index 4.
I found the text "" starting at index 5 and ending at index 5.
I found the text "" starting at index 6 and ending at index 6.
I found the text "" starting at index 7 and ending at index 7.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
I found the text "" starting at index 10 and ending at index 10.
I found the text "" starting at index 11 and ending at index 11.
I found the text "" starting at index 12 and ending at index 12.
I found the text "" starting at index 13 and ending at index 13.

For the reluctant quantifier, shouldn't it show the text "xfoo" for index 0 and 4? And here is the possessive one:

Enter your regex: (.foo)?+ 
Enter input string to search: afooxxxxxxfoo
I found the text "afoo" starting at index 0 and ending at index 4
I found the text "" starting at index 4 and ending at index 4.
I found the text "" starting at index 5 and ending at index 5.
I found the text "" starting at index 6 and ending at index 6.
I found the text "" starting at index 7 and ending at index 7.
I found the text "" starting at index 8 and ending at index 8.
I found the text "xfoo" starting at index 9 and ending at index 13.
I found the text "" starting at index 13 and ending at index 13.

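For the possessive quantifier, shouldn't it try the input only once? I am really confused by this one in particular, because it seems to try every possibility.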

Thanks in advance!

[Comments]:

Check the SO regex reference under quantifiers; it links to this question.

[Answer 1]:

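The regex engine (basically) walks through the string one character at a time, starting from the left, and tries to make the characters fit your pattern. It returns the first match it finds.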

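A reluctant quantifier applied to a subpattern means that the regex engine gives priority to (as in, tries first) whatever follows that subpattern.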

Here is what happens, step by step, with .*?b on aabab:

aabab # we try to make '.*?' match zero '.', skipping it directly to try and 
^     # ... match b: that doesn't work (we're on a 'a'), so we reluctantly 
      # ... backtrack and match one '.' with '.*?'
aabab # again, we by default try to skip the '.' and go straight for b:
 ^    # ... again, doesn't work. We reluctantly match two '.' with '.*?'
aabab # FINALLY there's a 'b'. We can skip the '.' and move forward:
  ^   # ... the 'b' in '.*?b' matches, regex is over, 'aab' is a general match
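
The trace above can be checked with a few lines of Java. This is only a minimal sketch: the class name `ReluctantVsGreedy` and the helper `firstMatch` are made up for illustration, and the greedy pattern `.*b` is added purely for contrast.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantVsGreedy {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Reluctant: '.*?' grows one character at a time until 'b' can match.
        System.out.println(firstMatch(".*?b", "aabab")); // aab
        // Greedy: '.*' first takes everything, then gives characters back until 'b' can match.
        System.out.println(firstMatch(".*b", "aabab"));  // aabab
    }
}
```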

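In your pattern, there is no equivalent of the b. (.foo) is optional, and the engine gives priority to the part of the pattern that follows it. That part is nothing, and nothing matches an empty string: an overall match is found, and it is always an empty string.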
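
To see this in Java, the sketch below repeats the question's search and then adds something after the group. The class `ReluctantGroup`, the helper `allMatches`, and the extra pattern `(.foo)??$` are assumptions introduced for illustration; they are not part of the question. Once a `$` has to match after the group, skipping the group fails at index 0, so the engine reluctantly lets the group match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantGroup {
    // Hypothetical helper: the text of every match reported by successive find() calls.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Nothing follows the group, so skipping it always succeeds: only empty matches.
        System.out.println(allMatches("(.foo)??", "xfooxxxxxxfoo"));
        // '$' follows the group: at index 0 skipping fails, so the group has to match "xfoo".
        System.out.println(allMatches("(.foo)??$", "xfoo"));
    }
}
```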

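As for possessive quantifiers, you are confused about what they do. They have no direct effect on the number of matches. It is not clear which tool you use to apply your regex, but it looks for global matches (every match in the input, one after another), and that is why it does not stop at the first match.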
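
A short Java sketch of both points follows. The names `PossessiveDemo`, `allMatches`, and `found` are hypothetical, and the patterns `.*+b` and `.*b` are extra examples that do not appear in the question. The first call repeats the global search by calling `find()` in a loop; the last two show what a possessive quantifier actually changes, namely that it never gives back what it has consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Hypothetical helper: a global search, i.e. find() called until it returns false.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Hypothetical helper: true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Several matches, because the loop restarts the search after each one.
        System.out.println(allMatches("(.foo)?+", "afooxxxxxxfoo"));
        // Possessive '.*+' keeps every character it took, so 'b' can never match.
        System.out.println(found(".*+b", "aabab")); // false
        // Greedy '.*' backtracks and gives the last 'b' back.
        System.out.println(found(".*b", "aabab"));  // true
    }
}
```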

For more information about them, see http://www.regular-expressions.info/possessive.html.

Also, as HamZa pointed out, https://***.com/a/22944075 is becoming a great reference for regex-related questions.

[Discussion]:

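For the regex a*?b and the input aabab, can I say that b is applied to the input first and not found, and only then is a* used "reluctantly" and matched? I get empty matches for the regex a*? or a?? on the input aabab, so can I say that the * and ? in the middle can take 0 as a value, and that zero is checked first? For greedy, the regex a+ on the input aabab gives aa, so is one of the differences between greedy and reluctant that reluctant first checks what comes after the "?" character? a+? gives predictable output because the + in the middle requires 1 or more, which seems to agree with my assumption.

I would say b is applied to the current character, yes. "Zero is checked first" is another way of putting it, I guess. a+? is equivalent to aa*?: once one a has been matched, we are back to the usual pattern.

@mwb: you will have to tweak it a bit, but you can see what happens step by step here: regex101.com/r/yQ5wO9, in the debugger view (middle left). Or try to work out how Java produces its output by itself: ***.com/questions/1137437/…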
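
The claims in this exchange can be checked with a small Java sketch. The class `QuantifierComparison` and the helper `allMatches` are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierComparison {
    // Hypothetical helper: the text of every match found in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "aabab";
        System.out.println(allMatches("a*?", input));  // zero is tried first: only empty matches
        System.out.println(allMatches("a+", input));   // greedy: [aa, a]
        System.out.println(allMatches("a+?", input));  // reluctant, at least one: [a, a, a]
        System.out.println(allMatches("aa*?", input)); // same matches as a+?
    }
}
```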
