How to use regular expression "grouping" for multiple regexes in C++?

Posted 2023-02-22

Title: How to use the regular expression "Grouping" for multiple regex in C++?
Posted: 2014-11-03 02:18:35

Question:

This is a continuation of a previous SO question and its discussion.

Different Between std::regex_match & std::regex_search

In that SO question, I wrote the following regular expression to extract the day from a given input string:

std::string input{ "Mon Nov 25 20:54:36 2013" };
// Day: exactly two digits surrounded by whitespace on both sides
std::regex  r{ R"(\s\d{2}\s)" };

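In one of the answers it was changed to R"(.*?\s(\d{2})\s.*)", which creates a capture group so that the day is available as the first sub-match. With that change, parsing the day with either regex_match or regex_search works fine.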

I have now written the following regular expressions to parse the various fields of the input string above:

std::string input{ "Mon Nov 25 20:54:36 2013" };


    // DayStr: exactly three letters at the start, followed by whitespace (Output: Mon)
    std::regex   dayStrPattern{ R"(^\w{3}\s)" };
    // Day: exactly two digits surrounded by whitespace on both sides (Output: 25)
    std::regex   dayPattern{ R"(\s\d{2}\s)" };
    // Month: exactly three letters surrounded by whitespace on both sides (Output: Nov)
    std::regex   monthPattern{ R"(\s\w{3}\s)" };
    // Year: exactly four digits at the end of the string (Output: 2013)
    std::regex   yearPattern{ R"(\s\d{4}$)" };
    // Hour: exactly two digits with whitespace on the left and ':' on the right (Output: 20)
    std::regex   hourPattern{ R"(\s\d{2}:{1})" };
    // Min: exactly two digits with ':' on the left and ':' on the right (Output: 54)
    std::regex   minPattern{ R"(:{1}\d{2}:{1})" };
    // Second: exactly two digits with ':' on the left and whitespace on the right (Output: 36)
    std::regex   secPattern{ R"(:{1}\d{2}\s)" };

I have tested the above regular expressions here, and they appear to be correct.

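Can we use the grouping mechanism here, so that we pass a single regular expression to std::regex_search instead of 7 different ones? That way std::regex_search would store the results in the sub-match vector of its std::smatch. Is this possible? I have read the documentation and the book A Tour of C++, but I do not understand regular expression grouping well.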

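In general, when and how should we use and design grouping so that a single call to std::regex_search yields all the pieces of information?

At the moment I have to call std::regex_search 7 times with different regular expressions to get the individual fields and then use them. I am sure there is a better way to do this than what I am doing now.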

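Comments:

Can we make any assumptions about the order of the date fields? Without such an assumption, your current approach may be better than a single-regex solution.

@nhahtdh: Yes, we can assume the fields always appear in this order. The main idea here is to understand when (not just in this example) and how to use grouping in regular expressions.

The question is wrong. I suggested adding parentheses around \d{2} to create a capture group, so the regex in my answer is R"(.*?\s(\d{2})\s.*)" (note the extra parentheses). Your example code does not define a capture group.

@Praetorian: I have edited the question and corrected it.

Answer 1: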

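There is no need to call regex_match 7 times to match 7 parts of the same input. Create multiple capture groups in one regex instead of a single group each time. For example, change your regex to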

std::regex r{R"(^(\w{3}) (\w{3}) (\d{2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})$)"};

A single call to regex_match then gives you all the matches through match_results:

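std::smatch match;  // match_results for std::string input
if (std::regex_match(input, match, r))
    for (auto const& m : match)  // element 0 is the whole match, 1-7 are the groups
        std::cout << m << '\n';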

Live demo

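Comments:

Should the order of the groups be in sync with the input? Could you explain the concept?

@MantoshKumar Yes, the fields in the input always need to be in the order shown. Also, you may want to change the digit captures to (\d{1,2}) to handle single-digit days and times (for example, if it is the first day of the month and your input is not zero-padded).

@MantoshKumar Not sure what explanation you are looking for. You want to extract 7 fields from the input, so I created 7 capture groups, one for each field. There are 8 match_results entries because the zeroth element is always the entire match. The rest (1-7) are the fields from the input string, in the same order as the capture groups in the regex.

You have 8 match_results because it has 8 pairs of parentheses. The 8 match_results are in the same order in which their opening parentheses appear.

@Praetorian: Thanks for the excellent information and analysis. You have cleared up all my doubts about the SO question.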
