Regular expression group matching using Boost::regex

Posted 2023-02-21


【Title】: Regular expression group matching using Boost::regex 【Posted】: 2014-12-22 08:39:28 【Question】:

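I have a string in this format:

7XXXX 8YYYY 9ZZZZ 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL,

The group 7XXXX 8YYYY 9ZZZZ 0LLLL can repeat any number of times; X, Y, Z and L are digits. The groups starting with 7, 8, 9 and 0 always appear in that order, but some of them may be missing, as in 7XXXX 0LLLL 8YYYY 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL.

I am trying to use the Boost::regex library to achieve this.

I want to split out these groups and put them into an array or vector. For now I am just trying to cout them.

This is what I have tried, but I only get either the full string match or the last match for each of the 7, 8, 9, 0 groups, and not strings like 7XXXX 8YYYY 9ZZZZ 0LLLL: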

 const char* pat = "(([[:space:]]+7[0-9]{4}){0,1}([[:space:]]+8[0-9]{4}){0,1}([[:space:]]+9[0-9]{4}){0,1}([[:space:]]+0[0-9]{4}){0,1})+";
 boost::regex reg(pat);
 boost::smatch match;
 string example= "71122 85451 75415 01102 75555 82133 91341 02134";

 const int subgroups[] = {0,1,2,3,4,5,6};
 boost::sregex_token_iterator i(example.begin(), example.end(), reg, subgroups);
 boost::sregex_token_iterator j;

 while (i != j)
 {
   cout << "Match: " << *i++ << endl;
 }

Sample output:

Match: 71122 85451 75415 01102 75555 82133 91341 02134
<A bunch of empty "Match:" rows>
Match: 75555
Match: 82133
Match: 91341
Match: 02134
<A bunch of empty "Match:" rows>

But I would like to get this instead:

71122 85451 
75415 01102 
75555 82133 91341 02134

I know I am doing something wrong and that my regex cannot do what I want :( Why can't I get all of the repeated matches by using parentheses?

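【Comments】:

Why can't you just split on spaces and then grab everything?

I don't just need to split them apart. I need the 7, 8, 9, 0 groups separated, i.e. to get each 7XXXX 8YYYY 9ZZZZ 0LLLL out of the string.

Interesting question. I sense an X/Y problem. I don't think you need regular expressions at all. Also, you don't say so, but you probably want to parse the numbers into integers (71122, 85451 or 1122, 5451) rather than strings. My answer shows how to do both.

【Answer 1】: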

Edit: Since I completely misunderstood the question the first time, I am replacing the whole answer. My idea is this:

const char* pat = "[[:space:]]+((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;

//                    v-- extra space here to make the match easier.
std::string example= " 71122 85451 75415 01102 75555 82133 91341 02134";

boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;

while (i != j)
{
  std::cout << "Match: " << *i++ << std::endl;
}
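The pattern above captures one sequence per match and lets the iterator do the repeating, instead of repeating the group inside the pattern as the question's attempt did. When a capture group sits inside a repetition, the sub-match reported at the end is the one from the last iteration only, which is consistent with the question's output. The following is a minimal sketch of that behaviour; it assumes `std::regex` from the standard library in place of Boost.Regex so that it builds without Boost, and `last_capture` is a hypothetical helper name, not part of either library.

```cpp
#include <regex>
#include <string>

// Matches the whole string as a repetition of five-digit tokens and returns
// what the inner capture group holds afterwards. Only the last iteration of
// a repeated group is kept, so earlier tokens are not recoverable this way.
// Returns an empty string when the input does not match at all.
std::string last_capture(const std::string& text) {
    static const std::regex reg("([[:space:]]*([0-9]{5}))+");
    std::smatch m;
    if (!std::regex_match(text, m, reg)) {
        return "";
    }
    return m[2].str();
}
```

Calling `last_capture("71122 85451 75415")` yields only `75415`, which is why iterating over separate matches is the more practical route here.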

If the string cannot be modified, a workaround for the empty-match problem is:

const char* pat = "((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;
std::string example= "71122 85451 75415 01102 75555 82133 91341 02134";

boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;

while (i != j)
{
  // Every sub-pattern is optional, so skip the empty matches.
  if(i->length() != 0) {
    std::cout << "Match: " << *i << std::endl;
  }

  ++i;
}

In this case, though, it is probably better to use regex_iterator instead of regex_token_iterator:

// No need for the outer parentheses anymore
const char* pat = "(7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?";
boost::regex reg(pat);

boost::sregex_iterator i(example.begin(), example.end(), reg);
boost::sregex_iterator j;

// Rest the same.
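Since the question asks for the groups in a vector rather than on cout, the iterator loop above could be wrapped roughly as follows. This is only a sketch under a few assumptions: it uses `std::regex` and `std::sregex_iterator` from the standard library in place of their Boost counterparts so that it compiles without Boost, `split_groups` is a hypothetical name, and the leading-whitespace trim is an addition for sequences that do not start with a 7 group (their match begins at the separating space).

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every non-empty match of the "7 8 9 0" sequence pattern.
// All four parts are optional, so the iterator also reports empty matches
// between sequences; those are skipped.
std::vector<std::string> split_groups(const std::string& text) {
    static const std::regex reg(
        "(7[0-9]{4})?([[:space:]]+8[0-9]{4})?"
        "([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?");

    std::vector<std::string> groups;
    for (std::sregex_iterator it(text.begin(), text.end(), reg), end; it != end; ++it) {
        if (it->length() == 0) {
            continue;
        }
        std::string group = it->str();
        // A sequence without a leading 7 group starts at the separator, so trim it.
        group.erase(0, group.find_first_not_of(" \t\r\n"));
        groups.push_back(group);
    }
    return groups;
}
```

For the example string from the question, `split_groups` returns the three sequences `71122 85451`, `75415 01102` and `75555 82133 91341 02134`.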

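【Comments】:

This doesn't work for me. It essentially just splits on spaces. I need the groups starting with 7, 8, 9, 0 separated as ordered sequences, not every number separated on its own. I have included an example of the output I want :)

Oh, I misunderstood. Give me a moment, then.

I could join them into strings by comparing the first character and starting a new string whenever it is less than or equal to the previous one :) or something like that.

Is it possible to put a space in front of the string? Otherwise the regex gets a bit unwieldy.

I think that would help rather than do any harm ;)

【Answer 2】: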

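I think I would hand-roll a parser here. In the interest of agility, how about parsing with Spirit:

It parses directly into a vector of sequences. Handling whitespace is no problem. The grammar is described declaratively, in a syntax that somewhat resembles regular expressions but is tied in much more strongly with the C++ language.

It expresses the intent very clearly: a sequence is any combination of the items in the expected order, as long as the result contains at least one item: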

seq_  = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0');

where item_ parses any integer that starts with the specified digit:

item_  = &char_(_r1) >> uint_;

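In the parser we use *seq_ to parse any number of sequences. That is why we add a check that each matched sequence is non-empty; otherwise we could get an infinite loop that keeps matching an empty sequence at the same input position: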

eps(phx::size(_val) > 0) // require 1 element at least

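Note how debugging is built in (enable it by uncommenting the first line).

Note how simple it is to exclude the leading digit from the result by omitting the leading character. See the alternative version on Coliru: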

item_  = omit[char_(_r1)] >> uint_;

Output of the test program:

Parsing: 71122 85451 75415 01102 75555 82133 91341 02134
Parsed: 3 sequences

seq:    71122 85451 
seq:    75415 1102 
seq:    75555 82133 91341 2134

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

namespace qi  = boost::spirit::qi;
namespace phx = boost::phoenix;

using data = std::vector<std::vector<unsigned> >;

template <typename It, typename Skipper = qi::space_type> 
struct grammar : qi::grammar<It, data(), Skipper> {
    grammar() : grammar::base_type(start) {
        using namespace qi;

        start = *seq_;

        // any combination of the items in order, but at least one of them
        seq_  = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0')
              >> eps(phx::size(_val) > 0)
              ;

        // look ahead for the expected leading digit, then parse the whole number
        item_ = &char_(_r1) >> uint_;

        BOOST_SPIRIT_DEBUG_NODES((start)(item_)(seq_))
    }

  private:
    qi::rule<It, unsigned(char), Skipper> item_;
    qi::rule<It, std::vector<unsigned>(), Skipper> seq_;
    qi::rule<It, data(), Skipper> start;
};

int main() {

    for (std::string const input : {
            "71122 85451 75415 01102 75555 82133 91341 02134"
            })
    {
        using It = std::string::const_iterator;
        grammar<It> p;
        auto f(input.begin()), l(input.end());

        data parsed;
        bool ok = qi::phrase_parse(f,l,p,qi::space,parsed);

        std::cout << "Parsing: " << input << "\n";
        if (ok) {
            std::cout << "Parsed: " << parsed.size() << " sequences\n";
            for(auto& seq : parsed)
                std::copy(seq.begin(), seq.end(), std::ostream_iterator<unsigned>(std::cout << "\nseq:\t", " "));
            std::cout << "\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
}
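For comparison, the hand-rolled route mentioned at the start of this answer could look roughly like the sketch below, which uses only the standard library. It applies the same sequence rule (items in the order 7, 8, 9, 0, each optional, at least one per sequence) by opening a new sequence whenever a token's leading digit does not come later in that order than the previous one. The names `rank_of` and `split_sequences` are hypothetical, and the sketch assumes whitespace-separated tokens; it does not check that each token is exactly five digits, and it keeps tokens as strings so leading zeros survive.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Position of a token's leading digit in the expected order 7, 8, 9, 0.
// Returns -1 for any other leading character.
int rank_of(char lead) {
    switch (lead) {
        case '7': return 0;
        case '8': return 1;
        case '9': return 2;
        case '0': return 3;
        default:  return -1;
    }
}

// Splits whitespace-separated tokens into sequences that follow the
// 7, 8, 9, 0 order. A token whose rank is not greater than the previous
// token's rank starts a new sequence.
std::vector<std::vector<std::string>> split_sequences(const std::string& input) {
    std::vector<std::vector<std::string>> sequences;
    std::istringstream tokens(input);
    std::string token;
    int previous = 4;  // higher than any rank, so the first valid token opens a sequence
    while (tokens >> token) {
        const int current = rank_of(token.front());
        if (current < 0) {
            continue;  // ignore tokens that do not start with 7, 8, 9 or 0
        }
        if (current <= previous) {
            sequences.emplace_back();
        }
        sequences.back().push_back(token);
        previous = current;
    }
    return sequences;
}
```

On the example input this produces three sequences with two, two and four tokens. Whether silently skipping unexpected tokens is acceptable depends on the data; a stricter version might report them as errors instead.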

【Comments】:
