How to quickly find and extract multiple substrings from a string in C++?

Posted 2023-02-22


[Asked]: 2017-12-11 12:44:05 [Question]:

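I'm fairly new to C++ and I'm struggling with the following problem: I'm parsing syslog messages from iptables. Each message looks like this:

192.168.1.1:20200:Dec 11 15:20:36 SRC=192.168.1.5 DST=8.8.8.8 LEN=250

I need to parse the string quickly (new messages arrive very fast) to get SRC, DST and LEN. If this were a simple program, I would use std::string::find to locate the index of the SRC substring and then append each following character to an array in a loop until I hit a space. Then I would do the same for DST and LEN. For example: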

std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
std::string substr;

std::cout << "Original string: \"" << x << "\"" << std::endl;

// Below "magic number" 4 means length of "SRC=" string 
// which is the same for "DST=" and "LEN="    

// For SRC
auto npos = x.find("SRC");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find(" ", npos) - (npos + 4));
    std::cout << "SRC: " << substr << std::endl;
}

// For DST
npos = x.find("DST");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find(" ", npos) - (npos + 4));
    std::cout << "DST: " << substr << std::endl;
}

// For LEN
npos = x.find("LEN");
if (npos != std::string::npos) {
    // LEN is the last field, so take everything up to the end of the string
    substr = x.substr(npos + 4);
    std::cout << "LEN: " << substr << std::endl;
}

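In my case, however, I need to do this very quickly, ideally in a single pass over the string. Can you give me some advice?

[Comments on the question]: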

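A word of warning about regular expressions: there is a saying along the lines of "You have a problem. You solve it with regular expressions. Now you have two problems." Regular expressions can be a powerful tool, but they are also advanced and far from trivial. They are easy to get wrong and should only be a last resort, especially for beginners and even for intermediate programmers. Do you have evidence that your program is too slow?

[Answer 1]:

If your format is fixed and verified (that is, you can accept undefined behavior as soon as the input string does not contain exactly the expected characters), you might squeeze out some performance by writing larger parts by hand and skipping the string-termination tests that are part of all the standard functions.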

#include <cstdlib>  // std::strtol

// buf_ptr will be updated to point to the first character after the " SRC=x.x.x.x" sequence
unsigned long GetSRC(const char*& buf_ptr)
{
    // Don't search like this unless you have a trusted input format that's guaranteed to contain " SRC="!!!
    while (*buf_ptr != ' ' ||
        *(buf_ptr + 1) != 'S' ||
        *(buf_ptr + 2) != 'R' ||
        *(buf_ptr + 3) != 'C' ||
        *(buf_ptr + 4) != '=')
    {
        ++buf_ptr;
    }
    buf_ptr += 5;
    char* next;

    long part = std::strtol(buf_ptr, &next, 10);
    // part is now the first number of the IP. Depending on your requirements you may want to extract the string instead
    unsigned long result = (unsigned long)part << 24;

    // Don't use 'next + 1' like this unless you have a trusted input format!!!
    part = std::strtol(next + 1, &next, 10);
    // part is now the second number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 16;

    part = std::strtol(next + 1, &next, 10);
    // part is now the third number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 8;

    part = std::strtol(next + 1, &next, 10);
    // part is now the fourth number of the IP. Depending on your requirements ...
    result |= (unsigned long)part;

    // update the buf_ptr so searching for the next information ( DST=x.x.x.x) starts at the end of the currently parsed parts
    buf_ptr = next;
    return result;
}

Usage:

const char* x_str = x.c_str();
unsigned long srcIP = GetSRC(x_str);
// now x_str will point to " DST=15.15.15.15 LEN=255" for further processing

std::cout << "SRC=" << (srcIP >> 24) << "." << ((srcIP >> 16) & 0xff) << "." << ((srcIP >> 8) & 0xff) << "." << (srcIP & 0xff) << std::endl;

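Note that I decided to write the whole extracted source IP into a single 32-bit unsigned value. You can choose a completely different storage model if you need one.

Even if you cannot be that optimistic about your format, it is probably a good idea for performance to use a pointer that is updated as each part is processed and to continue with the remaining string, instead of starting from position 0 every time.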
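The "keep a moving position" idea can also be written with bounds checks for input that is not trusted. The following is only a sketch, assuming C++17 (`std::string_view`, `std::optional`) and assuming the keys always appear in the order SRC, DST, LEN as in the sample message; `next_field`, `Fields` and `parse_line` are hypothetical names, not part of the answer.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: finds `key` at or after `pos`, returns the text between
// the key and the next space (or the end of the input), and advances `pos`
// past that value so the next search resumes from there.
std::optional<std::string_view> next_field(std::string_view line,
                                           std::string_view key,
                                           std::size_t& pos)
{
    const std::size_t start = line.find(key, pos);
    if (start == std::string_view::npos)
        return std::nullopt;              // key missing: pos is left untouched

    const std::size_t value_begin = start + key.size();
    std::size_t value_end = line.find(' ', value_begin);
    if (value_end == std::string_view::npos)
        value_end = line.size();          // last field runs to the end of the line

    pos = value_end;
    return line.substr(value_begin, value_end - value_begin);
}

// The views point into the parsed line, so the line must outlive the result.
struct Fields {
    std::string_view src, dst, len;
};

// Assumes the order SRC, DST, LEN; each search starts where the previous ended.
std::optional<Fields> parse_line(std::string_view line)
{
    std::size_t pos = 0;
    const auto src = next_field(line, "SRC=", pos);
    if (!src) return std::nullopt;
    const auto dst = next_field(line, "DST=", pos);
    if (!dst) return std::nullopt;
    const auto len = next_field(line, "LEN=", pos);
    if (!len) return std::nullopt;
    return Fields{*src, *dst, *len};
}
```

Unlike `GetSRC`, this version never reads past the end of the buffer and reports a missing key instead of running off the string, at the cost of the extra checks that the hand-written loop avoids.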

Of course, I assume your std::cout << ... lines are only there for development testing; otherwise all of this micro-optimization is pointless.

[Comments]:
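To see what the parsing itself costs without the output getting in the way, the work can be timed with the printing replaced by a cheap accumulation. This is a rough sketch only: `time_parser` and `src_length_with_find` are hypothetical names, and a real measurement would need realistic input, an optimized build, and several repetitions.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical harness: runs `parse` over every line and returns the elapsed
// time in seconds. `sink` accumulates a value derived from each result so the
// compiler cannot discard the parsing work.
template <typename Parser>
double time_parser(const std::vector<std::string>& lines, Parser parse,
                   std::size_t& sink)
{
    const auto begin = std::chrono::steady_clock::now();
    for (const std::string& line : lines)
        sink += parse(line);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Stand-in for the find-based approach: returns the length of the SRC value,
// or 0 if the key is missing.
std::size_t src_length_with_find(const std::string& line)
{
    const std::size_t pos = line.find("SRC=");
    if (pos == std::string::npos)
        return 0;
    const std::size_t begin = pos + 4;    // 4 == length of "SRC="
    const std::size_t end = line.find(' ', begin);
    return (end == std::string::npos ? line.size() : end) - begin;
}
```

Calling `time_parser(lines, src_length_with_find, sink)` with a vector of sample lines gives one number per candidate parser; swapping in another function with the same signature allows a like-for-like comparison.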

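@Groosha I wonder, did you test your earlier solution from the question and compare it with the regex solution? I'm interested in whether you actually gained any performance from regex, or just the beauty of maintainable code.

@Groosha You really should test it... I created a benchmark with a vector of 100000 strings that are processed and stored into a result object. I got the following results (I replaced the output with storing into an internal result object): my code: 0.024 sec, your code: 0.044 sec, regex code: 0.617 sec. So basically, the regex solution is 14 times slower than the code you started with. Maybe I did something wrong in the details of how I used regex, but if you care about performance, you really should race your horses!

@Groosha I had to retrofit each version a bit to make it work at all, but I hope I did it in a way that reflects the original code: onlinegdb.com/BkfH9M3bz (grek40) onlinegdb.com/Bk0n5G3bz (question) onlinegdb.com/Hyeqsf3WG (regex)

@Groosha I'm a bit curious whether you stayed with the regex approach or ended up with something different. After all, you explicitly stated a performance requirement in the question ;)

@Groosha As discussed, your own code is actually quite fast as well. So you can shift a bit towards readability and robustness without losing much performance. Keep in mind that regex is more about coding performance and compactness than about maximum execution speed.

[Answer 2]:

"Fast, ideally in one iteration": in practice, a program's speed does not depend on the number of loops visible in the source code. Regular expressions in particular are a great way to hide multiple nested loops.

Your solution is actually quite good. It does not waste much time before finding "SRC", and it does not search further than necessary to retrieve the IP address. Sure, while searching for "SRC" it gets a false positive on the first "S" of "Sep", but that is resolved by the next comparison. If you know for certain that the first occurrence of "SRC" is somewhere around column 20, you can save a little time by skipping the first 20 characters. (Check your logs; I can't tell.)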
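Skipping a known prefix only requires passing a start offset to `find`. A minimal sketch, where `src_after_prefix` is a hypothetical name and `skip` is an assumed lower bound that has to come from inspecting real log lines:

```cpp
#include <cstddef>
#include <string>

// `skip` is an assumed lower bound on where "SRC=" can first appear.
// If it is too large, the key is missed and an empty string is returned.
std::string src_after_prefix(const std::string& line, std::size_t skip)
{
    // find() returns npos when `skip` is past the end, so no extra check is needed
    const std::size_t pos = line.find("SRC=", skip);
    if (pos == std::string::npos)
        return {};
    const std::size_t begin = pos + 4;            // 4 == length of "SRC="
    const std::size_t end = line.find(' ', begin);
    if (end == std::string::npos)
        return line.substr(begin);                // value runs to the end of the line
    return line.substr(begin, end - begin);
}
```

In the sample message from the question, "SRC=" starts at index 34, so a skip of 20 is safe there; whether the saving is measurable is something only a benchmark on real data can show.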

[Comments]:

[Answer 3]:

You can use std::regex, for example:

std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";

std::regex const r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
std::smatch matches;
if (regex_search(x, matches, r)) {
    std::cout << "SRC " << matches.str(1) << '\n';
    std::cout << "DST " << matches.str(2) << '\n';
    std::cout << "LEN " << matches.str(3) << '\n';
}

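Note that matches.str(idx) creates a new string containing the match. With matches[idx] you get a std::sub_match that holds iterators into the original string, so no new string is created.

[Comments]: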
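One way to use those iterators without copying is to wrap each sub-match in a `std::string_view`. This is a sketch assuming C++17; `view_of`, `RegexFields` and `parse_with_regex` are hypothetical names, and the views are only valid while the searched string is alive and unmodified.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical helper: views a sub-match without copying it. Relies on
// std::string storing its characters contiguously.
std::string_view view_of(const std::ssub_match& m)
{
    if (!m.matched || m.first == m.second)
        return {};                        // nothing matched, or an empty match
    return std::string_view(&*m.first, static_cast<std::size_t>(m.length()));
}

// The views point into the string passed to parse_with_regex.
struct RegexFields {
    std::string_view src, dst, len;
};

bool parse_with_regex(const std::string& line, RegexFields& out)
{
    // Constructing a std::regex is expensive, so it is built only once.
    static const std::regex r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
    std::smatch matches;
    if (!std::regex_search(line, matches, r))
        return false;
    out = RegexFields{view_of(matches[1]), view_of(matches[2]), view_of(matches[3])};
    return true;
}
```

This removes the per-field string allocations but not the cost of the regex engine itself, so it should not be expected to close the gap to the hand-written variants.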
