How can I implement an efficient whole-word string replacement in C++ without regular expressions?

Posted 2023-02-22


【Title】: How can I implement an efficient whole-word string replacement in C++ without regular expressions? 【Posted】: 2011-05-09 20:06:00 【Question】:

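Maybe I'm overlooking something obvious, but I was wondering what the fastest way to implement whole-word string replacement in C++ might be. At first I considered simply concatenating spaces to the search word, but that does not account for string boundaries or punctuation.

This is my current abstraction for (non-whole-word) replacement: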

void Replace(wstring& input, wstring find, wstring replace_with) {
  if (find.empty() || find == replace_with || input.length() < find.length()) {
      return;
  }
  // Replace each occurrence in place, resuming the search after the inserted text.
  for (size_t pos = input.find(find);
              pos != wstring::npos;
              pos = input.find(find, pos)) {
      input.replace(pos, find.length(), replace_with);
      pos += replace_with.length();
  }
}

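If I only treated spaces as word boundaries, I could cover the string boundaries by comparing the beginning and end of the search string against the find string, and then use Replace(L' ' + find + L' ')... But I'm wondering whether there is a more elegant solution that can efficiently include punctuation.

Let's consider a word to be any run of characters delimited by whitespace or punctuation (for simplicity, let's say at least !"#$%&'()*+,-./ which happens to correspond to (c > 31 && c < 48)).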
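The boundary definition above can be written as a small predicate. This is only a minimal sketch, not code from the question: the names is_boundary and is_whole_word_at are hypothetical, and it assumes whitespace should be tested with std::iswspace in addition to the (c > 31 && c < 48) range.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical helper: true if c separates words under the question's
// definition, i.e. whitespace or a code point in 32..47 (space !"#$%&'()*+,-./).
bool is_boundary(wchar_t c) {
    return std::iswspace(static_cast<std::wint_t>(c)) != 0 || (c > 31 && c < 48);
}

// Hypothetical helper: true if the len characters starting at pos are
// delimited on both sides by a boundary character or by an end of the string.
bool is_whole_word_at(const std::wstring& s, std::size_t pos, std::size_t len) {
    const bool left_ok = (pos == 0) || is_boundary(s[pos - 1]);
    const bool right_ok = (pos + len >= s.size()) || is_boundary(s[pos + len]);
    return left_ok && right_ok;
}
```

A replacement loop could call is_whole_word_at for each position returned by wstring::find and skip the hits that fail the check. Characters outside the 32..47 range, including non-ASCII letters, are never treated as boundaries unless std::iswspace reports them as whitespace.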

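In my application, I have to call this function on a fairly large array of short strings, which may contain all kinds of Unicode characters that I don't want to be split into new words. I would also like to avoid pulling in any external libraries, but the STL is fine.

The point of not using regular expressions is to guarantee less overhead, with the goal of a fast function suited to this particular task on a large data set.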

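【Comments】:

Side note: if the input is long and you are replacing near the beginning, the replacement can be very slow. I suggest appending to a string buffer (e.g. std::stringstream) and then overwriting the input in one step.

The Unicode requirement makes things trickier. I know you are trying to avoid regular expressions and extra libraries, but you could take a look at ICU: it has regex-based replacement (see its regex docs) and lets you use the \b "word boundary" metacharacter.

【Answer 1】: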

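I think you can do this with whole-word matching and still do it efficiently. The key points are:

1. Detect "whole-word" boundaries with 'std::iswalpha' (the wide-character counterpart of 'std::isalpha', since the input is a wstring). Its result depends on the current C locale, so non-ASCII letters are only recognized when a suitable locale is set.
2. Do the replacement "out of place": build a separate "output" string and swap it with "input" at the end of processing, rather than doing the work "in place" on the "input" string itself.

Here is my take on your function: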

#include <ciso646> // or, not, and
#include <cstddef> // size_t
#include <cwctype> // iswalpha
#include <string> // wstring

using std::size_t;
using std::wstring;

/// @brief Do a "find and replace" on a string.
/// @note This function does "whole-word" matching.
/// @param[in,out] input_string The string to operate on.
/// @param[in] find_string The string to find in the input.
/// @param[in] replace_string The string to replace 'find_string'
///            with in the input.
void find_and_replace( wstring& input_string,
                       const wstring& find_string,
                       const wstring& replace_string )
{
  if( find_string.empty()
      or find_string == replace_string
      or input_string.length() < find_string.length() )
  {
    return;
  }

  wstring output_string;
  output_string.reserve( input_string.length() );
  size_t last_pos = 0u; // start of the input not yet copied to the output
  for( size_t new_pos = input_string.find( find_string );
       new_pos != wstring::npos;
       new_pos = input_string.find( find_string, new_pos ) )
  {
    bool did_replace = false;
    // A match counts only if both neighbours are non-alphabetic or string ends.
    // std::iswalpha is used because std::isalpha is undefined for wide values.
    if( ( new_pos == 0u
          or not std::iswalpha( input_string.at( new_pos - 1u ) ) )
        and ( new_pos + find_string.length() == input_string.length()
              or not std::iswalpha( input_string.at( new_pos + find_string.length() ) ) ) )
    {
      output_string.append( input_string, last_pos, new_pos - last_pos );
      output_string.append( replace_string );
      did_replace = true;
    }
    new_pos += find_string.length();
    if( did_replace )
    {
      last_pos = new_pos;
    }
  }
  // Copy whatever follows the last replaced match.
  output_string.append( input_string, last_pos,
                        input_string.length() - last_pos );

  input_string.swap( output_string );
}

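P.S. I wasn't sure what "replace_all" was trying to accomplish in your original example, so I removed it from my solution for clarity.

P.P.S. This code would be much cleaner with regexes. Can you rely on C++ TR1 or C++ 2011 features? They provide a standard "regex" library.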
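For comparison, the regex route mentioned in the P.P.S. might look roughly like the sketch below, using the C++11 <regex> header. The helper names escape_regex and regex_replace_whole_word are hypothetical and not part of the original answer. The sketch assumes the search term starts and ends with a word character, because \b only matches between a word character and a non-word character, and it leaves the replacement text unescaped, so a '$' in it is interpreted as a format specifier.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escape ECMAScript regex metacharacters so
// that the search term is matched literally.
std::wstring escape_regex(const std::wstring& text) {
    static const std::wstring meta = L"\\^$.|?*+()[]{}";
    std::wstring out;
    out.reserve(text.size() * 2);
    for (wchar_t c : text) {
        if (meta.find(c) != std::wstring::npos) {
            out.push_back(L'\\');
        }
        out.push_back(c);
    }
    return out;
}

// Hypothetical helper: replace every whole-word occurrence of 'find' by
// wrapping the escaped term in \b word-boundary assertions.
std::wstring regex_replace_whole_word(const std::wstring& input,
                                      const std::wstring& find,
                                      const std::wstring& replace_with) {
    if (find.empty()) {
        return input;
    }
    const std::wregex pattern(L"\\b" + escape_regex(find) + L"\\b");
    return std::regex_replace(input, pattern, replace_with);
}
```

Constructing the std::wregex on every call is likely to dominate the cost for short strings, so for a large data set the pattern would need to be built once and reused. Whether that ends up competitive with the hand-written loop is something to measure rather than assume.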

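【Comments】:

After sleeping on it and seeing @Code_So1dier's answer, I should point out that the definition of a "whole word" in your question is now somewhat vague. Is the boundary strictly whitespace, any non-alphabetic character, or any non-alphanumeric character? For my example, that decision changes the boundary check done inside the for loop. For instance, if the only "whole-word" boundary is whitespace, replace not std::isalpha( input_string.at( new_pos + find_string.length() ) ) with std::isspace( input_string.at( new_pos + find_string.length() ) ) (for a wstring, the wide-character counterparts such as std::iswspace from <cwctype> are the safer choice).

In my application I probably only need to worry about whitespace and punctuation, and I don't want to split on all of Unicode (question edited). The goal of not using regular expressions is to guarantee less overhead, and hopefully to end up with a faster function suited to this single task. Admittedly, for most applications a pair of \b would probably be enough.

@sakatc OK, so whitespace or punctuation is a "whole-word" boundary; modifying my example by replacing not std::isalpha( input_string.at( new_pos + find_string.length() ) ) with

std::isspace( input_string.at( new_pos + find_string.length() ) ) or std::ispunct( input_string.at( new_pos + find_string.length() ) )

should do the trick. Also, do you want your method to restrict the contents of the 'find_string' parameter? For example, if the user submits "/test" it would cause confusion, because it contains two classes of "whole-word" delimiting characters.

I really don't think it is necessary to consider word boundaries inside the search term. If a programmer tries to match a multi-word string with a whole-word search, they have basically asked the function to fail... I wouldn't put much extra work into it. Matching adjacent words such as "foo bar" is beyond what I need, but if you want to explore it, knock yourself out~

By the way, the earlier replace_all was for recursive replacement and doesn't really apply to whole-word replacement. For example, to collapse extra newlines in a buffer you could call Replace(buffer,"\n\n","\n",true);... I'll remove it from the question since it doesn't apply here.

【Answer 2】: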

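Here is my quick answer, though I don't know how fast the solution is... There are a few solutions to this problem: 1. Iterate over the string, compare each word (delimited by whitespace), and rebuild the string for each occurrence: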

#include <cctype> // isspace
#include <string>
using std::string;

string& remove_all_occurences(string& s, const string& str_to_remove, const string& str_to_put) {
    typedef string::size_type string_size;
    string_size i = 0;
    string cur_string;
    cur_string.reserve(s.size());

    // invariant: we have processed characters [original value of i, i)
    while (i != s.size()) {
        // ignore leading blanks
        // invariant: characters in range [original i, current i) are all spaces
        while (i != s.size() && isspace(static_cast<unsigned char>(s[i])))
            ++i;

        // find end of next word
        string_size j = i;
        // invariant: none of the characters in range [original j, current j) is a space
        while (j != s.size() && !isspace(static_cast<unsigned char>(s[j])))
            ++j;

        // if we found some nonwhitespace characters
        if (i != j) {
            cur_string = s.substr(i, j - i);
            if (cur_string == str_to_remove) {
                // rebuild s as: prefix [0, i) + replacement + suffix [j, end)
                s = s.substr(0, i) + str_to_put + s.substr(j);
                // s changed length, so resume right after the inserted text
                j = i + str_to_put.size();
            }
            i = j;
        }
    }
    return s;
}

Test program:

// requires <iostream>, with std::cout and std::endl in scope
void call_remove_all_occurences() {
    string my_str = "The quick brown fox jumps over sleepy dog fox fox fox";
    cout << remove_all_occurences(my_str, "fox", "godzilla") << endl;
}

Output:

The quick brown godzilla jumps over sleepy dog godzilla godzilla godzilla

2. Split the string into a vector, then iterate over the vector and replace each occurrence - simple... no code, but you get the idea...
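Since no code was given for the vector approach, here is one way it could be sketched. The names split_words and replace_words are hypothetical, and the sketch assumes that words are separated by whitespace only, as in the first solution.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split text into whitespace-delimited words.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> words;
    for (std::string word; stream >> word; ) {
        words.push_back(word);
    }
    return words;
}

// Hypothetical helper: replace every word equal to 'find' and rejoin the
// words with single spaces.
std::string replace_words(const std::string& text,
                          const std::string& find,
                          const std::string& replace_with) {
    const std::vector<std::string> words = split_words(text);
    std::string result;
    result.reserve(text.size());
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0) {
            result += ' ';
        }
        result += (words[i] == find) ? replace_with : words[i];
    }
    return result;
}
```

Note the trade-offs: rejoining collapses every run of whitespace into a single space and drops leading and trailing whitespace, and a word with attached punctuation such as "fox." does not compare equal to "fox".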
