Clean a string for non alphabetical characters

Posted 2023-02-22

【Title】: Clean a string for non alphabetical characters 【Posted】: 2016-10-01 20:34:23 【Question】:

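I am trying to clean a string in C++. I want to strip every non-alphabetical character from it while leaving all kinds of English and non-English letters untouched. One of my test programs looks like this: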

#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string test = "Danish letters: Æ Ø Å !!!!!!??||~";
    cout << "Test = " << test << endl;

    for (int l = 0; l < test.size(); l++)
    {
        if (!isalpha(test.at(l)) && test.at(l) != ' ')
        {
            test.replace(l, 1, " nope");
        }
    }

    cout << "Test = " << test << endl;

    return 0;
}

This gives me the output:

Test = Danish letters: Æ Ø Å !!!!!!??||~
Test = Danish letters nope  nope nope  nope nope  nope nope  nope nope nope nope nope nope nope nope nope nope nope"

So my question is: how do I remove "!!!!!!??||~" but not "Æ Ø Å"?

I have also tried tests like

test.at(l)!='Å'

but I cannot get it to compile if I specify 'Å' as a character.

I have read about Unicode and UTF-8, but I do not really understand them.

Please help me out :)

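【Comments】:

Well, you need to keep reading about Unicode and UTF-8 until you understand them, and then everything should become clear.

You might want to look at the SO question titled How to strip all non alphanumeric characters from a string. I would also be interested to see whether std::isalnum works in your case.

@RawN: Both of those links target ASCII only, and this question is (implicitly) about non-ASCII.

@MooingDuck Nothing in C++ (or C) works only with ASCII.

@TomBlodget: Technically, you are right. Technically, they only work on a legacy subset of character encodings. They do not work with Unicode characters, which this code is probably using.

【Answer 1】: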

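char is meant for the ASCII character set, and you are trying to operate on a string that contains non-ASCII characters.

You are operating on Unicode characters, so you need to use wide-string operations: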

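#include <clocale>
#include <cwctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // iswalpha classifies according to the current C locale. In the default
    // "C" locale Æ Ø Å may not count as letters, so adopt the environment's
    // locale (assumed here to be a Unicode-capable one).
    setlocale(LC_ALL, "");

    wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~";
    wcout << L"Test = " << test << endl;

    for (size_t i = 0; i < test.size(); i++)
    {
        if (!iswalpha(test.at(i)) && test.at(i) != L' ')
        {
            test.replace(i, 1, L" nope");
        }
    }

    wcout << L"Test = " << test << endl;

    return 0;
}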

You could also use Qt and QString, in which case the same piece of code becomes:

QString test = "Danish letters: Æ Ø Å !!!!!!??||~";
qDebug() << "Test =" << test;

for (int i = 0; i < test.size(); i++)
{
    // isLetterOrNumber() also keeps digits; QChar::isLetter() keeps letters only
    if (!test.at(i).isLetterOrNumber() && test.at(i) != ' ')
    {
        test.replace(i, 1, " nope");
    }
}

qDebug() << "Test = " << test;
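
For readers who want to stay with a plain std::string, here is a minimal sketch of a byte-level filter. It is not from the answer above. It assumes the string is UTF-8 encoded, where every byte of a non-ASCII character such as Æ has its high bit set, so only ASCII bytes need to be inspected. The function name stripAsciiNonLetters is hypothetical. Note the trade-off: non-ASCII punctuation is kept as well, because nothing here classifies non-ASCII characters.

```cpp
#include <string>

// Hypothetical helper (not from the thread): removes ASCII characters that
// are neither letters nor spaces from a UTF-8 encoded std::string.
// Assumption: every byte with the high bit set belongs to a multi-byte UTF-8
// sequence and is copied through untouched.
std::string stripAsciiNonLetters(const std::string &utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char byte : utf8)
    {
        const bool nonAscii = byte >= 0x80;
        const bool asciiLetter = (byte >= 'A' && byte <= 'Z') ||
                                 (byte >= 'a' && byte <= 'z');
        if (nonAscii || asciiLetter || byte == ' ')
        {
            out += static_cast<char>(byte);
        }
    }
    return out;
}
```

Called on the question's test string, this should drop ':' and "!!!!!!??||~" while leaving the two-byte sequences for Æ, Ø and Å intact.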

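【Comments】:

Yes, this code leaves only English and non-English letters, because we are using iswalpha.

Wow, so much for my carefully thought-out emoji example. Starting over: the C++ wide-character functions and classes only work for the Basic Multilingual Plane and fail when given characters from the supplementary planes, which currently contain 73000 characters, some of which must be alphabetic. iswalpha is broken. en.wikipedia.org/wiki/…

@MooingDuck The wide-character API uses an implementation-defined fixed-width encoding that may have nothing to do with Unicode. It can be based on UTF-16, as on Windows, with the effect that characters outside the BMP cannot be handled correctly, or it can use UTF-32, as on Linux, which makes full Unicode support possible. Or it can use a completely different character set.

@nwellnhof: I forgot how implementation-defined wide characters are. You are right: with 4-byte wide characters, yes, they can handle all of Unicode cleanly. But with 2-byte wide characters, no possible implementation can handle all of Unicode.

【Answer 2】: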

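Here is a code sample that lets you experiment with different locales until you get the result you want. You can also try u16string, u32string and so on. Working with locales is a bit confusing at first, since most people program in ASCII.

Call the function I wrote from your main function: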

#include <cwchar>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

// Returns a copy of input that keeps only letters and whitespace.
// Requires the en_US.UTF-8 locale to be installed; otherwise the locale
// constructors throw std::runtime_error.
wstring removeNonAlpha(const wstring &input)
{
   locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t>("en_US.UTF-8"));
   wcout.imbue(utf8locale);
   wcout << input << endl;
   wstring res;
   locale loc2("en_US.UTF-8");
   for (wstring::size_type l = 0; l < input.size(); l++)
   {
      if (isalpha(input[l], loc2) || isspace(input[l], loc2))
      {
         // wcout is used throughout: mixing cout and wcout is unreliable
         wcout << L"is char\n";
         res += input[l];
      }
      else
      {
         wcout << L"is not char\n";
      }
   }
   wcout << L"Hello, wide to multibyte world!" << endl;
   wcout << res << endl;
   wcout << isalpha(L'Я', loc2) << endl;
   return res;
}

int main()
{
   wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~ Πυθαγόρας ὁ Σάμιος";
   removeNonAlpha(test);
   return 0;
}
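
The same filtering idea can be sketched more compactly with the erase/remove_if idiom and the locale passed in as a parameter, which makes it easier to experiment with different locales. This is an illustrative variant, not the answerer's code, and the name keepLettersAndSpaces is hypothetical.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper (not from the thread): erases every character that the
// given locale classifies as neither a letter nor whitespace.
std::wstring keepLettersAndSpaces(std::wstring text, const std::locale &loc)
{
    text.erase(std::remove_if(text.begin(), text.end(),
                              [&loc](wchar_t ch) {
                                  return !std::isalpha(ch, loc) &&
                                         !std::isspace(ch, loc);
                              }),
               text.end());
    return text;
}
```

Passing std::locale::classic() typically recognises only ASCII letters. Passing std::locale("en_US.UTF-8") is the assumption made in the answer above, and that constructor throws std::runtime_error if the locale is not installed.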

【Comments】:

wchar_t is not guaranteed to be wide enough to represent a Unicode code point. On Windows it is 16 bits and represents a UTF-16 code unit, not a code point.
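
One way around the wchar_t width problem is to decode into char32_t, which is wide enough for any code point on every platform. The sketch below is an illustration under stated assumptions, not code from the thread. The name decodeUtf8 is hypothetical, the input is assumed to be well-formed UTF-8, and continuation bytes are not validated. Deciding whether a decoded code point is a letter still needs a Unicode-aware classifier, which this sketch does not provide.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): decodes UTF-8 into UTF-32 code
// points. Assumption: the input is well-formed UTF-8. An unrecognised lead
// byte or a truncated sequence yields U+FFFD; continuation bytes are not
// validated.
std::u32string decodeUtf8(const std::string &utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0xFFFD;
        std::size_t len = 1;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }

        if (i + len > utf8.size())
        {
            // Sequence is cut off at the end of the input.
            out += static_cast<char32_t>(0xFFFD);
            break;
        }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}
```

With this, the two UTF-8 bytes of Æ become the single code point U+00C6, and a four-byte sequence from a supplementary plane also becomes one element instead of a UTF-16 surrogate pair.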
