Unicode string handling in C++

Posted 2023-02-22

[Title]: Unicode string handling in C++ [Posted]: 2013-12-02 09:22:50 [Question]:

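I have gone through plenty of threads and posts on this topic, but somehow they have not helped me add Unicode support to my code. The task itself is simple:

- read a Unicode file (.txt and .csv)
- parse it using some delimiter (for example, words separated by ") and store the words as tokens in a 2D array
- perform some operations on the tokens
- write the resulting strings to a text file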

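The problem I am facing is that some of my old functions are not compatible with wide strings. Either I cannot find a replacement for them, or I can get them to compile but no output is produced. The code works fine with ASCII, but now I need Unicode support.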

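It would be great to get sample source code. It does not need to be a whole program, but it should at least show how to read a Unicode file, parse it, store the result in tokens, and which functions to use for comparison.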

I am pasting part of the code below. I have modified a few things, so it may not compile on the first try.

The input is a text file, e.g. profile.txt, which is Unicode (UTF-16, mostly Chinese and Korean).

// adding all std headers here
#include <algorithm>
#include <cctype>
#include <cwchar>
#include <fstream>
#include <string>

const int MAX_CHARS_PER_LINE = 4072;
const int MAX_TOKENS_PER_LINE = 1;
const wchar_t* const DELIMITER = L"\"";

class IntegrityCheck
{
    public:
        std::wstring Profile_Container[5000][4];
        void Profile_PRD_Parser();
};

void IntegrityCheck::Profile_PRD_Parser()
{
    std::wstring skip (L".exe");
    std::wstring databoxtemp[1][1];
    int a=-1;

    // create a file-reading object
    std::wifstream fin("profiles.txt");  // open a file
    std::wofstream fout("out.txt");      // this dumps the parsing output

    // read each line of the file
    while (!fin.eof())
    {
        // read an entire line into memory
        wchar_t buf[MAX_CHARS_PER_LINE];

        fin.getline(buf, MAX_CHARS_PER_LINE);

        // parse the line into delimiter-separated tokens
        int n = 0; // a for-loop index

        // array to store memory addresses of the tokens in buf
        const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

        // parse the line
        // (two-argument wcstok is the older MSVC form; the ISO C/C++ version
        // takes a third argument, a pointer that holds the tokenizer state)
        token[0] = wcstok(buf, DELIMITER); // first token

        if (token[0]) // zero if line is blank
        {
            for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
            {
                token[n] = wcstok(0, DELIMITER); // subsequent tokens

                if (!token[n]) break; // no more tokens

                std::wstring str2 = token[n];

                std::size_t found = str2.find(skip);  // substring comparison

                if (found != std::wstring::npos)   // if it's an exe then it writes in Dxout for same app name on new line
                {
                    a++;
                    Profile_Container[a][0] = token[n];
                    std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::tolower);  // convert all data to lower

                    fout<<Profile_Container[a][0]<<"\t"<<Profile_Container[a][1]<<"\t"<<Profile_Container[a][2]<<"\n"; // write to file
                }
            }
        }
    }

    fout.close();
    fin.close();
}

int main()
{
    IntegrityCheck p1;
    p1.Profile_PRD_Parser();
    return 0;
}

[Comments]:

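There is a typo: the word is spelled "Integrity", not "Intigrity".

If you already have using namespace std;, there is no reason to also write std::cout; and so on. You are already using the whole std namespace.

Just remove the using namespace std line. It does not "add all std headers". If you knew what it does, I would merely advise against it, but that comment suggests you do not know what it does, so I have to advise against it even more strongly.

The first thing to do is remove every mention of char. Do not cast to char when calling getline, and use wcstok instead of strtok.

"Now I need unicode support." is not a very good problem description. What do you want to do with the data? How is the input supposed to be encoded? Which platform is this? (Prefixing things with w does not magically make them "support Unicode".)

[Answer 1]: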

Taking a quick look at your code, the only changes I can see that are needed are

const wchar_t* const DELIMITER = L"\"";

fin.getline(buf, MAX_CHARS_PER_LINE);

token[0] = wcstok(buf, DELIMITER);

std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::towlower);

I am not sure whether towlower can convert every Unicode character to lowercase, but if your text is Chinese and Korean I guess that is not a big issue.
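To make the towlower point concrete, here is a minimal sketch of lowercasing a std::wstring. The helper name to_lower_copy is hypothetical and not part of the original post. It assumes the default "C" locale, where only case mappings known to that locale are applied; Chinese and Korean characters have no case and pass through unchanged.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Hypothetical helper (not from the original post): returns a lowercased copy.
// std::towlower maps one wide character at a time using the current C locale,
// so characters without a lowercase form are returned unchanged.
std::wstring to_lower_copy(std::wstring text)
{
    std::transform(text.begin(), text.end(), text.begin(),
                   [](wchar_t ch) {
                       return static_cast<wchar_t>(std::towlower(ch));
                   });
    return text;
}
```

Calling to_lower_copy(L"PROFILE.EXE") yields L"profile.exe". The lambda avoids overload ambiguity that can arise when the name of a standard function is passed directly to std::transform.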

EDIT

On Windows with Visual Studio 2010, the following is needed

#include <codecvt>
#include <locale>

wifstream fin("profiles.txt", ios_base::binary);  //open a file
fin.imbue(std::locale(fin.getloc(),
   new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

This works for me for files encoded as UTF-16 big-endian (but not little-endian).
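Since the byte order matters here, one option is to inspect the byte order mark (BOM) yourself instead of relying on the facet. The following is only a sketch under stated assumptions: the names Utf16Order, detect_utf16_bom and decode_utf16 are hypothetical, the file contents are assumed to have been read in binary mode into a std::string, and input without a BOM is assumed to be little-endian (the common case on Windows).

```cpp
#include <cstddef>
#include <string>

enum class Utf16Order { Unknown, LittleEndian, BigEndian };

// Hypothetical helper: looks at the first two bytes for a UTF-16 BOM.
Utf16Order detect_utf16_bom(const std::string& bytes)
{
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian;
        if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;
    }
    return Utf16Order::Unknown;
}

// Hypothetical helper: decodes raw UTF-16 bytes into UTF-32 code points.
// Assumption: input without a BOM is treated as little-endian.
std::u32string decode_utf16(const std::string& bytes)
{
    const Utf16Order order = detect_utf16_bom(bytes);
    std::size_t pos = (order == Utf16Order::Unknown) ? 0 : 2;  // skip the BOM
    const bool little = (order != Utf16Order::BigEndian);

    // Reads one 16-bit code unit starting at byte offset i.
    auto unit_at = [&bytes, little](std::size_t i) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
        return static_cast<char32_t>(little ? (b0 | (b1 << 8)) : ((b0 << 8) | b1));
    };

    std::u32string out;
    while (pos + 1 < bytes.size()) {       // a trailing odd byte is ignored
        char32_t unit = unit_at(pos);
        pos += 2;
        // Combine a high surrogate with the low surrogate that follows it.
        if (unit >= 0xD800 && unit <= 0xDBFF && pos + 1 < bytes.size()) {
            const char32_t low = unit_at(pos);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                pos += 2;
            }
        }
        out.push_back(unit);               // unpaired surrogates are kept as-is
    }
    return out;
}
```

Decoding to char32_t sidesteps the fact that wchar_t is 16 bits wide on Windows and usually 32 bits elsewhere, at the cost of not plugging directly into the wide-stream functions.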

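The only problem with your current code is reading the file (and perhaps writing it, which I have not looked at). Once you can get the characters from the file into strings, you should be fine.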
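Once a line is available as a std::wstring, the tokenizing and substring check can be done without fixed-size buffers or wcstok. This is a sketch of an alternative, not the approach used in the question; the names split_tokens and contains_text are hypothetical. Unlike wcstok, it keeps empty tokens, so a line that starts with the delimiter produces an empty first token.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a line on a single delimiter character.
// Empty tokens are preserved (wcstok would skip them).
std::vector<std::wstring> split_tokens(const std::wstring& line, wchar_t delimiter)
{
    std::vector<std::wstring> tokens;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = line.find(delimiter, start);
        if (end == std::wstring::npos) {
            tokens.push_back(line.substr(start));  // last token
            break;
        }
        tokens.push_back(line.substr(start, end - start));
        start = end + 1;                           // continue after the delimiter
    }
    return tokens;
}

// Hypothetical helper: true if needle occurs anywhere in token.
bool contains_text(const std::wstring& token, const std::wstring& needle)
{
    return token.find(needle) != std::wstring::npos;
}
```

For a line such as "notepad.exe" followed by other text, split_tokens(line, L'"') returns an empty token, then notepad.exe, then the remainder, and contains_text(token, L".exe") selects the executable name.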

If the above does not work for you, then I am not sure what to suggest. This page has the gory details.

[Comments]:

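Thanks, John, for being more specific. I will make the changes, see how it goes, and update the thread.

I made the corrections in the code and it compiles, but I am missing something. getline reads the line from the Unicode file, and then I try to break it into tokens using the delimiter. However, getline seems to get everything as binary, so the buffer cannot be broken into tokens, the comparison fails, and I get blank output. Do I need to convert it back to ASCII? Would that not lose data? How should I handle this? All the logic I wrote assumed simple ASCII strings, and now it has become difficult. Any suggestion to make it work is very welcome. Or would you suggest another way of doing this?

I searched but could not find relevant articles or sample code on parsing Unicode text files. I am really stuck on this Unicode issue now. Can somebody help me?

@NeileshC I tried your code and was surprised that it did not work (for me). The problem is that just using wchar_t is not enough to tell the compiler that your file is UTF-16. There does not seem to be any fully platform-independent way to do this, so how to proceed depends on your compiler and so on. I have updated the answer above with some code that works for me.

[Answer 2]: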

Final code that compiles and runs:

fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::codecvt_mode(std::little_endian|std::consume_header)>));
fout.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::codecvt_mode(std::little_endian|std::consume_header)>));

while (!fin.eof())
{
    wchar_t buf[MAX_CHARS_PER_LINE];

    fin.getline(buf, MAX_CHARS_PER_LINE);

    wchar_t* token[MAX_TOKENS_PER_LINE] = {};
    token[0] = wcstok(buf, DELIMITER);

    if (token[0]) // zero if line is blank
    {
        int n = 0;
        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens

            if (!token[n]) break; // no more tokens

            std::wstring str2 = token[n];

            std::size_t found = str2.find(skip);  // substring comparison (skip is L".exe", as in the question)

            if (found != std::wstring::npos)   // if it's an exe then it writes in Dxout for same app name on new line
            {
                a++;
                Profile_Container[a][0] = token[n];
                fout<<Profile_Container[a][0];
            }
        }
    }
}
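One detail on the output side: std::consume_header affects reading only, and std::generate_header is the mode that makes the facet write a BOM. If the output file has to start with a BOM regardless of the facet, the bytes can also be produced by hand. The following is a sketch with a hypothetical helper name, encode_utf16le, that assumes the text is available as UTF-32 code points and that the result is written through a stream opened in binary mode.

```cpp
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-16 little-endian
// bytes, prefixed with the BOM (FF FE). Input is assumed to be valid Unicode.
std::string encode_utf16le(const std::u32string& text)
{
    std::string bytes("\xFF\xFE", 2);  // byte order mark for little-endian

    // Appends one 16-bit code unit, low byte first.
    auto put_unit = [&bytes](unsigned unit) {
        bytes.push_back(static_cast<char>(unit & 0xFF));
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));
    };

    for (char32_t cp : text) {
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            const unsigned v = static_cast<unsigned>(cp) - 0x10000;
            put_unit(0xD800 + (v >> 10));
            put_unit(0xDC00 + (v & 0x3FF));
        } else {
            put_unit(static_cast<unsigned>(cp));
        }
    }
    return bytes;
}
```

The returned string can be written with std::ofstream opened with std::ios::binary, so that no newline translation alters the encoded bytes.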

[Comments]:
