Get the user's codepage name for functions in boost::locale::conv

Posted 2023-02-21

【Published】: 2016-06-29 18:49:58 【Question】:

The task at hand

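On Windows, I parse a file name from UTF-8 encoded XML. I need to pass that file name to a function that I cannot change. Internally it uses _fsopen(), which does not support Unicode strings.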

Current approach

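My current approach is to convert the file name to the user's charset, hoping that the file name can be represented in that encoding. I use boost::locale::conv::from_utf() to convert from UTF-8, and boost::locale::util::get_system_locale() to get the name of the current locale.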

Life is good?

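I am on a German system that uses code page Windows-1252, so get_system_locale() correctly yields de_DE.windows-1252. If I test the approach with a file name containing umlauts, everything works as expected.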

The problem

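Just to make sure, I switched my system locale to Ukrainian, which uses code page Windows-1251. With some Cyrillic letters in the file name, my approach fails. The reason is that get_system_locale() still yields de_DE.windows-1252, which is now incorrect.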

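On the other hand, GetACP() correctly yields 1252 for the German locale and 1251 for the Ukrainian locale. I also know that Boost.Locale can convert to a given locale, because this small test program works as I expect: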

#include <boost/locale.hpp>
#include <iostream>
#include <string>
#include <windows.h>

int main()
{
    std::cout << "Codepage: " << GetACP() << std::endl;
    std::cout << "Boost.Locale: " << boost::locale::util::get_system_locale() << std::endl;

    namespace blc = boost::locale::conv;
    // Cyrillic small letter zhe -> \xe6 (ж on 1251, æ on 1252)
    std::string const test1251 = blc::from_utf(std::string("\xd0\xb6"), "windows-1251");
    std::cout << "1251: " << static_cast<int>(test1251.front()) << std::endl;
    // Latin small letter sharp s -> \xdf (Я on 1251, ß on 1252)
    auto const test1252 = blc::from_utf(std::string("\xc3\x9f"), "windows-1252");
    std::cout << "1252: " << static_cast<int>(test1252.front()) << std::endl;
}
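The comments in the test program rely on the same byte meaning different letters under the two code pages. A minimal sketch of that effect follows; decode_cp1251 and decode_cp1252 are hypothetical helper names introduced here for illustration, not Boost.Locale or Windows functions, and they cover only the byte ranges needed for the example.

```cpp
// Illustrative helpers only (assumed names, not a library API).

// Decode one byte under Windows-1251. Bytes 0xC0..0xFF are the letters
// U+0410..U+044F in order; ASCII maps to itself. Other bytes are outside
// the scope of this sketch and yield 0.
char32_t decode_cp1251(unsigned char b)
{
    if (b < 0x80) return b;
    if (b >= 0xC0) return 0x0410u + (b - 0xC0u);
    return 0;
}

// Decode one byte under Windows-1252. Bytes 0xA0..0xFF coincide with
// Latin-1 (U+00A0..U+00FF); ASCII maps to itself. Other bytes yield 0.
char32_t decode_cp1252(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0) return b;
    return 0;
}
```

Byte 0xE6 decodes to U+0436 (ж) under 1251 but to U+00E6 (æ) under 1252, which is why a file name converted with the wrong code page name no longer matches the file on disk.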

Questions

How can I query the name of the user's locale in a format supported by Boost.Locale? Using std::locale("").name() yields German_Germany.1252, and using that name results in a boost::locale::conv::invalid_charset_error exception.

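Is it possible that the system locale stays de_DE.windows-1252 even though I supposedly changed it as a local administrator? Likewise, the system language is German although the language of my account is English. (The login screen is in German until I log in.)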

Should I stick with using short filenames instead? That does not seem to work reliably, though.

Fine print

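The compiler is MSVC18. Boost is version 1.56.0, and the backend should be winapi. The system is Win7; the system language is German and the user language is English.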

【Discussion】:

【Answer 1】:

ANSI is deprecated, so don't bother with it.

Windows uses UTF16, so you have to use MultiByteToWideChar to convert from UTF8 to UTF16. This conversion is safe.

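#include <string>
#include <windows.h>

// Convert a UTF-8 string to UTF-16 with the Windows API
std::wstring getU16(const std::string &str)
{
    if (str.empty()) return std::wstring();
    // First call: query the required length in wchar_t units
    int sz = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    // Second call: convert into the buffer
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}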
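To illustrate why the UTF8 to UTF16 step loses nothing, here is a portable sketch of the same transformation written by hand. It is not what the answer uses (on Windows, MultiByteToWideChar with CP_UTF8 does this job); utf8_to_utf16 is a name chosen for this example, and the sketch deliberately skips some validation, such as rejecting overlong forms and encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Portable illustration only (assumed helper, not a library API).
// Decodes each UTF-8 sequence to a code point and re-encodes it as UTF-16.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells how many bytes the sequence has
        if (lead < 0x80)                { cp = lead;         len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1Fu; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0Fu; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07u; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes six more bits
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3Fu);
        }
        i += len;
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Every Unicode code point has a UTF-16 form, so unlike the conversion to an ANSI code page, nothing has to be replaced or dropped here.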

Then you use _wfsopen (from the link you provided) to open the file with the UTF16 name.

#include <cstdio>
#include <iostream>

int main()
{
    //UTF8 source:
    std::string filename_u8;

    //This line works in VS2015 only
    //For older version comment out the next line, obtain UTF8 from another source
    filename_u8 = u8"c:\\test\\__ελληνικά.txt";

    //convert to UTF16
    std::wstring filename_utf16 = getU16(filename_u8);

    FILE *file = NULL;
    _wfopen_s(&file, filename_utf16.c_str(), L"w");
    if (file)
    {
        //Add BOM, optional...

        //Write the file name in to file, for testing...
        fwrite(filename_u8.data(), 1, filename_u8.length(), file);

        fclose(file);
    }
    else
    {
        std::cout << "access denied, or folder doesn't exist...\n";
    }

    return 0;
}

Edit: to get ANSI from UTF8, use GetACP() as the target code page:

// Convert a string in the given code page to UTF-16
std::wstring string_to_wstring(const std::string &str, int codepage)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}

// Convert a UTF-16 string to the given code page
std::string wstring_to_string(const std::wstring &wstr, int codepage)
{
    if (wstr.empty()) return std::string();
    int sz = WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), 0, 0, 0, 0);
    std::string res(sz, 0);
    WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), &res[0], sz, 0, 0);
    return res;
}

// UTF-8 -> UTF-16 -> target code page (for example the value of GetACP())
std::string get_ansi_from_utf8(const std::string &utf8, int codepage)
{
    std::wstring utf16 = string_to_wstring(utf8, CP_UTF8);
    std::string ansi = wstring_to_string(utf16, codepage);
    return ansi;
}

【Discussion】:

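Unfortunately, this does not help under the assumption that I cannot change the function. But it is still promising, because it is the interface that is too stable to change, while changing the function only takes effort. Thanks, I'll give it a try.

The problem you describe is rather complicated, which is part of the reason Unicode was invented in the first place. I added a function to get ANSI from UTF8, which is somewhat like what Iverelo suggested. See also the link about the system language, although I am not sure whether it helps.

Fortunately, I managed to adopt your first suggestion. It allows keeping the interface stable while changing the internals. However, I used boost::locale::conv::utf_to_utf<wchar_t>() instead of your getU16().

Unfortunately, the actual question remains unanswered. But you have now provided a way to work around the Boost limitation, so I will accept this answer.

【Answer 2】: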

Barmak's way is the best way.

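To clear up the locale matter: the process always starts in the "C" locale. You can use the setlocale function to set the locale to the system default or to any arbitrary locale.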

#include <clocale>

// Get the current locale
setlocale(LC_ALL,NULL);

// Set locale to system default
setlocale(LC_ALL,"");

// Set locale to German
setlocale(LC_ALL,"de-DE");
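The point about the process starting in the "C" locale can be checked with a small sketch. The two wrapper names below are hypothetical and exist only for this example; the name returned after switching to the default locale depends on the platform and user settings, so no particular value is assumed.

```cpp
#include <clocale>
#include <string>

// Returns the name of the locale currently in effect for LC_ALL.
// Passing a null pointer only queries; it does not change the locale.
std::string current_locale_name()
{
    const char *name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Switches to the environment's default locale and returns its name,
// or an empty string if the request could not be honored.
std::string use_system_default_locale()
{
    const char *name = std::setlocale(LC_ALL, "");
    return name ? name : "";
}
```

Calling current_locale_name() before any setlocale call yields "C"; calling it again after use_system_default_locale() yields a platform-specific name, which on MSVC looks like the German_Germany.1252 mentioned in the question.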

【Discussion】:

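Thanks for your answer. The problem is that the locale overloads of the conversion functions do not work with standard locales. Even when the language_territory part is stripped, the charset-as-string overloads fail with the names of these locales.

Are the conversion functions you mention still the Boost functions? Unfortunately, I do not have much experience with the Boost locale functions. My trick in the past for converting from one encoding to another on Windows has been to use MultiByteToWideChar to get wide characters and then WideCharToMultiByte to go back to a different encoding.

Yes, still the Boost functions. I guess they do the same internally; the key issue is how to map the MS uint code page ID to the Boost code page string.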
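One way to approach the mapping raised in the last comment is to build the charset string from the numeric code page. The sketch below is an assumption-laden illustration, not a Boost.Locale facility: charset_name_from_codepage is a hypothetical helper, and only "windows-1251" and "windows-1252" are shown to be accepted by from_utf() in the question's test program. Whether the other names work should be verified against the Boost backend in use.

```cpp
#include <string>

// Hypothetical helper, not part of Boost.Locale or the Windows API.
// Builds a charset name from a Windows code page number such as the
// value returned by GetACP(). Only the windows-125x family and UTF-8
// are handled; any other code page returns an empty string so the
// caller can decide how to fail.
std::string charset_name_from_codepage(unsigned int codepage)
{
    if (codepage >= 1250 && codepage <= 1258)
        return "windows-" + std::to_string(codepage);
    if (codepage == 65001)  // CP_UTF8
        return "UTF-8";
    return std::string();
}
```

On Windows the result could then be passed along the lines of boost::locale::conv::from_utf(utf8_name, charset_name_from_codepage(GetACP())), after checking that the returned name is not empty.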
