Convert unicode codepoint to utf-16

Posted 2023-02-24


【Title】: Convert unicode codepoint to utf-16 【Posted】: 2021-06-15 03:49:51 【Question】:

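In C++ on Windows, how do I convert an XML character reference of the form &#xhhhh; to a UTF-16 little-endian string?

I'm thinking that if the hhhh part is 4 characters or fewer, then it is 2 bytes and fits into one UTF-16 character. However, this wiki page has a table of character references, and some near the bottom are 5-digit hex numbers that won't fit in two bytes. How can they be converted to UTF-16?

I'm wondering whether the MultiByteToWideChar function is capable of doing the job.

My understanding of how a code point larger than 2 bytes gets converted to UTF-16 is lacking! (For that matter, I'm not too sure how a code point larger than 1 byte gets converted to UTF-8, but that's another question.)

Thanks.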

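【Comments】:

MultiByteToWideChar is entirely unsuitable for this task. Related: MultiByteToWideChar for Unicode code pages 1200, 1201, 12000, 12001.

The algorithm for converting a code point to UTF-16 is described on Wikipedia, see UTF-16.

@RemyLebeau But the bigger problem in this question is converting each &#xhhhh; string into a code point in the first place. Once that's done, your suggestion will probably be helpful.

@MarkRansom Parsing an XML character reference into a numeric code point value is trivial, especially if you use an actual XML parser and let it do the work for you.

【Answer 1】: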

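Unicode code points (UTF-32) are 4 bytes wide and can be converted to a UTF-16 character (and possible surrogate) using the following code (which I happened to have lying around).

It has not been heavily tested, so bug reports are gratefully accepted: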

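#include <array>

/**
 * Converts a UTF-32 code point to UTF-16 (and optional surrogate)
 * @param utf32 - UTF-32 code point
 * @param utf16 - returned UTF-16 character
 * @return - The number of code units in the UTF-16 char (1 or 2).
 */
unsigned utf32_to_utf16(char32_t utf32, std::array<char16_t, 2>& utf16)
{
    // Code points in the Basic Multilingual Plane fit in one code unit.
    if(utf32 < 0xD800 || (utf32 > 0xDFFF && utf32 < 0x10000))
    {
        utf16[0] = char16_t(utf32);
        utf16[1] = 0;
        return 1;
    }

    // NOTE: no validation. Surrogate code points (0xD800-0xDFFF) and values
    // above 0x10FFFF also reach this branch and yield meaningless output.
    utf32 -= 0x010000;

    // Top 10 bits go to the high surrogate, bottom 10 bits to the low one.
    utf16[0] = char16_t(((0b1111'1111'1100'0000'0000 & utf32) >> 10) + 0xD800);
    utf16[1] = char16_t(((0b0000'0000'0011'1111'1111 & utf32) >> 00) + 0xDC00);

    return 2;
}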
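
Since the question asks for a little-endian result, here is a minimal sketch of that last step. It repeats the same arithmetic in a helper called `encode_utf16` and adds `to_utf16le_bytes`; both names are hypothetical and not part of the answer above. Like the answer's code, it does not validate its input.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: same arithmetic as the answer, returned as a string.
// No validation is performed (surrogates and values > 0x10FFFF are not rejected).
std::u16string encode_utf16(char32_t cp)
{
    if (cp < 0x10000)
        return std::u16string(1, static_cast<char16_t>(cp));

    cp -= 0x10000;  // leaves a 20-bit value
    std::u16string out;
    out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate: top 10 bits
    out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate: bottom 10 bits
    return out;
}

// Hypothetical helper: serialise each code unit as two bytes, low byte first.
std::vector<std::uint8_t> to_utf16le_bytes(const std::u16string& s)
{
    std::vector<std::uint8_t> bytes;
    for (char16_t unit : s) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
    }
    return bytes;
}
```

For example, U+1F600 is a 5-hex-digit code point: `encode_utf16(0x1F600)` gives the pair 0xD83D 0xDE00, and `to_utf16le_bytes` turns that into the bytes 3D D8 00 DE. On Windows, where `wchar_t` is 16 bits and the platform is little-endian, the code units can usually be stored directly without the byte-splitting step.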

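【Comments】:

You might consider handling the range 0xd800 to 0xdfff specially, since those are probably malformed input.

@MarkRansom Yes, I was wondering about the lack of error checking (I wrote this a long time ago). But looking again at the *** article, it says that although the range is technically bad code points, a lot of software allows them anyway... so I'll have to think about it.

It may also not be malformed input if the code points pair up to make a valid UTF-16 character. For example, JSON is encoded this way, see e.g. Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?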
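
To make the parsing and validation points from the comments concrete, here is a minimal sketch that turns a single `&#xhhhh;` reference into a code point and applies the strict policy suggested above. The function name `parse_hex_char_ref` is hypothetical, `std::optional` requires C++17, and rejecting lone surrogates is an assumption of this sketch rather than something the answer's author settled on. It does not check XML's further restrictions on which characters are allowed, and a real XML parser would normally do all of this for you.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: parse "&#xhhhh;" into a code point.
// Returns std::nullopt for malformed text, surrogates, or values > 0x10FFFF.
std::optional<char32_t> parse_hex_char_ref(const std::string& ref)
{
    // Need at least "&#x", one hex digit, and ";".
    if (ref.size() < 5 || ref.compare(0, 3, "&#x") != 0 || ref.back() != ';')
        return std::nullopt;

    std::uint32_t value = 0;
    for (std::size_t i = 3; i + 1 < ref.size(); ++i) {
        const char c = ref[i];
        std::uint32_t digit;
        if (c >= '0' && c <= '9')      digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return std::nullopt;      // not a hex digit

        value = value * 16 + digit;
        if (value > 0x10FFFF)          // outside the Unicode range; also stops overflow
            return std::nullopt;
    }

    // Policy choice of this sketch: treat a lone surrogate as malformed input.
    if (value >= 0xD800 && value <= 0xDFFF)
        return std::nullopt;

    return static_cast<char32_t>(value);
}
```

With this policy, `parse_hex_char_ref("&#x1F600;")` yields 0x1F600 while `"&#xD800;"` and `"&#x110000;"` are rejected. A caller that needs to accept surrogate halves written separately, as the JSON comment describes, would have to pair them up instead of rejecting them.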
