Wrong Unicode string: looks the same but is fundamentally different

Posted 2023-02-24

【Title】: Error unicode string, look the same but the essence is not the same 【Published】: 2021-09-15 21:03:31 【Question】:

Error text unicode

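When I use cURL to fetch the content of a website, the text I get back looks the same as what I expect, but it is actually different. This affects how I process and compare documents. Is there a way to convert $loi to the standard form of $check so that I can process it correctly?

You can copy the contents of $loi or $check into a cmd window and immediately see the difference, as shown in the image.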

$loi = 'người được tiêm';
    $check = 'người được tiêm';
    var_dump($loi);
    var_dump($check);

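【Comments on the question】:

【Answer 1】:

Some code points in Unicode look the same because they have similar or even identical glyphs, but they are not actually the same.

This should come as no surprise once you realise that Unicode aims to accommodate as many languages as possible, and that it generally encodes the purpose of a letter rather than its form.

For example, U+2010 and U+2011 (hyphen and non-breaking hyphen) may look exactly the same, since the latter is just a non-breaking version of the former.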
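
As a rough illustration of the hyphen example, the following Python sketch prints the official names of the two code points and confirms that they compare as unequal. The variable names are only illustrative.

```python
import unicodedata

hyphen = "\u2010"
non_breaking_hyphen = "\u2011"

# Print each code point with its official Unicode name.
for ch in (hyphen, non_breaking_hyphen):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Different code points, so the one-character strings are not equal.
print(hyphen == non_breaking_hyphen)
```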

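If you feed your two strings into the Unicode to code points converter, you will see the difference.

For brevity, I have only done the first word of each string, giving the code points in hex, with square brackets around each "character":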

người [6e] [67] [75 31b] [6f 31b 300] [69]
người [6e] [67] [1b0]    [1edd]       [69]
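
If no online converter is at hand, a listing like the one above can be approximated in Python. This is only a sketch: `code_point_groups` is a hypothetical helper written for this illustration, and it assumes that a "character" is a base code point plus any combining marks that follow it. The two test strings are written with escapes so that their form is unambiguous.

```python
import unicodedata

def code_point_groups(text):
    """Group each base code point with the combining marks that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch):
            groups[-1].append(ch)
        else:
            groups.append([ch])
    return " ".join(
        "[" + " ".join(f"{ord(c):x}" for c in group) + "]" for group in groups
    )

decomposed = "ngu\u031bo\u031b\u0300i"  # separate combining marks
composed = "ng\u01b0\u1eddi"            # precomposed letters

print(code_point_groups(decomposed))  # [6e] [67] [75 31b] [6f 31b 300] [69]
print(code_point_groups(composed))    # [6e] [67] [1b0] [1edd] [69]
```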

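For example, the ư in the first is 75 31b, that is, Latin small letter U followed by combining horn (a modifier for the letter). In the second, it is the single code point 1b0, Latin small letter U with horn (the modifier is already built into the code point).

Similarly, the ờ in the first is 6f 31b 300, three separate code points for Latin small letter O, the combining horn modifier and the combining grave accent modifier. In the second it is 1edd, where both modifiers have been merged into the single code point Latin small letter O with horn and grave.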
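
The names quoted above can be checked with Python's `unicodedata.name`. The snippet below is a small sketch that simply looks up each code point mentioned in the two paragraphs; the `samples` string is an illustrative construction, not part of the original question.

```python
import unicodedata

# u + horn, precomposed u-horn, o + horn + grave, precomposed o-horn-grave
samples = "u\u031b" + "\u01b0" + "o\u031b\u0300" + "\u1edd"

names = [unicodedata.name(ch) for ch in samples]
for ch, name in zip(samples, names):
    print(f"{ord(ch):04x}  {name}")
```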

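So, in these cases, it is not really a different intent behind the glyph, just a different way of representing it:

a single code point with the modifiers built in; or a code point followed by separate combining modifier code points.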

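If you need to treat them as identical, Unicode has the concepts of equivalence and normalisation.

Equivalence means that multiple code point sequences are really variants of the same "thing"; normalisation is the process of mapping equivalents to a single variant so that they can be compared.

In Python, I would use the following to map one way or the other: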
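
To make the comparison step concrete, here is a minimal sketch that normalises both strings to the same form before comparing them. `canonically_equal` is a hypothetical helper name, and NFC is an arbitrary choice; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after mapping both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

decomposed = "ngu\u031bo\u031b\u0300i"  # separate combining marks
composed = "ng\u01b0\u1eddi"            # precomposed letters

print(decomposed == composed)                   # False: raw code points differ
print(canonically_equal(decomposed, composed))  # True: equivalent once normalised
```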

import unicodedata
normalised_composed = unicodedata.normalize('NFC', 'người')
normalised_decomposed = unicodedata.normalize('NFD', 'người')
# Composed is the short sequence (fewest code points); decomposed is the long one.

The following transcript shows the output, though I have reformatted and annotated it for readability:

>>> bytearray('người', 'utf-16')
bytearray(b'\xff\xfe                # Unicode BOM for UTF-16.
    n\x00                           # n.
    g\x00                           # g.
    u\x00 \x1b\x03                  # u, combining horn.
    o\x00 \x1b\x03 \x00\x03         # o, combining horn & grave.
    i\x00                           # i.
')

>>> bytearray(unicodedata.normalize('NFD', 'người'), 'utf-16')
bytearray(b'\xff\xfe                # Identical to previous, it
    n\x00                           #   was already decomposed.
    g\x00
    u\x00 \x1b\x03
    o\x00 \x1b\x03 \x00\x03
    i\x00
')

>>> bytearray(unicodedata.normalize('NFC', 'người'), 'utf-16')
bytearray(b'\xff\xfe                # BOM.
    n\x00                           # n.
    g\x00                           # g.
    \xb0\x01                        # Latin u with horn.
    \xdd\x1e                        # Latin o with horn & grave.
    i\x00                           # i.
')
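
To find out which form a given string is already in without dumping its bytes, `unicodedata.is_normalized` can be used. This sketch assumes Python 3.8 or later, where that function was added; the variable names are illustrative.

```python
import unicodedata

decomposed = "ngu\u031bo\u031b\u0300i"  # separate combining marks
composed = "ng\u01b0\u1eddi"            # precomposed letters

for label, text in (("decomposed", decomposed), ("composed", composed)):
    print(
        label,
        "code points:", len(text),
        "NFC:", unicodedata.is_normalized("NFC", text),
        "NFD:", unicodedata.is_normalized("NFD", text),
    )
```

The decomposed string has more code points than the composed one, which matches the longer byte sequence in the transcript above.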

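I'm not entirely sure what language you are using (there are currently no tags) but, if it claims to handle Unicode, it should have equivalent functionality for this (so I still consider this answer useful should you add a tag later).

Just search for <your_language> unicode normalisation in your search engine of choice.

【Comments】:

Thanks for the explanation. How do I convert $loi to $check so that I can compare and process them?

@cuongbn: added a block at the end to address that.

Thanks for your support. I use PHP, and I used PHP's Normalizer library to normalise the string: Normalizer::normalize($loi, Normalizer::FORM_C)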
