How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

Posted 2023-02-23

【Posted】: 2011-03-14 07:09:26 【Question】:

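I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, some day in the future, utf8 might support them as well.

But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.

My question is: how to filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.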
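For reference, this is roughly what the 'replace' error handler does when encoding. It is a minimal Python 3 sketch with a made-up sample string; note that the result is a byte string, which is exactly the part the question wants to avoid:

```python
# Python 3 sketch: the 'replace' handler substitutes '?' for every
# character that the target codec cannot represent.
sample = 'a\u20acb'  # 'a', EURO SIGN, 'b' (illustrative input)
encoded = sample.encode('ascii', 'replace')
print(encoded)       # a bytes object, no longer a text string
```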

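I DON'T want to escape the characters before storing them in MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.

See also:

"Incorrect string value" warning when saving some unicode characters to MySQL (at the Django ticket system)

'????' Not a valid unicode character, but in the unicode character set? (at Stack Overflow)

[EDIT] Added tests about the proposed solutions

So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick test to find the simplest and fastest one.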

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

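Results:

filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in).

filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression).

I did no test using itertools because... well... that solution, although interesting, was quite big and complex.

Conclusion

The RegEx solution was, by far, the fastest one.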
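The script above is Python 2 (unichr, xrange, print statements). For readers on Python 3, the comparison could be re-created along these lines. This is only a sketch, not the original benchmark: the sample string, the repeat count and the use of timeit instead of cProfile are assumptions made here, and no timings are claimed.

```python
import re
import timeit

# Same character class as the question's script, written as a raw string
# (the re module has understood \uXXXX escapes since Python 3.3).
RE_PATTERN = re.compile(r'[^\u0000-\uD7FF\uE000-\uFFFF]')

def filter_using_re(text):
    return RE_PATTERN.sub('\ufffd', text)

def filter_using_python(text):
    return ''.join(
        ch if ch < '\ud800' or '\ue000' <= ch <= '\uffff' else '\ufffd'
        for ch in text
    )

# Illustrative input: mostly BMP text with a few 4-byte characters mixed in.
sample = 'plain text \u20ac \U0001d41f ' * 500

for func in (filter_using_re, filter_using_python):
    seconds = timeit.timeit(lambda: func(sample), number=50)
    print('%-22s %.4f s' % (func.__name__, seconds))
```

Both functions are expected to return the same string; only their speed differs.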

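【Comments】:

【Answer 1】:

Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes (or less) in UTF-8. The \uD800-\uDFFF range is reserved for surrogates, which UTF-16 uses to represent characters beyond \uFFFF. I do not know Python, but you should be able to set up a regular expression that matches anything outside those ranges.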

pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)

Edit: adding the Python from Denilson Sá's script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
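If it is also useful to know how many characters were replaced (for logging, say), the same pattern can be used with subn(). A small Python 3 sketch; the function name replace_and_count is hypothetical and not part of the answer:

```python
import re

# Same character class as above, as a Python 3 raw string.
RE_PATTERN = re.compile(r'[^\u0000-\uD7FF\uE000-\uFFFF]')

def replace_and_count(text, replacement='\ufffd'):
    """Return (filtered_text, number_of_replacements)."""
    return RE_PATTERN.subn(replacement, text)

filtered, count = replace_and_count('a\U0001d41fb\U0001d428')
print(count)
```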

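【Comments】:

Note that strings such as "[^\u0000-\uFFFF]" are not raw strings, i.e. the string literal has no r prefix!

I had to change the end of the first range in u'[^\u0000-\uD7FF\uE000-\uFFFF]' from '\uD7FF' to '\u07FF', because some characters were still not being cleaned out.

【Answer 2】:

You may skip the decoding and encoding steps and directly inspect the value of the first byte of each character in the 8-bit (byte) string. According to UTF-8: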

#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
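The same table can be expressed in terms of code points rather than bit patterns. The following Python 3 sketch (the helper name utf8_len is made up here) computes the expected length from the code point and cross-checks it against the real encoder:

```python
def utf8_len(ch):
    """Number of bytes a single non-surrogate character takes in UTF-8."""
    cp = ord(ch)
    if cp < 0x80:
        return 1      # 0xxxxxxx
    if cp < 0x800:
        return 2      # 110xxxxx 10xxxxxx
    if cp < 0x10000:
        return 3      # 1110xxxx 10xxxxxx 10xxxxxx
    return 4          # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

# Illustrative samples: ASCII, Latin-1, euro sign, a character outside the BMP.
for ch in ('A', '\u00e9', '\u20ac', '\U0001d41f'):
    print('U+%04X' % ord(ch), utf8_len(ch), len(ch.encode('utf-8')))
```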

According to that, you only need to check the value of the first byte of each character to filter out the 4-byte characters:

def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)

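Skipping the decoding and encoding part will save you some time, and for smaller strings that mostly contain 1-byte characters this could even be faster than the regular expression filtering.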
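The function above works on Python 2 byte strings. A possible Python 3 variant, operating on a bytes object and building a new buffer instead of deleting slices in place, might look like the sketch below. The name drop_4byte_sequences is hypothetical, and like the original it assumes the input is already valid UTF-8.

```python
def drop_4byte_sequences(data):
    """Return a copy of UTF-8 encoded `data` without 4-byte sequences.

    Assumes `data` is well-formed UTF-8; no validation is done.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]          # indexing bytes yields an int in Python 3
        if lead < 0x80:
            size = 1
        elif lead < 0xE0:
            size = 2
        elif lead < 0xF0:
            size = 3
        else:
            size = 4
        if size < 4:
            out += data[i:i + size]   # keep 1- to 3-byte sequences
        i += size
    return bytes(out)

raw = 'a\u20ac\U0001d41fb'.encode('utf-8')
print(drop_4byte_sequences(raw).decode('utf-8'))
```

Appending to a bytearray avoids the repeated list slicing of the original, which shifts every following element each time a character is removed.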

【Comments】:

【Answer 3】:

Just for the fun of it, an itertools monstrosity :)

import functools as ft, itertools as it, operator as op

def max3bytes(unicode_string):

    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))

    # is the argument greater than 65535 (i.e. outside the BMP)?
    selector= ft.partial(op.lt, 65535)

    # using the character ordinals, return 0 or 1 based on `selector`
    indexer= it.imap(selector, it.imap(ord, unicode_string))

    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
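On Python 3, izip and imap no longer exist because the built-in zip and map are already lazy. A rough port, renamed max3bytes_py3 here to keep it apart from the original, could look like this; it uses operator.lt so that U+FFFF itself is kept:

```python
import functools as ft
import itertools as it
import operator as op

def max3bytes_py3(text):
    # pairs of (original character, replacement character)
    pairs = zip(text, it.repeat('\ufffd'))

    # True when the ordinal is greater than 0xFFFF, i.e. outside the BMP
    selector = ft.partial(op.lt, 0xFFFF)

    # False/True act as index 0/1 into each pair
    indexer = map(selector, map(ord, text))

    return ''.join(map(tuple.__getitem__, pairs, indexer))

print(max3bytes_py3('a\U0001d41fb'))
```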

【Comments】:

【Answer 4】:

Encode as UTF-16, then re-encode each 16-bit code unit as UTF-8.

>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'

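Note that you can't encode after joining, since the surrogate pairs may be decoded (recombined into single characters) before re-encoding.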
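On Python 3 the stock UTF-8 codec refuses to encode surrogates, so the snippet above fails there. One way to reproduce the same bytes is the 'surrogatepass' error handler. The following is only a sketch (the name to_surrogate_utf8 is made up), and the output is not well-formed UTF-8 in the strict sense, so whether a given consumer accepts it is an open question:

```python
import struct

def to_surrogate_utf8(text):
    """Encode each UTF-16 code unit of `text` as its own UTF-8 sequence."""
    utf16 = text.encode('utf-16-le')
    units = struct.unpack('<%dH' % (len(utf16) // 2), utf16)
    # 'surrogatepass' lets the codec emit 3-byte sequences for surrogates
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)

print(to_surrogate_utf8('\U0001d41f'))
```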

EDIT:

MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:

mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)

  ...

>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨

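【Comments】:

Perhaps struct.unpack('<%dH' % (len(e)//2), e)?

(1) The MySQL docs that I referred to declare the character set as part of the column definition: t character(128) character set utf8 ... are you sure that what you have is equivalent? (2) Try your UTF-16 stunt with Python 3.1 :-)

@John: (1) Retested on 2.6 with character set utf8. The results are the same. (2) That is just a limitation of the stock UTF-8 codec. It can be worked around with a custom codec. Or by having MySQL do the right thing in the first place.

【Answer 5】:

According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.

Note that the Unicode standard 5.2, chapter 3, actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence. See for example page 93: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed." However, as far as I know, this proscription is largely unknown or ignored.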
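Python 3's built-in UTF-8 codec does enforce this rule when encoding. A quick way to see it is the sketch below; the helper name encodes_as_utf8 is made up for illustration:

```python
def encodes_as_utf8(text):
    """Return True if Python 3's strict UTF-8 codec accepts `text`."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True

print(encodes_as_utf8('\U0001d41f'))  # a real supplementary character
print(encodes_as_utf8('\ud835'))      # a lone surrogate code point
```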

It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code provides a simple enough check:

all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)

and this code will replace any "nasties" with u'\ufffd':

u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
    )
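When checking what a database does with such characters, it can help to see which characters fail the test and where they are. The following Python 3 sketch applies the same comparison to list the offenders; find_offenders is a hypothetical helper, not part of the answer:

```python
def is_bmp_non_surrogate(ch):
    # Same condition as in the answer, written for Python 3 str.
    return ch < '\ud800' or '\ue000' <= ch <= '\uffff'

def find_offenders(text):
    """List (index, 'U+XXXX') for every character failing the check."""
    return [
        (i, 'U+%04X' % ord(ch))
        for i, ch in enumerate(text)
        if not is_bmp_non_surrogate(ch)
    ]

print(find_offenders('a\U0001d41fb\ud800'))
```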

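【Comments】:

"However this proscription is, as far as I know, largely unknown or ignored." Hopefully not! At least Python 3 refuses to encode surrogate code points (try chr(55349).encode("utf-8")).

@Philipp: Python 3 does seem to do "the right thing", but your example is a LONE surrogate, which is a different problem; Python 2 passes that test but fails this one: "\xed\xa0\x80\xed\xb0\x80".decode('utf8') produces u'\U00010000' instead of an exception.

Hmm... you forgot to add the u prefix to all the strings! It should be u'\ufffd'. ;)

【Answer 6】:

I guess it's not the fastest, but it is quite straightforward ("pythonic" :) :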

def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)

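NOTE: this code does not take into account the fact that Unicode has surrogate characters in the range U+D800-U+DFFF.

【Comments】:

Maybe it should exclude surrogates. Also: uc <= u'\uffff' is probably better than ord(uc) < 65536.

【Answer 7】:

This does more than just filter out unicode characters that take more than 3 bytes in UTF-8. It removes all non-ASCII characters, but tries to do so gently by replacing them with related ASCII characters where possible. It can be a blessing in the future if your text does not contain, for example, a dozen different unicode apostrophes and quotation marks (usually coming from Apple handhelds), but only the regular ASCII apostrophes and quotation marks.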

unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")

This is robust; I use it with a few more guards:

import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value

    if isinstance(value, str):
        return value

    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")

This is Python 2, by the way.
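A Python 3 rendering of the same idea might look like the sketch below (to_ascii is a made-up name). One caveat worth checking against your own data: as far as the Unicode character database goes, characters such as the curly apostrophe U+2019 and the en dash U+2013 have no compatibility decomposition, so with "ignore" they are dropped rather than turned into their ASCII look-alikes.

```python
import unicodedata

def to_ascii(value):
    """NFKD-normalize `value` and drop whatever is still non-ASCII."""
    if not isinstance(value, str) or not value:
        return value
    decomposed = unicodedata.normalize('NFKD', value)
    # decode back so that the result is text, not bytes
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('caf\u00e9'))     # accented letter keeps its base letter
print(to_ascii('\ufb01x'))       # the "fi" ligature is expanded
print(to_ascii('it\u2019s'))     # the curly apostrophe is removed
```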

【Comments】:
