Setting an implicit default encoding/decoding error handler in Python

Posted 2021-04-04


I am working with external data that is encoded in latin1. So I added a sitecustomize.py and put this in it:

sys.setdefaultencoding('latin_1')

Sure enough, working with latin1 strings is now fine.

But if I run into something that cannot be encoded in latin1:

s=str(u'abc\u2013')

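I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

What I would like is for undecodable characters to simply be ignored, i.e. in the example above I would get s=='abc?', and to get that without explicitly calling decode() or encode() each time, i.e. not s.decode(..., 'replace') on every call.

I tried different things with codecs.register_error, but to no avail.

Please help?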

Answer

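There is a reason scripts cannot call sys.setdefaultencoding. Don't do it: some libraries, including standard libraries that ship with Python, expect the default to be 'ascii'.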

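Instead, explicitly decode strings to Unicode as they are read into your program (from a file, stdin, a socket, etc.) and explicitly encode them as they are written out.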

Explicit decoding takes a parameter that specifies what to do with undecodable bytes.
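A minimal sketch of that advice, assuming Python 3 (where str is already Unicode and bytes holds the encoded form); the variable names are illustrative only and do not come from the question:

```python
# Bytes as they might arrive from an external latin-1 source (illustrative data).
raw = b'caf\xe9'

# Decode once at the input boundary. latin-1 maps every byte, so this cannot fail.
text = raw.decode('latin-1')

# Encode at the output boundary; 'replace' substitutes '?' for unencodable characters.
encoded = u'abc\u2013'.encode('latin-1', 'replace')

# When decoding, 'replace' inserts U+FFFD for a bad byte and 'ignore' drops it.
replaced = b'abc\xff'.decode('utf-8', 'replace')
ignored = b'abc\xff'.decode('utf-8', 'ignore')

print(encoded)
```

With 'replace' on the encode side, the string from the question comes out as the 'abc?' the asker wanted, at the cost of one explicit call at the output boundary.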

Another answer

You can define your own custom error handler and use it in place of the built-in ones. See this example:

import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    # Log what failed, then substitute '?' for the offending span.
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    # Return the replacement text and the position at which to resume.
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print(b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler'))
print(codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler"))

Run it and you will see:

invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
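
To get closer to the "set it once" behavior the question asks for, the name of a registered handler can also be passed as the errors argument when a text stream is created, so that every write through that stream uses it. The following is only a sketch, assuming Python 3; the handler name question_mark_handler and the in-memory stream standing in for a file are hypothetical and chosen for illustration:

```python
import codecs
import io


def question_mark_handler(exception):
    # Hypothetical handler: substitute '?' and resume after the bad span.
    return ("?", exception.end)


codecs.register_error("question_mark_handler", question_mark_handler)

# An in-memory byte stream stands in for a real file or socket.
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="latin-1",
                          errors="question_mark_handler")

# No per-call encode(): the wrapper applies the handler on every write.
writer.write(u"abc\u2013")
writer.flush()

result = raw.getvalue()
print(result)
```

The error policy is chosen once, where the stream is opened, rather than globally through sys.setdefaultencoding, so libraries that expect the default encoding are not affected.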
