read line with .encode with utf8 [duplicate]

Posted 2023-02-23


【Posted】: 2016-10-12 21:53:22 【Question】:

I read lines from a file like the following:

The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)

Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)

I read/encode them with:

title = line.encode('utf8')

But the output is:

b'Die virtuelle Katastrophe: So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)'

b'The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)'

Why is the "b" always added? How do I read the file properly so that the umlauts are preserved?

Here is the complete relevant code snippet:

# Parse the clippings.txt file
lines = [line.strip() for line in codecs.open(config['CLIPPINGS_FILE'], 'r', 'utf-8-sig')]
for line in lines:
    line_count = line_count + 1
    if (line_count == 1 or is_title == 1):
        # ASSERT: this is a title line
        #title = line.encode('ascii', 'ignore')
        title = line.encode('utf8')
        prev_title = 1
        is_title = 0
        note_type_result = note_type = l = l_result = location = ""
        continue

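Thanks

【Comments】:

b'' means you have a bytes buffer, not the (unicode) string that encode() expects; encode() converts a string into an encoded byte sequence. In your case you need to decode() from utf-8, not encode to utf-8. Or better, use codecs.open(..., encoding='utf-8'). For a proper answer, though, I would like to see more of your code.

@dhke Removing the .encode line is probably enough, because the output looks like correct UTF-8, which means line is already a valid Unicode string.

@f0rd42 I see. And looking at the snippet, you should be able to simply drop the encoding part entirely. At that point, line is already a (decoded) Python string. '\xc3\xbc' is also the correct utf-8 for the German ü. What made you think the umlauts are not read correctly? Do they show up incorrectly in the output?

@melpomene AttributeError: 'str' object has no attribute 'decode' ;-). It is Python 3, and Python 3 strings have no decode() because they are already decoded. Just doing "title = line" covers everything I need. I took the code as a basis for my own needs. Thank you both.

【Answer 1】: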

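The str.encode method converts a unicode string into a bytes object:

str.encode(encoding="utf-8", errors="strict") Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.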

So you are getting exactly what is expected.
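To make the str/bytes distinction concrete, here is a minimal sketch of the round trip in Python 3. The sample string is taken from the question; the variable names encoded, decoded and ascii_only are made up for illustration.

```python
# Sketch: str -> bytes -> str round trip in Python 3.
title = "So führen Sie Teams über Distanz"

# encode() returns bytes; its repr carries the b'...' prefix and shows
# non-ASCII characters as \x escapes of their UTF-8 byte values.
encoded = title.encode("utf8")
print(type(encoded).__name__)   # bytes
print(encoded)                  # b'So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz'

# decode() turns the bytes back into the original str, umlauts intact.
decoded = encoded.decode("utf8")
print(decoded)                  # So führen Sie Teams über Distanz

# The commented-out variant from the question silently drops the umlauts.
ascii_only = title.encode("ascii", "ignore")
print(ascii_only)               # b'So fhren Sie Teams ber Distanz'
```

The umlauts are not lost in the b'...' output; they are only displayed as escaped bytes. If the goal is to keep working with text, the str can be used directly without calling encode() at all.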

On most machines you can simply open the file and read it. If the file encoding is not the system default, you can pass it as a keyword argument:

with open(filename, encoding='utf8') as f:
    line = f.readline()
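As a rough illustration of how this applies to the question's loop, the sketch below writes a small temporary file with a byte order mark and reads it back using the 'utf-8-sig' codec that the question already uses. The temporary file is a stand-in for the asker's clippings file, and all names here are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical sample file standing in for the clippings file,
# written with a leading BOM via the 'utf-8-sig' codec.
sample = "So führen Sie Teams über Distanz zur Spitzenleistung\n"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8-sig") as f:
    f.write(sample)

# Reading with 'utf-8-sig' strips the BOM if present; every line
# comes back as an already decoded str.
with open(path, encoding="utf-8-sig") as f:
    lines = [line.strip() for line in f]

os.remove(path)

title = lines[0]   # use the str directly, no .encode() needed
print(title)       # So führen Sie Teams über Distanz zur Spitzenleistung
```

Because the built-in open() decodes on the fly, the title can be assigned straight from the line, which matches what the asker ended up doing with "title = line".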
