Advanced Python: String Encoding (encode) and Decoding (decode)

Posted 2021-12-14 范桂飓


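Basic concepts

bit: the smallest unit of data in a computer, a single 0 or 1.
byte: the basic unit of storage, made up of 8 bits.
char (character): a symbol that humans can read.
string: a sequence of chars.
bytes (byte sequence): a char or string stored in the form of bytes. This is unrelated to the compiled bytecode that Python keeps in .pyc files.
encode: convert a human-readable char or string into a machine-readable byte sequence. Many encodings exist, for example ASCII, UTF-8, and GBK. Unicode itself is a character set that assigns each character a code point; UTF-8 and UTF-16 are encodings of it.
decode: the reverse of encode, turning a byte sequence back into a char or string.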
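The relationship between a char, its Unicode code point, and its encoded bytes can be sketched as follows. This is a minimal illustration assuming Python 3; the sample character and the variable names are chosen here for demonstration and are not from the original article.

```python
# Assumption: Python 3. The sample character is illustrative only.
ch = "中"

code_point = ord(ch)              # the Unicode code point, as an integer
utf8_bytes = ch.encode("utf-8")   # encode: str -> bytes
gbk_bytes = ch.encode("gbk")      # same char, different encoding

print(hex(code_point))                  # 0x4e2d
print(utf8_bytes, len(utf8_bytes))      # b'\xe4\xb8\xad' 3
print(gbk_bytes, len(gbk_bytes))        # b'\xd6\xd0' 2
print(len(utf8_bytes) * 8)              # 24 bits

# decode is the reverse step: bytes -> str
round_trip_ok = utf8_bytes.decode("utf-8") == ch
print(round_trip_ok)                    # True
```

The same character has one code point but different byte representations, which is why the encoding must be known before bytes can be read back as text.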

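Python's strings

Python has two different kinds of string: one stores text, the other stores bytes.

The default encoding in Python 2 is ASCII. ASCII defines only 128 characters and cannot represent Chinese or most other non-English text, so Python 2 also supports the far more capable Unicode. Having two string models side by side, however, inevitably brought extra encode/decode trouble.

For this reason Python 3 uses Unicode for all text, which makes development much more convenient.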

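Python 2:

1. Text: stored as Unicode, in the type unicode.
2. Bytes: stored as a raw byte sequence (plain ASCII in the simplest case), in the type str.

Python 3:

1. Text: stored as Unicode, in the type str.
2. Bytes: stored as a raw byte sequence, in the type bytes.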
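The split between the two Python 3 types can be checked directly. The snippet below is a small sketch assuming Python 3; the variable names are illustrative.

```python
# Assumption: Python 3.
text = "中文"                   # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: a sequence of integers in 0..255

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 6
print(data[0])                          # 228, i.e. 0xe4

# Python 3 refuses to mix the two types implicitly.
mixed_error = None
try:
    text + data
except TypeError as exc:
    mixed_error = type(exc).__name__
print(mixed_error)                      # TypeError
```

Indexing a bytes object yields an integer, while indexing a str yields a one-character str, which is one practical way to tell which of the two a value is.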

As a result, the built-in str() also differs between Python 2 and Python 3:

Python 2 str():

Help on class str in module __builtin__:

class str(basestring)
 |  str(object='') -> string
 |
 |  Return a nice string representation of the object.
 |  If the argument is a string, the return value is the same object.

Python 3 str():

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
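The two call forms in the Python 3 help text behave quite differently when given bytes. A short sketch, assuming Python 3 and an illustrative sample string:

```python
import sys

# Assumption: Python 3. The sample text is illustrative.
raw = "美丽人生".encode("gbk")

decoded_via_str = str(raw, "gbk")        # str(bytes, encoding) decodes the buffer
decoded_via_method = raw.decode("gbk")   # equivalent method call
no_encoding = str(raw)                   # without encoding: falls back to repr()

print(decoded_via_str == decoded_via_method)   # True
print(no_encoding[:2])                         # b'
print(sys.getdefaultencoding())                # utf-8
```

Calling str() on bytes without an encoding does not decode anything; it returns the printable representation, which is a common source of stray b'...' text in output.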

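Encoding (encode) and decoding (decode) in Python

Because every Python 3 str is Unicode text, Unicode serves as the intermediate form in any encode/decode conversion: bytes in one encoding are first decoded into a Unicode str, and that str is then encoded into the other encoding.

encode: convert a Unicode str into bytes in a specific encoding for storage, for example converting the Unicode str1 into gb2312 bytes.
decode: convert bytes in a specific encoding back into a Unicode str, for example converting gb2312 bytes into the Unicode str2.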
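The decode-then-encode route through a Unicode str might be wrapped up as below. This is a sketch assuming Python 3; the helper name transcode and its parameters are hypothetical and not part of any library.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Re-encode bytes by passing through str (hypothetical helper)."""
    text = data.decode(source_encoding)    # bytes -> Unicode str
    return text.encode(target_encoding)    # Unicode str -> bytes


gb2312_bytes = "中文".encode("gb2312")
utf8_bytes = transcode(gb2312_bytes, "gb2312", "utf-8")

print(gb2312_bytes)   # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)     # b'\xe4\xb8\xad\xe6\x96\x87'
```

There is no direct bytes-to-bytes conversion between text encodings here; the str in the middle is what carries the meaning of the characters.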

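For example:

When we open a .py file in Vim and type a = 123, that text is a sequence of Unicode characters in the editor. When we save the file, Vim converts the text into bytes in the encoding it is configured to use (e.g. UTF-8) and writes them to disk. This is an encode step.
Later, when the Python interpreter runs the .py file, it first decodes those bytes into Unicode text using the file's declared encoding, and then runs the program. This is a decode step.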
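The save and load steps can be imitated with ordinary file I/O. The sketch below assumes Python 3 and UTF-8 as the editor's encoding; the file content and the use of a temporary file are illustrative choices.

```python
import os
import tempfile

# Assumption: the editor is configured for UTF-8. Content is illustrative.
source_text = "a = 123  # 中文注释"

# Saving: text is encoded to bytes on the way to disk.
fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(source_text)

# On disk there are only bytes.
with open(path, "rb") as f:
    on_disk = f.read()

# Loading: the bytes are decoded back to text with the same encoding.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()
os.remove(path)

print(on_disk == source_text.encode("utf-8"))   # True
print(loaded == source_text)                    # True
```

If the file were read back with a different encoding than it was written with, the decode step would either fail or produce different characters.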

>>> '美丽人生'.encode('gbk')
b'\xc3\xc0\xc0\xf6\xc8\xcb\xc9\xfa'
>>> b'\xc3\xc0\xc0\xf6\xc8\xcb\xc9\xfa'.decode('gbk')
'美丽人生'
>>> '美丽人生'.encode('utf-8')
b'\xe7\xbe\x8e\xe4\xb8\xbd\xe4\xba\xba\xe7\x94\x9f'
>>> b'\xe7\xbe\x8e\xe4\xb8\xbd\xe4\xba\xba\xe7\x94\x9f'.decode('utf-8')
'美丽人生'
>>> b'\xc3\xc0\xc0\xf6\xc8\xcb\xc9\xfa'.decode('gbk').encode('utf-8')
b'\xe7\xbe\x8e\xe4\xb8\xbd\xe4\xba\xba\xe7\x94\x9f'

The b'...' values above are bytes, and each \xNN escape stands for one byte. As the output shows, a common Chinese character takes 2 bytes in GBK and 3 bytes in UTF-8.
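The per-character byte counts can be confirmed by measuring the encoded length. A minimal sketch, assuming Python 3; the dictionary name is illustrative.

```python
# Assumption: Python 3.
text = "美丽人生"

bytes_per_char = {}
for encoding in ("gbk", "utf-8"):
    data = text.encode(encoding)
    bytes_per_char[encoding] = len(data) // len(text)
    print(encoding, len(data), bytes_per_char[encoding])
# gbk 8 2
# utf-8 12 3

# ASCII letters take a single byte in both encodings.
ascii_sizes = (len("A".encode("gbk")), len("A".encode("utf-8")))
print(ascii_sizes)   # (1, 1)
```

Note that len() on a str counts characters while len() on bytes counts bytes, so the two numbers differ for any non-ASCII text.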

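In some terminals or consoles, strings always come out garbled or even raise an error. The cause is that the terminal or console decodes output with an encoding different from the one the string was encoded with.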
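The garbling effect can be reproduced without a terminal by decoding bytes with the wrong codec. This is a sketch assuming Python 3; Latin-1 stands in for a mismatched terminal encoding because it accepts any byte value.

```python
import sys

# Assumption: Python 3. Latin-1 plays the role of the mismatched terminal.
original = "中文"
utf8_bytes = original.encode("utf-8")

# Wrong codec: no error is raised, but the characters are wrong (mojibake).
garbled = utf8_bytes.decode("latin-1")
print(ascii(garbled))
print(garbled == original)      # False

# The bytes were never damaged, so reversing the wrong step recovers the text.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == original)    # True

# The encoding the current stdout expects; this depends on the environment.
print(sys.stdout.encoding)
```

Garbled output of this kind usually means the data is intact and only the decoding step was wrong, so the fix belongs in the terminal or stream configuration.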

For example (Python 2):

# -*- coding: utf-8 -*-  # declares that this source file is encoded as UTF-8

s1 = '中文'
print type(s1)       # Python 2: a str, i.e. the raw UTF-8 bytes from the source file
print s1

s2 = '中文'
s2.encode('gb2312')  # Python 2 first decodes the byte str with the 'ascii' codec, which fails
print type(s2)
print s2

Output:

<type 'str'>
中文

Traceback (most recent call last):
  File "test1.py", line 13, in <module>
    s.encode('gb2312')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

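The UnicodeDecodeError arises because, in Python 2, the str being encoded is a byte string that already holds UTF-8 bytes. encode() only makes sense on Unicode text, so Python 2 first implicitly decodes the byte string with the default encoding, ASCII, and the very first byte, 0xe4, lies outside the ASCII range. That is why the traceback names the 'ascii' codec even though the code asked for gb2312. The fix is to decode explicitly with the real encoding first, for example s2.decode('utf-8').encode('gb2312'). More generally, output is only correct when encode and decode use the same encoding: bytes encoded as gb2312 but decoded as UTF-8 come out garbled or raise a decode error.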
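The failing implicit step and the explicit fix can both be reproduced in Python 3, where the conversion has to be spelled out. A sketch under that assumption; the variable names are illustrative.

```python
# Assumption: Python 3, used here to replay what Python 2 did implicitly.
utf8_bytes = "中文".encode("utf-8")   # the bytes a Python 2 str would hold

# The hidden step: decoding with the ASCII codec fails on the first byte.
message = ""
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# The explicit conversion: decode with the real encoding, then encode.
gb2312_bytes = utf8_bytes.decode("utf-8").encode("gb2312")
print(gb2312_bytes)                    # b'\xd6\xd0\xce\xc4'

# Python 3 removes the ambiguity: bytes has no encode() method.
bytes_has_encode = hasattr(utf8_bytes, "encode")
print(bytes_has_encode)                # False
```

Because Python 3 never decodes behind the scenes, this class of error shows up at the point where the wrong codec is named rather than somewhere unrelated.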

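A Python program can declare the encoding of its source file with # -*- coding: utf-8 -*-. In Python 2 it was also possible to change the process-wide default encoding, as shown below. This approach does not exist in Python 3, where sys.setdefaultencoding was removed.

import sys

# Python 2 only: sys.setdefaultencoding is deleted at startup,
# so reload(sys) is needed to bring it back.
reload(sys)
sys.setdefaultencoding('utf8')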
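In Python 3 the usual alternative is to name the encoding at each I/O boundary instead of changing a global default. The sketch below assumes Python 3 and uses an in-memory buffer in place of a real file; the variable names are illustrative.

```python
import io
import sys

# Assumption: Python 3.
default_encoding = sys.getdefaultencoding()
has_setter = hasattr(sys, "setdefaultencoding")
print(default_encoding)   # utf-8
print(has_setter)         # False

# Name the encoding explicitly where text meets bytes.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="gb2312")
writer.write("中文")
writer.flush()
written = buffer.getvalue()
print(written)            # b'\xd6\xd0\xce\xc4'
writer.detach()           # release the buffer without closing it
```

The same encoding argument is accepted by the built-in open(), so each file can declare its own encoding without affecting the rest of the program.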
