Problems with chardet
Posted by zmiao
1. The response to a web request contained a file name as raw bytes:
file_name = b'\xba\xe3\xcb\xb3\xd6\xda\x95N(300208)_\xcf\xd6\xbd\xf0\xc1\xf7\xc1\xbf\xb1\xed.xls'
I then used chardet.detect to find the encoding and got 'GB2312', but decoding with that encoding failed:
>>> file_name.decode('GB2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode byte 0x95 in position 6: illegal multibyte sequence
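Byte 0x95 at position 6 begins the two-byte sequence \x95\x4e. Looking up the corresponding Chinese character at https://bianma.supfree.net/chaye.asp?id=6607 shows that it is encoded in GBK, which is a superset of GB2312. Decoding with 'GBK' works and no longer raises an error: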
>>> file_name.decode('gbk', errors='ignore')
'恒顺众昇(300208)_现金流量表.xls'
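The fix above amounts to trying a superset codec when the narrower one fails. Here is a minimal sketch of that idea. The helper name `decode_with_fallback` and the candidate order (GB2312, then GBK, then GB18030) are my own assumptions for illustration and are not part of chardet. It decodes strictly, without errors='ignore', so a wrong candidate raises instead of silently dropping bytes.

```python
def decode_with_fallback(raw, candidates=("gb2312", "gbk", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw strictly.

    Hypothetical helper: the candidate list is an assumption that the data
    is Simplified Chinese, ordered from the narrowest codec to the widest.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This codec cannot represent some byte sequence; try a wider one.
            continue
    raise ValueError("none of the candidate encodings could decode the bytes")


file_name = b'\xba\xe3\xcb\xb3\xd6\xda\x95N(300208)_\xcf\xd6\xbd\xf0\xc1\xf7\xc1\xbf\xb1\xed.xls'
text, used = decode_with_fallback(file_name)
print(used, text)
```

Here 'gb2312' fails on byte 0x95 exactly as in the traceback, and the loop moves on to 'gbk'.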
2. If the case above is still acceptable, the next one is harder to justify.
>>> import chardet
>>> file_bname = b'\xc2\xf5\xc8\xf0\xd2\xbd\xc1\xc6(300760)_\xc0\xfb\xc8\xf3\xb1\xed.xls'
>>> chardet.detect(file_bname)['encoding']
>>> print(chardet.detect(file_bname)['encoding'])
None
>>> file_bname.decode('gbk')
'迈瑞医疗(300760)_利润表.xls'
>>> file_bname.decode('gb2312')
'迈瑞医疗(300760)_利润表.xls'
>>>
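As shown, chardet reports no encoding at all in the Python shell, even though both 'gbk' and 'gb2312' decode the bytes correctly. Inside scrapy, however, the same detection produces the following result: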
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: CP932 Japanese prober hit error at byte 43
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-TW Taiwan prober hit error at byte 27
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: utf-8 not active
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: CP932 not active
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-JP Japanese confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: GB2312 Chinese confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-KR Korean confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: CP949 Korean confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: Big5 Chinese confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: EUC-TW not active
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: KOI8-R Russian confidence = 0.11814918824024898
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: MacCyrillic Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM866 Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM855 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-7 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1253 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Bulgairan confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Bulgarian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: TIS-620 Thai confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-9 Turkish confidence = 0.35989894691932234
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: KOI8-R Russian confidence = 0.11814918824024898
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: MacCyrillic Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM866 Russian confidence = 0.01
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: IBM855 Russian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-7 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1253 Greek confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-5 Bulgairan confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1251 Bulgarian confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: TIS-620 Thai confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: ISO-8859-9 Turkish confidence = 0.35989894691932234
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2018-12-16 15:28:05 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
encoding: ISO-8859-9
Decoding with that encoding therefore produces garbled text, even though the bytes are actually GBK-encoded.
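One way to guard against a wrong or missing guess like this is to accept the detector's answer only when it is confident, and otherwise fall back to an encoding you expect from the site. The sketch below assumes the input is a dict shaped like the return value of chardet.detect, with 'encoding' and 'confidence' keys. The helper name `choose_encoding`, the 0.8 threshold, and the 'gbk' default are my own assumptions, not chardet features. The two sample dicts only mimic the results described above: the 0.36 is the ISO-8859-9 prober confidence from the log, rounded, and the 0.0 is a placeholder.

```python
def choose_encoding(detection, default="gbk", min_confidence=0.8):
    """Pick an encoding from a chardet-style result dict, or fall back to default.

    Hypothetical helper: default and min_confidence are assumptions to tune
    for the site being scraped.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return default
    return encoding


file_bname = b'\xc2\xf5\xc8\xf0\xd2\xbd\xc1\xc6(300760)_\xc0\xfb\xc8\xf3\xb1\xed.xls'

# Illustrative stand-ins for the shell result (no encoding) and the scrapy result.
shell_result = {"encoding": None, "confidence": 0.0}
scrapy_result = {"encoding": "ISO-8859-9", "confidence": 0.36}

for result in (shell_result, scrapy_result):
    print(file_bname.decode(choose_encoding(result)))
```

In real code the dict would come from chardet.detect(file_bname). Both sample results fall back to 'gbk' here, so the file name decodes correctly in either environment.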
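These two examples show that the chardet module can still misjudge encodings fairly often, so its result needs to be checked in practice rather than trusted as is.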