Detecting a Web Page's Encoding with chardet
Posted by roucheng
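Environment: Windows 7 x64, Python 3.4.3
chardet has to be downloaded and installed first. Download: https://pypi.python.org/packages/source/c/chardet/chardet-2.3.0.tar.gz
To install, change into the extracted directory and run this in a command prompt: python setup.py install
Here is a test Python script (DetectURLCoding.py):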
#coding:utf-8
'''python 3.x'''
import sys
import urllib.request
import chardet

# Write data (bytes) to the file fname
def writeFile(fname, data):
    with open(fname, "wb") as f:
        f.write(data)

def blog_detect(blogurl):
    '''Detect the encoding of the page at blogurl'''
    try:
        fp = urllib.request.urlopen(blogurl)
    except Exception as e:
        print(e)
        print('download exception-[%s]' % blogurl)
        return 0
    blog = fp.read()  # in Python 3.x, read() returns the HTML as bytes
    fp.close()
    #writeFile("t.html", blog)
    # chardet.detect() returns a dict; take the encoding name from it
    codedetect = chardet.detect(blog)['encoding']
    print('%s <- %s' % (blogurl, codedetect))
    return 1

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print('usage: python DetectURLCoding.py http://xxx.com')
    else:
        v = blog_detect(sys.argv[1])
        print(v)
Results of running it (here the script was saved as de.py):
D:\profile\Desktop>PYTHON de.py http://hovertree.com/
http://hovertree.com/ <- utf-8
1
D:\profile\Desktop>PYTHON de.py http://photo.cankaoxiaoxi.com/roll10/2015/0318/709734.shtml
http://photo.cankaoxiaoxi.com/roll10/2015/0318/709734.shtml <- utf-8
1
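The script above only prints the detected encoding name. A natural next step is to use that name to decode the downloaded bytes into text. The following is a minimal sketch of that step, not part of the original script: the helper name decode_page, its detect parameter, and the utf-8 fallback are assumptions made for illustration. It relies on chardet.detect() returning a dict with 'encoding' and 'confidence' keys, where 'encoding' can be None when nothing could be detected.

```python
def decode_page(raw, detect=None, fallback="utf-8"):
    """Decode raw bytes with the detected encoding (hypothetical helper).

    detect   -- a callable like chardet.detect; defaults to chardet.detect
    fallback -- encoding to use when detection yields None (an assumption)
    """
    if detect is None:
        import chardet  # imported lazily so another detector can be passed in
        detect = chardet.detect
    result = detect(raw)
    # 'encoding' may be None if the detector could not decide
    encoding = result.get("encoding") or fallback
    # errors="replace" avoids an exception if the guess is slightly off
    text = raw.decode(encoding, errors="replace")
    return text, encoding, result.get("confidence")
```

In blog_detect, this could be called as decode_page(blog) right after fp.read(). Checking the returned confidence before trusting the result may be worthwhile, since detection on short or mostly-ASCII pages tends to be less reliable.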