Python urllib and urllib3 packages
Posted by qiuri2008
urllib.request
The most heavily used module in urllib, covering requests, responses, browser simulation, proxies, cookies, and more.
request.urlopen(url, data=None, timeout=10)
# url: the URL to open
# data: data to submit via POST
# timeout: access timeout for the site

from urllib import request
import ssl

# Work around "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed>" in some environments
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.jianshu.com'
# Returns <http.client.HTTPResponse object at 0x0000000002E34550>
response = request.urlopen(url, data=None, timeout=10)
# urlopen() from urllib.request fetches the page; the data comes back as bytes and must be decode()d into str
page = response.read().decode('utf-8')
To send custom headers, urlopen alone is not enough; use a Request object instead.
PC
import urllib.request

url = 'https://www.jianshu.com'
# Add headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
# urllib decides between GET and POST by checking whether the data parameter was supplied
print(request.get_method())

>> Output
GET
Mobile

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
               'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
Cookies are used on the client side to record user identity and maintain login state.
import http.cookiejar, urllib.request

# 1. Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Create a cookie handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener object
opener = urllib.request.build_opener(handler)
# Install the opener globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)

# 2. Save cookies to a text file
import http.cookiejar, urllib.request
filename = "cookie.txt"
# Several on-disk formats are available
## Format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
## Format 2
cookie = http.cookiejar.LWPCookieJar(filename)

# Load the file back with the matching class
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
……
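The example above is truncated, so here is a minimal sketch of a full save-and-reload round trip with MozillaCookieJar; the httpbin URLs are just placeholders for a site that sets a cookie:

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
# Placeholder endpoint that sets a cookie named "name"
opener.open('http://httpbin.org/cookies/set?name=value')
# Persist all cookies, including session cookies and expired ones
cookie.save(ignore_discard=True, ignore_expires=True)

# Later: reload the saved cookies and attach them to a new opener
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)
opener2 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie2))
print(opener2.open('http://httpbin.org/cookies').read().decode('utf-8'))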
4. Setting a proxy
When the site you want to scrape restricts access, you need a proxy to fetch its data.
import urllib.request

url = 'http://httpbin.org/ip'
proxy = {'http': '39.134.108.89:8080', 'https': '39.134.108.89:8080'}
# Create a proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# Create an opener that uses it
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
# Install the opener globally, so urlopen also goes through the proxy
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)
print(data.read().decode())
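If the proxy itself requires HTTP basic authentication, a ProxyBasicAuthHandler can be chained into the opener as well. A minimal sketch, where the proxy address and credentials are placeholder assumptions:

import urllib.request

# Placeholder proxy address and credentials
proxy_url = 'http://127.0.0.1:8888'
proxy_handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
# Use a default-realm password manager so the credentials match any realm the proxy reports
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)
auth_handler.add_password(None, proxy_url, 'user', 'password')
opener = urllib.request.build_opener(proxy_handler, auth_handler)
urllib.request.install_opener(opener)
print(urllib.request.urlopen('http://httpbin.org/ip').read().decode())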
urllib.error

import urllib.error
import urllib.request

req = urllib.request.Request('http://www.usahfkjashfj.com/')
try:
    urllib.request.urlopen(req).read()
except urllib.error.URLError as e:
    print(e.reason)
else:
    print('success')

>> Output
[Errno 11004] getaddrinfo failed
HTTPError is a subclass of URLError. When you issue a request with urlopen, the server returns a response object that carries a numeric status code. If the response is, say, a redirect pointing to another location, urllib handles it automatically; for responses it cannot handle, urlopen raises an HTTPError with the corresponding status code. HTTP status codes indicate the status of the response returned by the HTTP protocol.
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# Catch the subclass error first
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
>> Output
Not Found
-------------
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 08 Feb 2018 14:45:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
urllib.parse.urljoin: join a base URL with another URL

from urllib import parse

print(parse.urljoin('https://www.jianshu.com/xyz', 'FAQ.html'))
print(parse.urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))

>> Output
https://www.jianshu.com/FAQ.html
http://www.baidu.com/FAQ.html
urllib.parse.urlencode: convert a dict into a URL-encoded query string
from urllib import request, parse

url = r'https://www.jianshu.com/collections/20f7f4031550/mark_viewed.json'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Referer': r'https://www.jianshu.com/c/20f7f4031550?utm_medium=index-collections&utm_source=desktop',
    'Connection': 'keep-alive'
}
data = {
    'uuid': '5a9a30b5-3259-4fa0-ab1f-be647dbeb08a',
}
# POST data must be bytes or an iterable of bytes, not str, so it needs encode()
data = parse.urlencode(data).encode('utf-8')
print(data)
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
print(page)

>> Output
b'uuid=5a9a30b5-3259-4fa0-ab1f-be647dbeb08a'
{"message":"success"}
urllib.parse.quote: URL-encode a string
urllib.parse.unquote: URL-decode a string
URLs are encoded as ASCII (percent-encoded bytes) rather than raw Unicode, for example:
http://so.biquge.la/cse/search?s=7138806708853866527&q=%CD%EA%C3%C0%CA%C0%BD%E7
from urllib import parse

x = parse.quote('山西', encoding='gb18030')  # also works with encoding='GBK'
print(x)  # %C9%BD%CE%F7
city = parse.unquote('%E5%B1%B1%E8%A5%BF')  # encoding='utf-8' (the default)
print(city)  # 山西
The urllib3 package
urllib3 is a powerful, well-organized HTTP client library for Python, and many Python projects already build on it. urllib3 provides many important features missing from the standard library:
1. Thread safety
2. Connection pooling
3. Client-side SSL/TLS verification (see the sketch after this list)
4. Multipart file uploads
5. Helpers for retrying requests and dealing with HTTP redirects
6. Support for compressed encodings (gzip, deflate)
7. HTTP and SOCKS proxy support
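Certificate verification and proxying are not demonstrated later in this article, so here is a minimal sketch of both; it assumes the certifi package is installed to supply a CA bundle, and the proxy address is a placeholder:

import urllib3
import certifi  # assumption: certifi is installed to provide a CA bundle

# Verify server certificates against certifi's CA bundle
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', 'https://httpbin.org/ip')
print(r.status)

# Route requests through an HTTP proxy (placeholder address)
proxy = urllib3.ProxyManager('http://127.0.0.1:8888/')
# r = proxy.request('GET', 'http://httpbin.org/ip')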
Installation:
urllib3 can be installed with pip:
$ pip install urllib3
You can also download the latest source from GitHub and install it after unpacking:
$ git clone git://github.com/shazow/urllib3.git
$ python setup.py install
Using urllib3:
GET requests
import urllib3

# Suppress "InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised."
urllib3.disable_warnings()

# A PoolManager instance issues requests; it handles connection pooling and all the thread-safety details
http = urllib3.PoolManager()
# Create a request with request()
r = http.request('GET', 'http://cuiqingcai.com/')
print(r.status)  # 200
# Get the HTML source, decoded as UTF-8
print(r.data.decode())
GET request (with query parameters)
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('GET', 'https://www.baidu.com/s?', fields={'wd': 'hello'}, headers=header)
print(r.status)  # 200
print(r.data.decode())
POST requests
# request() can also attach extra information to the request, e.g.:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('POST', 'http://httpbin.org/post', fields={'hello': 'world'}, headers=header)
print(r.data.decode())
# For POST and PUT requests, query parameters must be encoded manually and appended to the URL:
import urllib.parse

encode_arg = urllib.parse.urlencode({'arg': '我的'})
print(encode_arg.encode())
r = http.request('POST', 'http://httpbin.org/post?' + encode_arg, headers=header)
# Decode the unicode escapes in the response
print(r.data.decode('unicode_escape'))
Sending JSON data
# JSON: send an already-encoded JSON payload by setting the body parameter and the Content-Type header:
import json

data = {'attribute': 'value'}
encode_data = json.dumps(data).encode()
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=encode_data,
                 headers={'Content-Type': 'application/json'})
print(r.data.decode('unicode_escape'))
Uploading files
# To upload a file with multipart/form-data encoding, pass it the same way as ordinary form data,
# defining the file as a tuple (file_name, file_data):
with open('1.txt', 'r+', encoding='UTF-8') as f:
    file_read = f.read()

r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield': ('1.txt', file_read, 'text/plain')})
print(r.data.decode('unicode_escape'))

# Binary file
with open('websocket.jpg', 'rb') as f2:
    binary_read = f2.read()

r = http.request('POST',
                 'http://httpbin.org/post',
                 body=binary_read,
                 headers={'Content-Type': 'image/jpeg'})
# print(json.loads(r.data.decode('utf-8'))['data'])
print(r.data.decode('utf-8'))
Using timeouts
# timeout limits how long a request may run. In simple cases, pass a float:
r = http.request('POST', 'http://httpbin.org/post', timeout=3.0)
print(r.data.decode('utf-8'))

# To apply one timeout to every request, set it on the PoolManager:
http = urllib3.PoolManager(timeout=3.0)
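For finer control, urllib3 also accepts a Timeout object that separates the connect and read phases; a brief sketch:

import urllib3

# Separate limits: 2s to establish the connection, 5s to read the response
timeout = urllib3.Timeout(connect=2.0, read=5.0)
http = urllib3.PoolManager(timeout=timeout)
r = http.request('GET', 'http://httpbin.org/delay/1')
print(r.status)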
Controlling retries and redirects
# Control retries with the retries parameter. By default, urllib3 retries a request 3 times and follows 3 redirects.
r = http.request('GET', 'http://httpbin.org/ip', retries=5)  # retry the request up to 5 times
print(r.data.decode('utf-8'))

# Disable both retries and redirects by setting retries to False:
r = http.request('GET', 'http://httpbin.org/redirect/1', retries=False, redirect=False)
print('d1', r.data.decode('utf-8'))

# Keep retries but disable redirects by setting only redirect to False
r = http.request('GET', 'http://httpbin.org/redirect/1', redirect=False)
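Retries can also be configured with a Retry object, which allows separate counts for requests and redirects plus a backoff factor; a minimal sketch:

import urllib3

# At most 3 attempts overall, at most 1 followed redirect, exponential backoff between attempts
retry = urllib3.Retry(total=3, redirect=1, backoff_factor=0.5)
http = urllib3.PoolManager(retries=retry)
r = http.request('GET', 'http://httpbin.org/redirect/1')
print(r.status)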