Crawler Request Library: requests

Posted 2020-10-21 ゛竹先森゜


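A request library is a module (library) that can mimic a browser and send requests to a website.

The requests module

requests can simulate the requests a browser makes. It is essentially a wrapper around urllib3, and its API is more convenient than the urllib module used earlier.

After requests downloads a page it does not execute the JavaScript in it. We have to analyze the target site ourselves and issue further requests. The selenium module, by contrast, can execute JavaScript.

Installation: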

pip3 install requests

Request methods: GET and POST are the two used most

# All the request methods; the common ones are requests.get() and requests.post()
import requests

r = requests.get('https://api.github.com/events')
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

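# GET requests
GET is the default HTTP request method
     * No request body
     * The data travels in the URL, so its size is bounded by the URL length limits of browsers and servers (the often-quoted 1K figure is not part of the HTTP specification)
     * The data is visible in the browser's address bar

Common ways a GET request is issued:
       1. Typing a URL directly into the browser's address bar always sends a GET request
       2. Clicking a hyperlink on a page also sends a GET request
       3. Submitting a form uses GET by default, but the form can be set to POST


# POST requests
(1). The data does not appear in the address bar
(2). The protocol places no upper limit on the data size (servers may still impose one)
(3). There is a request body
(4). Non-ASCII characters such as Chinese in a form-encoded request body are URL-encoded


# Note: requests.post() is used the same way as requests.get(); the difference is that requests.post() takes a data parameter that holds the request body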
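
To see where each method puts its data, the sketch below builds a GET and a POST request without sending them and inspects the result. It uses `requests.Request(...).prepare()`; the httpbin URLs and the `key`/`value` payload are placeholders borrowed from the snippet above.

```python
import requests

# Build (but do not send) a GET request: the parameters end up in the URL.
get_req = requests.Request(
    'GET', 'http://httpbin.org/get', params={'key': 'value'}
).prepare()

# Build (but do not send) a POST request: the data ends up in the body.
post_req = requests.Request(
    'POST', 'http://httpbin.org/post', data={'key': 'value'}
).prepare()

print(get_req.url)    # the query string is part of the URL
print(get_req.body)   # a GET built this way has no body
print(post_req.url)   # no query string
print(post_req.body)  # form-encoded body
```

Nothing goes over the network here, so this is a convenient way to check what a request will look like before running it against a real site.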

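GET-based requests

1. Basic request code

import requests
response = requests.get('http://www.cnblogs.com/')
print(response.text)

2. GET request parameters

GET request parameters go after the question mark in the URL, as key-value pairs.

Method 1: build the query string yourself (this is how it works underneath, though we normally use Method 2)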

# Disguise ourselves as a browser in the request headers; otherwise Baidu will not return the normal page content
import requests

response = requests.get('https://www.baidu.com/s?wd=python&pn=1',
                        # request headers
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.text)


# If the search keyword is Chinese or contains other special characters, it has to be URL-encoded before it is spliced into the URL
from urllib.parse import urlencode
wd = '苍老师'
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]  # the encoded string
print(keyword)
# then build the URL
url = 'https://www.baidu.com/s?wd=%s&pn=1' % keyword

response = requests.get(url,
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
res1 = response.text

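
The manual encoding step can also be written with `urllib.parse.quote`, which encodes a single value directly and avoids the `urlencode(...).split('=')` detour. A minimal sketch using only the standard library; the keyword `中文` is an arbitrary stand-in for any non-ASCII search term.

```python
from urllib.parse import quote, urlencode

wd = '中文'

# quote() percent-encodes a single value (UTF-8 by default).
keyword = quote(wd)
url = 'https://www.baidu.com/s?wd=%s&pn=1' % keyword

# urlencode() builds the whole query string from a dict in one step.
query = urlencode({'wd': wd, 'pn': 1})

print(url)
print('https://www.baidu.com/s?' + query)
```

Both lines print the same URL, so either function can be used when the query string is assembled by hand.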

Method 2: use the params parameter (under the hood it wraps Method 1)

# The steps above can be handled by the params parameter of requests, which still calls urlencode underneath
import requests
wd = 'egon老师'
pn = 1

response = requests.get('https://www.baidu.com/s',
                        # pass the values via params; this saves us the urlencode step
                        params={
                            'wd': wd,
                            'pn': pn
                        },
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
res2 = response.text

# To verify: when both methods use the same keyword, a.html and b.html show the same page (res1 comes from the Method 1 code)
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(res1)
with open('b.html', 'w', encoding='utf-8') as f:
    f.write(res2)

Using the params parameter
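
The claim that `params` produces the same URL as manual encoding can be checked without sending anything. The sketch below prepares a request and compares its URL with one built by `urlencode`; the keyword is again an arbitrary example value.

```python
import requests
from urllib.parse import urlencode

params = {'wd': '中文', 'pn': 1}

# Let requests build the URL, without sending anything.
prepared = requests.Request(
    'GET', 'https://www.baidu.com/s', params=params
).prepare()

# Build the same URL by hand.
manual = 'https://www.baidu.com/s?' + urlencode(params)

print(prepared.url)
print(prepared.url == manual)
```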

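3. Request headers (headers)

When a browser sends a GET request it normally includes request headers, which carry information about the client. Some sites require particular headers, so when the crawler imitates a browser's GET request, those headers must be set through the headers parameter.

Common request headers:

# Requests usually need to carry headers; they are the key to disguising the crawler as a browser. The commonly useful ones are:
Host
Referer    # large sites often use it to determine where the request came from
User-Agent # identifies the client
Cookie     # cookies travel in the request headers; requests has a separate parameter for them, and when that is used they need not be placed in headers={}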
# Add headers (the server inspects the request headers and may refuse access without them, e.g. https://www.zhihu.com/explore)
import requests
response = requests.get('https://www.zhihu.com/explore')
response.status_code  # 500


# Custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.zhihu.com/explore',
                        headers=headers)
print(response.status_code)  # 200

How to set request headers
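
When many requests need the same headers, one option is to set them once on a `requests.Session` object (sessions are introduced in the cookie section) instead of repeating the dict on every call. A minimal sketch; the `Referer` value is an assumption chosen for illustration, and the request is only prepared, not sent.

```python
import requests

UA = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36')

session = requests.Session()
# Headers set here are merged into every request made through the session.
session.headers.update({'User-Agent': UA})

# Prepare (but do not send) a request to inspect the final headers.
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.zhihu.com/explore',
                     headers={'Referer': 'https://www.zhihu.com/'})
)
print(prepared.headers['User-Agent'])
print(prepared.headers['Referer'])
```

Per-request headers are merged with the session defaults, so a single call can still add or override a header.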

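4. Cookies
Some features only work when logged in, so the request must carry a cookie. That cookie has to come from a real browser session and cannot be made up. What we can do is copy the cookie out of the browser after a real login and pass it in when the crawler sends its GET request.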

# Log in to GitHub, then copy the cookies out of the browser; after that the cookie alone is enough to log in, with no username or password needed
# Username: egonlin, email: 378533872@qq.com
import requests

cookies = {
    'user_session': 'wGMHFJKgDcmRIVvcA14_Wrt_3xaUyJNsBnPbYzEL6L0bHcfc',
}

response = requests.get(
    'https://github.com/settings/emails',
    cookies=cookies)  # GitHub places no particular restrictions on request headers, so no custom User-Agent is needed here; other sites may require one


print('378533872@qq.com' in response.text)  # True

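However, opening the browser and copying the cookie out every time is tedious, and no real programmer should put up with that. The requests module provides a function for exactly this: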

import requests

session = requests.session()

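After that, use session in place of requests whenever a request is sent. The session stores the cookies it receives and sends them back automatically, so there is no longer any need to pass cookies by hand.

Example code: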

session.get(
    'https://passport.lagou.com/grantServiceTicket/grant.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Referer': 'https://passport.lagou.com/login/login.html',
    }
)
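
The way a session carries cookies can be observed offline. In the sketch below a cookie is placed in the session's cookie jar by hand, standing in for one a server would normally set, and a request is prepared to show that the cookie is attached. The cookie value `abc123` is made up for illustration.

```python
import requests

session = requests.Session()
# Stand-in for a cookie the server would normally set via Set-Cookie.
session.cookies.set('user_session', 'abc123', domain='github.com', path='/')

# Prepare (but do not send) a request to the same domain.
prepared = session.prepare_request(
    requests.Request('GET', 'https://github.com/settings/emails')
)
print(prepared.headers.get('Cookie'))
```

In real use the jar is filled by the responses themselves, which is why a login followed by further `session.get()` calls works without any cookie handling in our code.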

POST-based requests

The code is the same as for GET, except that POST parameters go in the request body, so the call takes an extra data parameter that holds them.

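'''
1. Analyzing the target site
    Open https://github.com/login in the browser
    Enter a wrong username and password, and capture the traffic
    The login turns out to be a POST to https://github.com/session
    The request headers include a cookie
    The request body includes:
        commit:Sign in
        utf8:✓
        authenticity_token:lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==
        login:egonlin
        password:123


2. Flow
    First GET https://github.com/login to obtain the initial cookie and the authenticity_token
    Then POST to https://github.com/session, carrying the initial cookie and the request body (authenticity_token, username, password, etc.)
    Finally, obtain the login cookie

    ps: if the password is sent in encrypted form, enter a wrong username with the correct password and read the encrypted password from the browser; GitHub sends the password in plain text
'''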

import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # the initial cookie (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # extract the CSRF token from the page

# Second request: POST to the login endpoint with the initial cookie, the token, and the username and password
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714'
}
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie
                   )


login_cookie = r2.cookies.get_dict()


# Third request: from now on login_cookie is enough, e.g. to visit a personal settings page
r3 = requests.get('https://github.com/settings/emails',
                  cookies=login_cookie)

print('317828332@qq.com' in r3.text)  # True

Logging in to GitHub automatically, handling the cookies ourselves
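
The token extraction step can be tried in isolation. The sketch below runs the same regular expression against a small hand-written HTML fragment; the fragment and the token value are hypothetical and only imitate the shape of a login form.

```python
import re

# Hypothetical fragment of a login form, for illustration only.
html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN123==" />
  <input type="text" name="login" />
</form>
'''

pattern = r'name="authenticity_token".*?value="(.*?)"'
matches = re.findall(pattern, html)
# findall returns a list; guard against an empty result instead of indexing blindly.
token = matches[0] if matches else None
print(token)
```

Indexing with `[0]` as in the login code raises IndexError when the page layout changes and nothing matches, so checking the list first gives a clearer failure. If the attributes were ever split across lines, the pattern would also need the `re.S` flag.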

import requests
import re

session = requests.session()
# First request
r1 = session.get('https://github.com/login')
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # extract the CSRF token from the page

# Second request
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714'
}
r2 = session.post('https://github.com/session',
                  data=data,
                  )

# Third request
r3 = session.get('https://github.com/settings/emails')

print('317828332@qq.com' in r3.text)  # True

Logging in to GitHub automatically, with requests.session() handling the cookies for us

Additional notes

requests.post(url='xxxxxxxx',
              data={'xxx': 'yyy'})  # no Content-Type given; the default is application/x-www-form-urlencoded

# If we set Content-Type to application/json ourselves but still pass the values via data, the body is still form-encoded, so the server cannot read the values as JSON
requests.post(url='',
              data={'': 1},
              headers={
                  'content-type': 'application/json'
              })


requests.post(url='',
              json={'': 1},
              )  # the default Content-Type is application/json

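
The three cases above can be compared side by side by preparing the requests and printing the Content-Type and body of each. This is a sketch that sends nothing; the URL and the `key` payload are placeholders.

```python
import json
import requests

url = 'http://httpbin.org/post'
payload = {'key': 1}

form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()
# Mismatch: JSON Content-Type declared, but the body is still form-encoded.
mixed_req = requests.Request(
    'POST', url, data=payload,
    headers={'Content-Type': 'application/json'},
).prepare()

for name, req in [('data', form_req), ('json', json_req), ('mixed', mixed_req)]:
    print(name, req.headers['Content-Type'], req.body)
```

The third line of output shows why the server cannot read the value: the header says JSON while the body is `key=1`.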

The Response object

response holds the response information; its attributes are as follows

import requests
response = requests.get('http://www.jianshu.com')
# response attributes
print(response.text)     # the HTML content as a str
print(response.content)  # the raw body as bytes

print(response.status_code)  # status code, 200
print(response.headers)      # response headers
print(response.cookies)      # cookie jar object
print(response.cookies.get_dict())  # the cookies as a dict, {'locale': 'zh-CN'}

print(response.cookies.items())  # the cookies as a list of pairs, [('locale', 'zh-CN')]

print(response.url)  # the URL, https://www.jianshu.com/
print(response.history)

print(response.encoding)  # the encoding, utf-8

# Closing: response.close()
from contextlib import closing
with closing(requests.get('xxx', stream=True)) as response:
    for chunk in response.iter_content():
        pass  # write the content to a file piece by piece

Some pages are not encoded in UTF-8, so the encoding has to be handled

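# Encoding
import requests
response = requests.get('http://www.autohome.com/news')
# response.encoding = 'gbk'  # Autohome returns gb2312-encoded pages; when the response declares no charset, requests falls back to ISO-8859-1 for text, so without setting gbk the Chinese text is garbled
print(response.text)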
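
What goes wrong with the encoding can be reproduced without a network request. The sketch below encodes a short Chinese string as gb2312, the way such a page would deliver it, and decodes the bytes both ways; the sample string is arbitrary.

```python
# Bytes as a gb2312 page would deliver them.
raw = '汽车之家'.encode('gb2312')

wrong = raw.decode('ISO-8859-1')  # never fails, but yields garbled text
right = raw.decode('gbk')         # gbk is a superset of gb2312

print(repr(wrong))
print(right)
```

Setting `response.encoding = 'gbk'` makes `response.text` perform the second decode. requests also exposes `response.apparent_encoding`, a guess based on the body, which can be assigned to `response.encoding` when the right charset is not known in advance.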

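Fetching binary data and binary streams

Images and videos arrive as binary data, so they are written to disk in wb mode. A video can be very large, though, and loading all of it into memory before writing is not reasonable, so a binary stream is used instead.

Writing binary data: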

import requests

response = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')

with open('a.jpg', 'wb') as f:
    f.write(response.content)

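Binary streams:

# The stream parameter: fetch the content bit by bit. When downloading, say, a 100 GB video, reading response.content and writing it to the file in one go is not reasonable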

import requests

response = requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
                        stream=True)

with open('b.mp4', 'wb') as f:
    # iter_content yields chunks of bytes, not lines; its default chunk size is 1 byte, so pass a larger one
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

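
The write loop can be factored into a small helper that accepts any iterable of byte chunks, which also makes it testable without a download. This is a sketch: `save_stream` is a hypothetical helper name, and the list of chunks stands in for `response.iter_content(chunk_size=8192)`.

```python
import os
import tempfile

def save_stream(chunks, path):
    """Write an iterable of bytes chunks to path; return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Stand-in for response.iter_content(chunk_size=8192).
fake_chunks = [b'abc', b'', b'defg']
target = os.path.join(tempfile.mkdtemp(), 'b.mp4')
total = save_stream(fake_chunks, target)
print(total)
```

With a real response the call would be `save_stream(response.iter_content(chunk_size=8192), 'b.mp4')`, keeping only one chunk in memory at a time.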

Parsing JSON data:

# Parsing JSON
import requests
response = requests.get('http://httpbin.org/get')

import json
res1 = json.loads(response.text)  # the long way round

res2 = response.json()  # get the parsed JSON directly

print(res1 == res2)  # True

Redirection and History

By default Requests will perform location redirection for all verbs except HEAD.

We can use the history property of the Response object to track redirection.

The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

For example, GitHub redirects all HTTP requests to HTTPS:

>>> r = requests.get('http://github.com')

>>> r.url
'https://github.com/'

>>> r.status_code
200

>>> r.history
[<Response [301]>]

If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)

>>> r.status_code
301

>>> r.history
[]

If you're using HEAD, you can enable redirection as well:

>>> r = requests.head('http://github.com', allow_redirects=True)

>>> r.url
'https://github.com/'

>>> r.history
[<Response [301]>]

The explanation from the official documentation (quoted above)

import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # the initial cookie (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # extract the CSRF token from the page

# Second request: POST to the login endpoint with the initial cookie, the token, and the username and password
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714'
}


# Test 1: allow_redirects=False is not set, so when the response headers contain Location the redirect is followed, and r2 is the response of the new page
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie
                   )

print(r2.status_code)  # 200
print(r2.url)  # the URL after the redirect
print(r2.history)  # the responses from before the redirect
print(r2.history[0].text)  # the text of the response from before the redirect


# Test 2: allow_redirects=False is set, so even when the response headers contain Location the redirect is not followed, and r2 is still the response of the original page
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie,
                   allow_redirects=False
                   )


print(r2.status_code)  # 302
print(r2.url)  # the URL before the redirect, https://github.com/session
print(r2.history)  # []

Verifying it with the redirect to the home page that follows a GitHub login

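Advanced topics

What has been covered so far is already enough to crawl web pages with requests. The following are more advanced uses, though they are not difficult.

1. SSL certificate verification

Many large sites use HTTPS. Over HTTPS the server presents a certificate, and requests verifies it by default. Most sites, such as Zhihu and Baidu, can be accessed whether or not the client supplies a certificate of its own.

Some sites make it a hard requirement: access is restricted to designated users who have been issued a certificate. 12306 is the example used here.

In such cases the crawler's request needs a little extra work.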

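# Certificate verification (most sites use https)
import requests
response = requests.get('https://www.12306.cn')  # for an SSL request the certificate is checked first; if it is not valid, an error is raised and access is refused


# Improvement 1: no error; the request goes through but a warning is printed, because the certificate was not verified
import requests
response = requests.get('https://www.12306.cn', verify=False)  # with verify=False the certificate is not verified; a warning is shown, the request succeeds and returns 200
print(response.status_code)  # 200


# Improvement 2: no error and no warning; otherwise much the same as the previous step
import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # silence the warning
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Improvement 3: supply a certificate
# Many sites use https yet can be accessed without a client certificate
# Zhihu, Baidu and the like work either way
# Where it is a hard requirement it must be supplied, e.g. for designated users who need a certificate to access a particular site such as 12306
import requests
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))  # certificate files stored locally
print(response.status_code)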


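2. Proxies

When a crawler hits a site too frequently, the site may detect it and ban the IP, after which the crawl cannot continue. A proxy is the way around this.

How a proxy works: the request is first sent to the proxy's IP, and the proxy forwards it. Proxy IPs are easy to find by searching online, but they are short-lived. Use the freshest ones available, because older ones may already have been overused and banned.

An example using an HTTP proxy: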

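# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies

# Proxy setup: the request is sent to the proxy first, and the proxy forwards it
import requests
# Proxy settings. A dict keeps only one value per key, so use one 'http' entry at a time
proxies = {
    # 'http': 'http://egon:123@localhost:9743',  # proxy with username and password: the credentials come before the @, the IP and port after it
    'http': 'http://localhost:9743',  # proxy without username and password
    'https': 'https://localhost:9743',
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)

print(response.status_code)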
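
Writing two `'http'` entries in one dict literal does not raise an error; Python keeps only the last one. The sketch below shows that behaviour and a small helper that builds the proxies dict with optional credentials. `build_proxies` is a hypothetical helper, and it assumes the same HTTP proxy serves both http and https targets.

```python
def build_proxies(host, port, user=None, password=None):
    """Return a proxies dict for requests; credentials are optional."""
    auth = '%s:%s@' % (user, password) if user and password else ''
    proxy_url = 'http://%s%s:%s' % (auth, host, port)
    # Assumption: one HTTP proxy handles both http:// and https:// targets.
    return {'http': proxy_url, 'https': proxy_url}

# A dict literal with a repeated key silently keeps only the last value.
duplicated = {'http': 'first', 'http': 'second'}
print(duplicated)

plain = build_proxies('localhost', 9743)
with_auth = build_proxies('localhost', 9743, user='egon', password='123')
print(plain)
print(with_auth)
```

The result is passed as `requests.get(url, proxies=plain)`. Credentials that contain characters such as `@` or `:` would need percent-encoding before being placed in the URL.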

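Extension: SOCKS proxies

A SOCKS proxy differs from application-layer proxies and HTTP proxies: it simply relays packets and does not care which application protocol is in use (FTP, HTTP, NNTP and so on). Because it does not interpret the traffic, a SOCKS proxy usually adds less overhead than an application-layer proxy. It is conventionally bound to port 1080 on the proxy server. On a corporate or campus network, SOCKS may be needed to reach the Internet through a firewall or proxy server.

Dial-up users generally do not need it. The proxy servers in common use are still dedicated HTTP proxies, which are different from SOCKS, so being able to browse the web does not mean you can reach the Internet through SOCKS. Common firewalls and proxy software support SOCKS, but the administrator has to enable it. To use SOCKS you need to know the following:

① The IP address of the SOCKS server

② The port the SOCKS service listens on

③ Whether the SOCKS service requires authentication; if it does, ask the network administrator for a username and password

With this information you can enter it in the "network configuration", or at first registration, and then use the SOCKS proxy.

For crawling we do not need to know that much. It is enough to know that with an HTTP proxy, http and https requests must be configured separately because the settings differ, whereas with a SOCKS proxy that distinction can be ignored and SOCKS is used for both.

Example code: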

# SOCKS proxy support; install with: pip install requests[socks]
import requests

# socks replaces both http and https, so no distinction is needed
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https

   