Day534. Urllib Crawler - Python
Posted by 阿昌喜欢吃黄桃
Urllib
一、Anti-scraping measures
- User-Agent
    - "User Agent" (UA for short) is a special header string that lets a server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plugins, and so on.
- Proxy IPs
    - Xici proxy (西刺代理)
    - Kuaidaili (快代理)
    - What are elite (high-anonymity), anonymous, and transparent proxies, and what is the difference?
        1. With a transparent proxy, the target server knows you are using a proxy and also knows your real IP.
        2. With an anonymous proxy, the target server knows you are using a proxy but does not know your real IP.
        3. With an elite (high-anonymity) proxy, the target server knows neither that you are using a proxy nor your real IP.
- CAPTCHA challenges
    - CAPTCHA-solving platforms, e.g. Yundama (云打码) and Chaojiying (超级鹰)
- Dynamically loaded pages: the site returns JS data rather than the real page content
    - Use selenium to drive a real browser and send the request
- Data encryption
    - Analyze the JS code
二、Using the urllib library
- Basic usage
import urllib.request

# Use urllib to fetch the Baidu homepage source
url = 'http://www.baidu.com'
# Simulate a browser sending the request
resp = urllib.request.urlopen(url)
print(resp)
context = resp.read().decode("utf-8")
print(context)
三、Customizing the request object
UA recap: "User Agent" (UA for short) is a special header string that lets a server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, language, plugins, and so on.
Syntax: request = urllib.request.Request()
import urllib.request
url = 'https://www.baidu.com'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
}
# Pass headers by keyword: the second positional argument of Request() is data, not headers
requestObj = urllib.request.Request(url=url, headers=headers)
resp = urllib.request.urlopen(requestObj)
print(resp.read().decode("utf-8"))
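A Request object can be inspected before anything is sent, which is a handy way to confirm the headers and method are what you expect. A minimal offline sketch (no request goes out):

```python
import urllib.request

# Build Request objects without sending them, just to see what urlopen would use.
get_req = urllib.request.Request("http://www.baidu.com",
                                 headers={"User-Agent": "Mozilla/5.0"})
print(get_req.full_url)      # the URL the request targets
print(get_req.get_method())  # GET: no data payload attached

# Attaching a data payload switches the method to POST automatically.
post_req = urllib.request.Request("http://www.baidu.com", data=b"kw=spider")
print(post_req.get_method())  # POST
```

This is also why `Request(url, [], headers)` is a bug: the list lands in the `data` slot and silently turns the request into a POST attempt.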
四、Encoding and decoding
1、GET requests: urllib.parse.quote()
import urllib.parse
import urllib.request
searchWord = urllib.parse.quote("蝙蝠侠")
url = 'https://www.sogou.com/web?query='+ searchWord
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
# print(url)
requestObj = urllib.request.Request(url=url,headers=headers)
resp = urllib.request.urlopen(requestObj)
context = resp.read().decode("utf-8")
print(context)
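What quote() actually produces is just the UTF-8 percent-encoding of the text, and unquote() reverses it:

```python
import urllib.parse

# Percent-encode the non-ASCII search word so it can live inside a URL.
encoded = urllib.parse.quote("蝙蝠侠")
print(encoded)  # %E8%9D%99%E8%9D%A0%E4%BE%A0

# unquote() reverses the encoding.
print(urllib.parse.unquote(encoded))  # 蝙蝠侠
```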
2、GET requests: urllib.parse.urlencode()
import urllib.parse
import urllib.request
data = {
    'query': '蝙蝠侠',
    'w': '01019900'
}
searchWord = urllib.parse.urlencode(data)
preUrl = 'https://www.sogou.com/web?'
url = preUrl+searchWord
requestObj = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'})
resp = urllib.request.urlopen(requestObj)
context = resp.read().decode("utf-8")
print(context)
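urlencode() handles both the percent-encoding and the `key=value&...` joining in one step, which is why it is the better choice once there is more than one parameter:

```python
import urllib.parse

# urlencode() percent-encodes each value and joins the pairs with '&'.
data = {'query': '蝙蝠侠', 'w': '01019900'}
qs = urllib.parse.urlencode(data)
print(qs)  # query=%E8%9D%99%E8%9D%A0%E4%BE%A0&w=01019900
```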
3、POST requests
# e.g. Baidu Translate
import urllib.request
import urllib.parse
url = 'https://fanyi.baidu.com/sug'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
keyword = input('Enter the word to look up: ')
data = {
    'kw': keyword
}
# POST data must be urlencoded and then encoded to bytes
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url, headers=headers, data=data)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
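The key difference from GET: the body passed as `data` must be bytes, so the urlencoded string has to go through `.encode()`; passing the plain str to urlopen raises a TypeError. A minimal check:

```python
import urllib.parse

# urlencode() gives a str; .encode() turns it into the bytes a POST body needs.
payload = urllib.parse.urlencode({'kw': 'spider'})
print(payload)  # kw=spider (still a str)
body = payload.encode('utf-8')
print(body)     # b'kw=spider' (ready for Request(data=...))
```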
- Baidu detailed translation (v2transapi)
import urllib.parse
import urllib.request
import json
url = 'https://fanyi.baidu.com/v2transapi?from=zh&to=en'
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
'Accept':'*/*',
# 'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'zh-CN,zh;q=0.9,en;q=0.8',
'Connection':'keep-alive',
'Content-Length':'148',
'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie':'BIDUPSID=E4EED4AD32F68A0029150518B71E6473; PSTM=1644069814; BAIDUID=E4EED4AD32F68A00C795D774149F4DCE:FG=1; __yjs_duid=1_03f1094640beee76272a007638c459db1644071304166; BDUSS=dsWE1lc0kxbWdZTHNLejFRanNJOVNvZjVQSjJaZnpFWFlIa2t6ZXAyYnhqaWxpSVFBQUFBJCQAAAAAAAAAAAEAAAA0MjAmb285OTU5MzE1NzYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPEBAmLxAQJiR1; BDUSS_BFESS=dsWE1lc0kxbWdZTHNLejFRanNJOVNvZjVQSjJaZnpFWFlIa2t6ZXAyYnhqaWxpSVFBQUFBJCQAAAAAAAAAAAEAAAA0MjAmb285OTU5MzE1NzYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPEBAmLxAQJiR1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_PREFER_SWITCH=1; SOUND_SPD_SWITCH=1; APPGUIDE_10_0_2=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1644391469,1644461433; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; delPer=0; PSINO=5; BAIDUID_BFESS=0434881F591F38E40670B2C4E152E2CB:FG=1; H_PS_PSSID=34430_35104_31254_35489_34584_35490_35542_35797_35320_26350_35746; BA_HECTOR=0l0g008g802g8h8ljs1h09avc0r; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1644474869; ab_sr=1.0.1_NjFhODYxOTg3ODkzNmExOThhNGNmZjNkYWQ1MGUyYjgyNWEzMTMzOTcxM2IzY2Q1NDk3ZjE4YzgyZDhlNzIyNTdlYTBhZjY5MTFjOWRiYzgzYTg5ZGExNjQxNGVkM2U1YjQyNzVjMDc4N2M0YTVjYzcyYzAyODMyNjY0MDEyMTM0MTQ4NzllMmFiZTM3MDNlYzY3OTUzMTBmMWE2NzM2YmY5ZDQ5Nzk1NDMzMWI0MWI0ZTg4NDNkNTFmM2M4ZTFm',
'Host':'fanyi.baidu.com',
'Origin': 'https://fanyi.baidu.com',
'Referer': 'https://fanyi.baidu.com/translate?aldtype=16047&query=&keyfrom=baidu&smartresult=dict&lang=auto2zh',
'sec-ch-ua': '"Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': "Windows",
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'X-Requested-With': 'XMLHttpRequest'
}
data = {
'from': 'zh',
'to': 'en',
'query': '房子',
'transtype': 'realtime',
'simple_means_flag': '3',
'sign': '289459.35202',
'token': '1ee0174682b16644fbbcde79861a56e5',
'domain': 'common',
}
key = urllib.parse.urlencode(data).encode('utf-8')
requestObj = urllib.request.Request(url,key,headers)
resp = urllib.request.urlopen(requestObj)
context = resp.read().decode('utf-8')
# Avoid shadowing the built-in str
obj = json.loads(context)
print(obj)
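json.loads() turns the response text into ordinary Python dicts and lists. The body below is a hypothetical stand-in for a translate-API reply (the field names are assumptions), just to show the parsing step in isolation:

```python
import json

# Hypothetical response body; "errno"/"data"/"k" are assumed field names.
raw = '{"errno": 0, "data": [{"k": "spider", "v": "n. spider"}]}'
parsed = json.loads(raw)
print(parsed["errno"])         # 0
print(parsed["data"][0]["k"])  # spider
```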
五、AJAX GET requests
- GET the first ten pages of Douban movies
import urllib.parse
import urllib.request
import json
# url = 'https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start='+page+'&limit=20'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}
def getjson(page):
    url = 'https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start=' + str(
        (page - 1) * 20) + '&limit=20'
    requestObj = urllib.request.Request(url=url, headers=headers)
    resp = urllib.request.urlopen(requestObj)
    context = resp.read().decode('utf-8')
    return context

def download(page, context):
    with open("豆瓣电影_第" + str(page) + "页.json", 'w', encoding='utf-8') as fp:
        fp.write(context)
if __name__ == '__main__':
    start_page = int(input("Start page: "))
    end_page = int(input("End page: "))
    for page in range(start_page, end_page + 1):
        context = getjson(page)
        if context != '[]':
            download(page, context)
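The page-to-offset arithmetic inside getjson() is worth isolating: the API pages via `start`/`limit`, so page 1 maps to start=0, page 2 to start=20, and so on:

```python
def start_offset(page, limit=20):
    # Convert a 1-based page number to the 0-based start offset the API expects.
    return (page - 1) * limit

print(start_offset(1))   # 0
print(start_offset(2))   # 20
print(start_offset(10))  # 180
```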
六、AJAX POST requests
- POST request for KFC store information in Wenzhou
import urllib.parse
import urllib.request
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
def getSearchParams(page, city):
    data = {
        'cname': city,
        'pageIndex': page,
        'pageSize': '10'
    }
    searchParams = urllib.parse.urlencode(data).encode("utf-8")
    return searchParams

def getContext(searchParams):
    reqObj = urllib.request.Request(url=url, headers=headers, data=searchParams)
    resp = urllib.request.urlopen(reqObj)
    context = resp.read().decode('utf-8')
    return context

def download(city, index, context):
    with open("肯德基门店信息_" + city + "_" + str(index) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(context)
if __name__ == '__main__':
    page = int(input("How many pages to crawl? "))
    city = input("Which city? ")
    for index in range(1, page + 1):
        searchParams = getSearchParams(index, city)
        context = getContext(searchParams)
        if context != '[]':
            download(city, index, context)
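Once a page comes back, the store list sits inside the JSON body. The body below is a hypothetical example of the response shape (field names such as `Table1` are assumptions, not confirmed by the source), showing how the stores would be pulled out:

```python
import json

# Hypothetical store-list response; "Table1"/"storeName" are assumed field names.
body = '{"Table":[{"rowcount":2}],"Table1":[{"storeName":"Store A"},{"storeName":"Store B"}]}'
stores = json.loads(body).get("Table1", [])
print(len(stores))             # 2
print(stores[0]["storeName"])  # Store A
```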
七、URLError / HTTPError
e.g.:
import urllib.request
import urllib.error
url = 'https://blog.csdn.net/ityard/article/details/102646738'
# url = 'http://www.goudan11111.com'
headers = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'Accept-Language': 'zh-CN,zh;q=0.9',
    # 'Cache-Control': 'max-age=0',
    # 'Connection': 'keep-alive',
    'Cookie': 'uuid_tt_dd=10_19284691370-1530006813444-566189; smidV2=2018091619443662be2b30145de89bbb07f3f93a3167b80002b53e7acc61420; _ga=GA1.2.1823123463.1543288103; dc_session_id=10_1550457613466.265727; acw_tc=2760821d15710446036596250e10a1a7c89c3593e79928b22b3e3e2bc98b89; Hm_lvt_e5ef47b9f471504959267fd614d579cd=1571329184; Hm_ct_e5ef47b9f471504959267fd614d579cd=6525*1*10_19284691370-1530006813444-566189;__yadk_uid=r0LSXrcNYgymXooFiLaCGt1ahSCSxMCb; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1571329199,1571329223,1571713144,1571799968;acw_sc__v2=5dafc3b3bc5fad549cbdea513e330fbbbee00e25; firstDie=1; SESSION=396bc85c-556b-42bd-890c-c20adaaa1e47; UserName=weixin_42565646; UserInfo=d34ab5352bfa4f21b1eb68cdacd74768;UserToken=d34ab5352bfa4f21b1eb68cdacd74768;UserNick=weixin_42565646;AU=7A5;UN=weixin_42565646;BT=1571800370777;p_uid=U000000;dc_tos=pzt4xf;Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1571800372;Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=1788*1*PC_VC!6525*1*10_192846913701530006813444566189!5744*1*weixin_42565646;announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblogdev.blog.csdn.net%252Farticle%252Fdetails%252F102605809%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D',
    # 'Host': 'blog.csdn.net',
    # 'Referer': 'https://passport.csdn.net/login?code=public',
    # 'Sec-Fetch-Mode': 'navigate',
    # 'Sec-Fetch-Site': 'same-site',
    # 'Sec-Fetch-User': '?1',
    # 'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
}
try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:
    print(1111)
except urllib.error.URLError:
    print(2222)
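The order of the except clauses matters: HTTPError is a subclass of URLError, so catching URLError first would swallow HTTP errors too. Both facts can be checked without any network access:

```python
import urllib.error

# HTTPError subclasses URLError, so the more specific clause must come first.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# An HTTPError carries the status code that triggered it.
err = urllib.error.HTTPError("http://example.com", 404, "Not Found", None, None)
print(err.code)  # 404
```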
八、Handlers
- Basic usage
import urllib.request

url = 'http://www.baidu.com'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}
reqObj = urllib.request.Request(url=url, headers=headers)
handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)
resp = opener.open(reqObj)
context = resp.read().decode("utf-8")
print(context)
- Using a proxy
import random
import urllib.parse
import urllib.request
# Random proxy pool (the ports here are placeholders from the original, not valid port numbers)
proxies_pool = [
    {"http": '61.163.32.88:312811'},
    {"http": '61.163.32.88:3128222'},
]
proxies = random.choice(proxies_pool)
print(proxies)
url = 'http://www.baidu.com/s?wd=ip'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}
reqObj = urllib.request.Request(url=url,headers=headers)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
resp = opener.open(reqObj)
context = resp.read().decode("utf-8")
with open("代理.html",'w',encoding='utf-8') as fp:
fp.write(context)
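If the same proxy opener should apply to every request in a script, install_opener() makes it the process-wide default, so plain urlopen() calls route through it. A sketch with a placeholder proxy address (nothing is actually contacted here):

```python
import urllib.request

# Placeholder proxy address for illustration; no request is sent in this sketch.
proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8888"})
default_opener = urllib.request.build_opener(proxy_handler)

# Make this opener the default used by urllib.request.urlopen from now on.
urllib.request.install_opener(default_opener)
```

This avoids passing the opener around: after install_opener(), every `urllib.request.urlopen(...)` in the process uses the proxy.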