Day534. Urllib Crawler - Python

Posted by 阿昌喜欢吃黄桃



Urllib

一、Anti-Crawling Measures

  • User-Agent
    • The User Agent (UA for short) is a special request header string that lets a server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plugins, and so on.
  • Proxy IPs
    • Xici Proxy (西刺代理)
    • Kuaidaili (快代理)
    • What are high-anonymity, anonymous, and transparent proxies, and how do they differ?
      1. With a transparent proxy, the target server knows you are using a proxy and also knows your real IP.
      2. With an anonymous proxy, the target server knows you are using a proxy but does not know your real IP.
      3. With a high-anonymity proxy, the target server does not know you are using a proxy, let alone your real IP.
  • CAPTCHA-protected access
    - CAPTCHA-solving platforms
    - YunDaMa (云打码)
    - Chaojiying (超级鹰)
  • Dynamically loaded pages: the site returns JS-driven data rather than the page's real content
    • Use Selenium to drive a real browser and send the requests
  • Data encryption
    • Analyze the JS code

二、Using the urllib Library

  • Basic usage
    import urllib.request
    # Use urllib to fetch the Baidu homepage source
    url = 'http://www.baidu.com'
    
    # Simulate a browser sending a request
    resp = urllib.request.urlopen(url)
    print(resp)
    context = resp.read().decode("utf-8")
    print(context)
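Besides read(), the response object returned by urlopen() (an http.client.HTTPResponse) has a few other methods that come up constantly. A minimal sketch:

import urllib.request

resp = urllib.request.urlopen('http://www.baidu.com')

print(resp.getcode())     # HTTP status code, e.g. 200
print(resp.geturl())      # final URL after any redirects
print(resp.getheaders())  # list of (name, value) response header tuples
print(resp.readline())    # read a single line of the body instead of all of it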
    

三、Customizing the Request Object

UA introduction: the User Agent (UA for short) is a special request header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, language, plugins, and so on.

Syntax: request = urllib.request.Request()

import urllib.request

url = 'https://www.baidu.com'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
}

requestObj = urllib.request.Request(url=url, headers=headers)

resp = urllib.request.urlopen(requestObj)
print(resp.read().decode("utf-8"))
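Request() also accepts optional data and method arguments, so the same class covers more than plain GETs. A minimal sketch, purely for illustration (the UA string is shortened here, and the request only works if the server accepts HEAD):

import urllib.request

headers = {"User-Agent": "Mozilla/5.0"}

# method='HEAD' asks only for the response headers, no body is downloaded
req = urllib.request.Request('http://www.baidu.com', headers=headers, method='HEAD')
resp = urllib.request.urlopen(req)
print(resp.status, resp.getheaders())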

四、Encoding and Decoding

1. GET request: urllib.parse.quote()

import urllib.parse
import urllib.request

searchWord = urllib.parse.quote("蝙蝠侠")
url = 'https://www.sogou.com/web?query='+ searchWord
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}

# print(url)
requestObj = urllib.request.Request(url=url,headers=headers)
resp = urllib.request.urlopen(requestObj)
context = resp.read().decode("utf-8")
print(context)

2. GET request: urllib.parse.urlencode()

import urllib.parse
import urllib.request

data = {
    'query': '蝙蝠侠',
    'w': '01019900'
}

searchWord = urllib.parse.urlencode(data)
preUrl = 'https://www.sogou.com/web?'
url = preUrl+searchWord

requestObj = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'})

resp = urllib.request.urlopen(requestObj)
context = resp.read().decode("utf-8")
print(context)
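The difference between the two helpers in a nutshell: quote() percent-encodes a single value, while urlencode() takes a dict and joins key=value pairs with &, percent-encoding each value. A minimal sketch:

import urllib.parse

# a single value, percent-encoded as UTF-8 bytes
print(urllib.parse.quote("蝙蝠侠"))
# should print something like: %E8%9D%99%E8%9D%A0%E4%BE%A0

# a whole query string built from a dict
print(urllib.parse.urlencode({'query': '蝙蝠侠', 'w': '01019900'}))
# should print something like: query=%E8%9D%99%E8%9D%A0%E4%BE%A0&w=01019900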

3. POST request

# e.g. Baidu Translate

import urllib.request
import urllib.parse
url = 'https://fanyi.baidu.com/sug'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
keyword = input('请输入您要查询的单词')
data = {
    'kw': keyword
}
# POST data must be url-encoded and then converted to bytes
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url, headers=headers, data=data)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
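The sug endpoint returns JSON text full of Unicode escapes; instead of printing the raw string, you would normally parse it with json.loads. A minimal sketch that replaces the final print above (the 'data' key is what the endpoint usually uses for the suggestion list, so treat it as an assumption):

import json

content = response.read().decode('utf-8')  # raw JSON text from the request above
result = json.loads(content)               # now a regular Python dict
print(result.get('data'))                  # suggestion list, with readable Chinese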
  • Baidu detailed translation (the v2transapi endpoint). Note that the sign, token, and Cookie values below were captured from a logged-in browser session and expire over time, so you will normally need to copy fresh values from your own browser's developer tools.
import urllib.parse
import urllib.request
import json

url = 'https://fanyi.baidu.com/v2transapi?from=zh&to=en'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept':'*/*',
    # 'Accept-Encoding':'gzip, deflate, br',
    'Accept-Language':'zh-CN,zh;q=0.9,en;q=0.8',
    'Connection':'keep-alive',
    'Content-Length':'148',
    'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie':'BIDUPSID=E4EED4AD32F68A0029150518B71E6473; PSTM=1644069814; BAIDUID=E4EED4AD32F68A00C795D774149F4DCE:FG=1; __yjs_duid=1_03f1094640beee76272a007638c459db1644071304166; BDUSS=dsWE1lc0kxbWdZTHNLejFRanNJOVNvZjVQSjJaZnpFWFlIa2t6ZXAyYnhqaWxpSVFBQUFBJCQAAAAAAAAAAAEAAAA0MjAmb285OTU5MzE1NzYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPEBAmLxAQJiR1; BDUSS_BFESS=dsWE1lc0kxbWdZTHNLejFRanNJOVNvZjVQSjJaZnpFWFlIa2t6ZXAyYnhqaWxpSVFBQUFBJCQAAAAAAAAAAAEAAAA0MjAmb285OTU5MzE1NzYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPEBAmLxAQJiR1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_PREFER_SWITCH=1; SOUND_SPD_SWITCH=1; APPGUIDE_10_0_2=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1644391469,1644461433; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; delPer=0; PSINO=5; BAIDUID_BFESS=0434881F591F38E40670B2C4E152E2CB:FG=1; H_PS_PSSID=34430_35104_31254_35489_34584_35490_35542_35797_35320_26350_35746; BA_HECTOR=0l0g008g802g8h8ljs1h09avc0r; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1644474869; ab_sr=1.0.1_NjFhODYxOTg3ODkzNmExOThhNGNmZjNkYWQ1MGUyYjgyNWEzMTMzOTcxM2IzY2Q1NDk3ZjE4YzgyZDhlNzIyNTdlYTBhZjY5MTFjOWRiYzgzYTg5ZGExNjQxNGVkM2U1YjQyNzVjMDc4N2M0YTVjYzcyYzAyODMyNjY0MDEyMTM0MTQ4NzllMmFiZTM3MDNlYzY3OTUzMTBmMWE2NzM2YmY5ZDQ5Nzk1NDMzMWI0MWI0ZTg4NDNkNTFmM2M4ZTFm',
    'Host':'fanyi.baidu.com',
    'Origin': 'https://fanyi.baidu.com',
    'Referer': 'https://fanyi.baidu.com/translate?aldtype=16047&query=&keyfrom=baidu&smartresult=dict&lang=auto2zh',
    'sec-ch-ua': '"Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': "Windows",
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'X-Requested-With': 'XMLHttpRequest'
}

data = {
    'from': 'zh',
    'to': 'en',
    'query': '房子',
    'transtype': 'realtime',
    'simple_means_flag': '3',
    'sign': '289459.35202',
    'token': '1ee0174682b16644fbbcde79861a56e5',
    'domain': 'common',
}

key = urllib.parse.urlencode(data).encode('utf-8')
requestObj = urllib.request.Request(url,key,headers)
resp = urllib.request.urlopen(requestObj)
context = resp.read().decode('utf-8')
result = json.loads(context)  # avoid shadowing the built-in str
print(result)

五、AJAX GET Requests

  • GET the first ten pages of the Douban movie chart
import urllib.parse
import urllib.request
import json

# url = 'https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start='+page+'&limit=20'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}


def getjson(page):
    url = 'https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start=' + str(
        (page - 1) * 20) + '&limit=20'
    requestObj = urllib.request.Request(url=url, headers=headers)
    resp = urllib.request.urlopen(requestObj)
    context = resp.read().decode('utf-8')
    return context


def download(page, context):
    with open("豆瓣电影_第" + str(page) + "页.json", 'w', encoding='utf-8') as fp:
        fp.write(context)


if __name__ == '__main__':
    start_page = int(input("开始页"))
    end_page = int(input("结束页"))
    for page in range(start_page, end_page + 1):
        context = getjson(page)
        if context != '[]':
            download(page, context)
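Each saved file is a JSON array of movie objects. A minimal sketch of reading one back and pulling out a couple of fields; the field names 'title' and 'score' are assumptions about the Douban payload, so adjust them to whatever the saved JSON actually contains:

import json

# assumes the file written by download() above for page 1
with open("豆瓣电影_第1页.json", encoding='utf-8') as fp:
    movies = json.load(fp)

for movie in movies:
    # 'title' and 'score' are assumed field names in the Douban payload
    print(movie.get('title'), movie.get('score'))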

六、AJAX POST Requests

  • POST request for KFC store information in Wenzhou
import urllib.parse
import urllib.request


# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'


def getSearchParams(page,city):
    data = {
        'cname': city,
        'pageIndex': page,
        'pageSize': '10'
    }
    searchParams = urllib.parse.urlencode(data).encode("utf-8")
    return searchParams


def getContext(searchParams):
    reqObj = urllib.request.Request(url=url, headers=headers, data=searchParams)
    resp = urllib.request.urlopen(reqObj)
    context = resp.read().decode('utf-8')
    return context


def download(city,index,context):
    with open("肯德基门店信息_" + city + "_" + str(index)+'.json', 'w', encoding='utf-8') as fp:
        fp.write(context)


if __name__ == '__main__':
    page = int(input("爬取第几页?"))
    city = input("什么城市?")
    for index in range(1,page+1):
        searchParams = getSearchParams(index,city)
        context = getContext(searchParams)
        if context != '[]':
            download(city,index,context)

七、URLError / HTTPError

Example:
import urllib.request
import urllib.error
url = 'https://blog.csdn.net/ityard/article/details/102646738'
# url = 'http://www.goudan11111.com'
headers = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'Accept-Language': 'zh-CN,zh;q=0.9',
    # 'Cache-Control': 'max-age=0',
    # 'Connection': 'keep-alive',
    'Cookie': 'uuid_tt_dd=10_19284691370-1530006813444566189; smidV2=2018091619443662be2b30145de89bbb07f3f93a3167b80002b53e7acc61420; _ga=GA1.2.1823123463.1543288103; dc_session_id=10_1550457613466.265727; acw_tc=2760821d15710446036596250e10a1a7c89c3593e79928b22b3e3e2bc98b89; Hm_lvt_e5ef47b9f471504959267fd614d579cd=1571329184; Hm_ct_e5ef47b9f471504959267fd614d579cd=6525*1*10_19284691370-1530006813444566189;__yadk_uid=r0LSXrcNYgymXooFiLaCGt1ahSCSxMCb; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1571329199,1571329223,1571713144,1571799968;acw_sc__v2=5dafc3b3bc5fad549cbdea513e330fbbbee00e25; firstDie=1; SESSION=396bc85c556b42bd890c-c20adaaa1e47; UserName=weixin_42565646; UserInfo=d34ab5352bfa4f21b1eb68cdacd74768;UserToken=d34ab5352bfa4f21b1eb68cdacd74768;UserNick=weixin_42565646;AU=7A5;UN=weixin_42565646;BT=1571800370777;p_uid=U000000;dc_tos=pzt4xf;Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1571800372;Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=1788*1*PC_VC!6525*1*10_192846913701530006813444566189!5744*1*weixin_42565646;announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblogdev.blog.csdn.net%252Farticle%252Fdetails%252F102605809%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D',
    # 'Host': 'blog.csdn.net',
    # 'Referer': 'https://passport.csdn.net/login?code=public',
    # 'Sec-Fetch-Mode': 'navigate',
    # 'Sec-Fetch-Site': 'same-site',
    # 'Sec-Fetch-User': '?1',
    # 'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
}

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:
    print(1111)
except urllib.error.URLError:
    print(2222)
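HTTPError is a subclass of URLError, which is why it must be caught first; the exception objects also carry useful detail. A minimal sketch, assuming the same url and headers as in the example above:

import urllib.error
import urllib.request

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    # the server answered, but with an error status (404, 500, ...)
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:
    # the server could not be reached at all (bad domain, network down, ...)
    print('URL error:', e.reason)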

八、Handlers

  • Basic usage

    import urllib.request
    
    url = 'http://www.baidu.com'
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    }
    
    reqObj = urllib.request.Request(url=url,headers=headers)
    handler = urllib.request.HTTPHandler()
    opener = urllib.request.build_opener(handler)
    resp = opener.open(reqObj)
    context = resp.read().decode("utf-8")
    print(context)
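The point of going through a handler and build_opener at all is that extra capabilities can be chained in, for example carrying cookies across requests with the standard HTTPCookieProcessor. A minimal sketch:

import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)

resp = opener.open('http://www.baidu.com')
print(resp.getcode())

# cookies set by the server are now stored in the jar and will be
# sent automatically on the next opener.open() call
for cookie in cookie_jar:
    print(cookie.name, cookie.value)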
    
  • Using a proxy

import random
import urllib.parse
import urllib.request


# Random proxy pool
proxies_pool = [
    {"http": '61.163.32.88:312811'},
    {"http": '61.163.32.88:3128222'}
]

proxies = random.choice(proxies_pool)
print(proxies)


url = 'http://www.baidu.com/s?wd=ip'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}

reqObj = urllib.request.Request(url=url,headers=headers)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
resp = opener.open(reqObj)
context = resp.read().decode("utf-8")

with open("代理.html",'w',encoding='utf-8') as fp:
    fp.write(context)
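ProxyHandler expects a dict that maps URL schemes to proxy addresses, so if you also fetch https pages you need an https entry as well. A minimal sketch; the proxy address here is a placeholder, not a working proxy:

import urllib.request

# scheme -> proxy address; both keys are needed if you fetch both http and https URLs
proxies = {
    'http': 'http://127.0.0.1:8888',   # placeholder, substitute a real proxy
    'https': 'http://127.0.0.1:8888',
}

handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
# opener.open(...) would now route requests through the proxy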
