High-Performance Asynchronous Crawlers
Posted by jnhnsnow
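The get method (requests.get) blocks: the program waits until the response returns.
Ways to make a crawler asynchronous:
- Multithreading / multiprocessing (not recommended)
Pros: each blocking operation can be given its own thread or process, so the operations run asynchronously.
Cons: threads and processes cannot be created without limit.
- Thread pool / process pool (use in moderation)
Pros: reduces how often the system creates and destroys threads or processes, which lowers system overhead.
Cons: the number of threads or processes in the pool has an upper limit. When the number of blocking operations far exceeds the number of workers in the pool, the speedup is not significant.
Principle: use a pool for operations that are both blocking and time-consuming.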
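The trade-off between the two approaches can be seen in a small experiment. This is only a sketch: `fake_request` is a hypothetical stand-in that sleeps instead of calling requests.get, and the delay and pool size are arbitrary values chosen for illustration.

```python
import threading
import time
from multiprocessing.dummy import Pool  # thread-based Pool


def fake_request(delay):
    """Hypothetical stand-in for a blocking call such as requests.get."""
    time.sleep(delay)
    return delay


def run_one_thread_per_task(delays):
    """Start a dedicated thread for every task and wait for all of them."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request, args=(d,)) for d in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def run_with_pool(delays, size):
    """Run the same tasks on a pool with a fixed number of worker threads."""
    start = time.perf_counter()
    with Pool(size) as pool:
        pool.map(fake_request, delays)
    return time.perf_counter() - start


delays = [0.2] * 6  # six blocking tasks of 0.2 s each
threaded_time = run_one_thread_per_task(delays)
pooled_time = run_with_pool(delays, 2)
print("one thread per task: %.2fs" % threaded_time)
print("pool of 2 workers:   %.2fs" % pooled_time)
```

With six tasks and only two workers, the pool has to process the tasks in at least three rounds, so it takes noticeably longer than one thread per task. The pool, however, never runs more than two threads at once, which is the point of bounding it.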
Basic thread pool usage:
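    from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API
    import time

    start_time = time.time()


    def f1(name):
        print("%s is running" % name)
        time.sleep(2)  # simulate a blocking operation
        print("%s running done" % name)


    name_list = ['a', 'b', 'c', 'd']

    # Instantiate the thread pool object with 4 worker threads
    pool = Pool(4)
    # pool.map(func, iterable): call f1 on every item and block until all are done
    pool.map(f1, name_list)
    print(time.time() - start_time)  # about 2 s, versus about 8 s when run one by one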
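One detail of pool.map worth knowing: it blocks until every task has finished and returns the results as a list in the order of the input, regardless of which task completes first. A minimal sketch, where `slow_square` is a made-up function whose larger inputs deliberately finish sooner:

```python
import time
from multiprocessing.dummy import Pool


def slow_square(n):
    # Larger inputs sleep less, so completion order differs from input order.
    time.sleep((4 - n) * 0.05)
    return n * n


with Pool(4) as pool:
    results = pool.map(slow_square, [1, 2, 3, 4])

print(results)  # [1, 4, 9, 16]
```

Because the order is preserved, the results can be matched back to the inputs by position.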
Thread pool case study:
- Pear Video (梨视频): the most popular videos in the Life category
    from multiprocessing.dummy import Pool
    import re

    import requests
    from lxml import etree

    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
    }

    url = 'https://www.pearvideo.com/category_5'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//li[@class="categoryem"]')

    urls = []  # name and URL of every video
    for li in li_list:
        detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
        name = li.xpath(".//div[@class='vervideo-title']/text()")[0] + ".mp4"
        res = requests.get(url=detail_url, headers=headers).text
        # The video URL is dynamically loaded data inside a <script> tag, which
        # XPath cannot match, so a regular expression is used instead.
        ex = 'srcUrl="(.*?)",vdoUrl'
        video_url = re.findall(ex, res)[0]
        dic = {'name': name, 'url': video_url}
        urls.append(dic)


    def f1(dic):
        print(dic['name'], "downloading")
        video_content = requests.get(url=dic['url'], headers=headers).content
        # Persist the video to disk
        with open(dic['name'], 'wb') as f:
            f.write(video_content)
        print(dic['name'], "downloaded")


    pool = Pool(5)
    pool.map(f1, urls)  # the downloads are the blocking, time-consuming part
    pool.close()
    pool.join()
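The regular expression step can be tried on its own, without any network access. The sketch below runs the same pattern against a made-up page fragment. Note that `sample_page` and the example.com URL are assumptions for illustration only, not real Pear Video markup, and the guard against an empty match list is an addition that the case study code does not have.

```python
import re

# Hypothetical fragment of a detail page's <script> block; the real page may differ.
sample_page = '''
<script>
var contId="0000000",srcUrl="https://example.com/videos/sample.mp4",vdoUrl=srcUrl,skinRes="";
</script>
'''

pattern = 'srcUrl="(.*?)",vdoUrl'  # non-greedy capture of the quoted URL

matches = re.findall(pattern, sample_page)
# re.findall returns an empty list when nothing matches, so guard before indexing.
video_url = matches[0] if matches else None
print(video_url)

no_match = re.findall(pattern, '<html></html>')
print(no_match)  # []
```

Indexing `[0]` directly, as in the case study, raises IndexError whenever a page lacks the expected script text, so a check like the one above is a reasonable precaution in a crawler that visits many pages.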