Multiprocess Crawlers
Posted xuxaut-558
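Introduction to multiprocessing
A process is a running instance of a program: running one script file starts one process. Multiprocessing means running several such processes at the same time.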
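Why learn multiprocessing
To make the crawler faster: several pages can be fetched and parsed at once instead of one after another.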
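The difference between multiprocessing and multithreading
Factory ==> workshop ==> worker. Think of the machine as the factory, each process as a workshop inside it, and each thread as a worker inside a workshop. Workers in the same workshop share its space (memory); separate workshops do not.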
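The workshop/worker analogy can be made concrete with a small experiment. The sketch below is illustrative only: the names `state`, `bump`, `run_in_thread` and `run_in_process` are made up for this example. It increments a module-level counter once from a thread and once from a child process, to show that a thread shares the parent's memory while a child process works on its own copy.

```python
import os
import threading
from multiprocessing import Process

# Module-level state, used to show what is and is not shared (hypothetical name).
state = {"counter": 0}


def bump():
    """Increment the counter and report which process and thread did it."""
    state["counter"] += 1
    print(f"pid={os.getpid()} "
          f"thread={threading.current_thread().name} "
          f"counter={state['counter']}")


def run_in_thread():
    """Run bump() in a second thread of the current process."""
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()


def run_in_process():
    """Run bump() in a separate child process."""
    child = Process(target=bump)
    child.start()
    child.join()


if __name__ == "__main__":
    run_in_thread()
    # The thread lives in our process, so our counter has changed.
    print("after thread: ", state["counter"])
    run_in_process()
    # The child process changed only its own copy; ours is unchanged.
    print("after process:", state["counter"])
```

The child process prints a different `pid` from the parent, and the value printed after it finishes is the same as the value printed after the thread. This isolation is why results from worker processes have to be sent back explicitly, for example as the return value of `pool.map`.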
How to use multiprocessing
from multiprocessing import Pool

pool = Pool(processes=4)   # create a pool of 4 worker processes
pool.map(func, iterable)   # call func on every item of iterable, spread across the workers
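As a self-contained illustration of the `Pool` pattern, here is a minimal sketch that swaps the crawler for a trivial function. `square` and `parallel_squares` are hypothetical names chosen for this example, not part of the original post.

```python
from multiprocessing import Pool


def square(n):
    """Stand-in for real work. Defined at module top level so workers can import it."""
    return n * n


def parallel_squares(numbers, processes=4):
    """Apply square() to every item using a pool of worker processes."""
    # The with-block shuts the pool down once the work is finished.
    with Pool(processes=processes) as pool:
        # map() blocks until every item is done and returns results in input order.
        return pool.map(square, numbers)


if __name__ == "__main__":
    print(parallel_squares(range(6)))  # [0, 1, 4, 9, 16, 25]
```

The `if __name__ == "__main__":` guard matters here: on platforms that start workers by re-importing the main module (Windows, for instance), unguarded pool creation would run again in every worker.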
Performance comparison
import re
import time
from multiprocessing import Pool

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}


def re_scraper(url):
    # Fetch one listing page and pull out its fields with regular expressions.
    res = requests.get(url, headers=headers)
    names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # The two patterns below are identical as given, so both lists hold the same matches.
    laughs = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name,
            'content': content,
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos


if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/8hr/page/{}/'.format(str(i)) for i in range(1, 36)]

    # Baseline: fetch the 35 pages one after another in a single process.
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print('Serial crawler time:', end_1 - start_1)

    # Same work spread over 2 worker processes.
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_scraper, urls)
    end_2 = time.time()
    print('2-process crawler time:', end_2 - start_2)

    # Same work spread over 4 worker processes.
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_scraper, urls)
    end_3 = time.time()
    print('4-process crawler time:', end_3 - start_3)

Output:

[Running] python "f:\WWW\test_py\compare_test.py"
Serial crawler time: 14.95523715019226
2-process crawler time: 11.39123272895813
4-process crawler time: 4.0303635597229

[Done] exited with code=0 in 32.827 seconds
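To read the timings above as speedups, divide the serial time by each parallel time. The sketch below does only that arithmetic on the three reported numbers. `speedup`, `efficiency` and `report` are hypothetical helper names introduced for this example.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run was than the serial run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, processes):
    """Speedup per worker process; 1.0 would be perfect scaling."""
    return speedup(serial_seconds, parallel_seconds) / processes


# Timings copied from the run above, keyed by number of processes.
timings = {
    1: 14.95523715019226,
    2: 11.39123272895813,
    4: 4.0303635597229,
}


def report(timings):
    """Return one summary line per configuration, using the 1-process run as baseline."""
    serial = timings[1]
    return [
        f"{n} process(es): {speedup(serial, secs):.2f}x speedup, "
        f"{efficiency(serial, secs, n):.0%} efficiency"
        for n, secs in sorted(timings.items())
    ]


if __name__ == "__main__":
    for line in report(timings):
        print(line)
```

On these numbers the 2-process run comes out at about 1.31x and the 4-process run at about 3.71x. Each configuration was timed only once over the network, so the uneven scaling between 2 and 4 processes may simply reflect run-to-run variation in response times; repeating each measurement would give a more reliable picture.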