Multiprocess Crawler

Posted xuxaut-558


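Introduction to multiprocessing

A process is a running instance of a program. Running one script file starts one process; multiprocessing means running several such processes at the same time.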
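To make the idea concrete, here is a minimal sketch that starts a few worker processes and has each task report the ID of the process that ran it. The function name `report_pid` and the pool size of 3 are illustrative choices, not part of the original post.

```python
import os
from multiprocessing import Pool


def report_pid(task_id):
    """Return the task number together with the ID of the process that ran it."""
    return task_id, os.getpid()


if __name__ == "__main__":
    print("parent process:", os.getpid())
    # Six tasks are spread over three worker processes.
    with Pool(processes=3) as pool:
        for task_id, pid in pool.map(report_pid, range(6)):
            print("task", task_id, "ran in process", pid)
```

The worker process IDs differ from the parent's, which shows that the tasks really ran in separate processes.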


Why learn multiprocessing

To improve crawler efficiency.


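Difference between multiprocessing and multithreading

A common analogy: factory ==> workshop ==> worker. The computer is the factory, each process is a workshop with its own space and resources, and each thread is a worker inside a workshop who shares that workshop's resources with the other workers.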
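One practical consequence of this difference is memory sharing: threads in the same process see the same variables, while a child process works on its own copy. The sketch below illustrates this under simple assumptions; the names `counter` and `increment` are made up for the example.

```python
import threading
from multiprocessing import Process

counter = 0  # module-level variable used to observe memory sharing


def increment():
    """Add one to the module-level counter and return the new value."""
    global counter
    counter += 1
    return counter


if __name__ == "__main__":
    t = threading.Thread(target=increment)
    t.start()
    t.join()
    # The thread ran inside this process, so it changed this process's variable.
    print("after thread:", counter)

    p = Process(target=increment)
    p.start()
    p.join()
    # The child process incremented its own copy; the parent's value is unchanged.
    print("after process:", counter)
```

Both lines print the same value, because the increment done in the child process never reaches the parent. This is why `pool.map` hands results back as return values instead of relying on shared variables.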


How to use multiprocessing

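from multiprocessing import Pool
# Create a pool of 4 worker processes.
pool = Pool(processes=4)
# Call func on every item of iterable, spreading the calls across the workers.
pool.map(func, iterable)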
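The three lines above leave `func` and `iterable` undefined, so here is a small runnable sketch of the same pattern. The `square` function is a stand-in chosen for illustration; in a crawler it would be the function that fetches and parses one URL.

```python
from multiprocessing import Pool


def square(n):
    """Stand-in for real work; a crawler would fetch and parse a URL here."""
    return n * n


if __name__ == "__main__":
    # The with-block cleans up the worker processes once the map has finished.
    with Pool(processes=4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]
```

`pool.map` blocks until every item has been processed and returns the results in input order. The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the script, such as Windows.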
 

Performance comparison

URL to crawl: https://www.qiushibaike.com/8hr/page/1/

import re
import time
from multiprocessing import Pool

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
}

def re_scraper(url):
    # Fetch one page and extract its fields with regular expressions.
    res = requests.get(url, headers=headers)
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # The post lists the same pattern for laughs and comments, so both lists come out identical.
    laughs = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            "name": name,
            "content": content,
            "laugh": laugh,
            "comment": comment
        }
        infos.append(info)
    return infos

if __name__ == "__main__":
    urls = ["https://www.qiushibaike.com/8hr/page/{}/".format(str(i)) for i in range(1, 36)]
    # Baseline: fetch the 35 pages one after another.
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print("Serial crawler time:", end_1 - start_1)

    # Same work with a pool of 2 processes.
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_scraper, urls)
    end_2 = time.time()
    print("2-process crawler time:", end_2 - start_2)

    # Same work with a pool of 4 processes.
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_scraper, urls)
    end_3 = time.time()
    print("4-process crawler time:", end_3 - start_3)
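The extraction step in `re_scraper` can be tried without network access. The sketch below applies the same three patterns to a hypothetical HTML fragment written for this example (it is not copied from the site) and shows why the `re.S` flag matters: it lets `.` match newlines, so a field that spans several lines is still captured. Stripping whitespace and converting the number to `int` are additions for readability.

```python
import re

# Hypothetical fragment, written for this example; not copied from the site.
SAMPLE_HTML = """
<h2>
alice
</h2>
<div class="content">
<span>first joke</span>
</div>
<i class="number">12</i>
<h2>
bob
</h2>
<div class="content">
<span>second
joke</span>
</div>
<i class="number">7</i>
"""


def parse(html):
    """Extract name, content and number fields with the patterns from re_scraper."""
    names = re.findall(r'<h2>(.*?)</h2>', html, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', html, re.S)
    numbers = re.findall(r'<i class="number">(\d+)</i>', html, re.S)
    return [
        {"name": n.strip(), "content": c.strip(), "number": int(k)}
        for n, c, k in zip(names, contents, numbers)
    ]


if __name__ == "__main__":
    for row in parse(SAMPLE_HTML):
        print(row)
```

Because `zip` stops at the shortest list, a page on which one pattern matches fewer times than the others silently loses entries, so comparing the list lengths is a useful check when adapting this to a real page.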

 


Results:

[Running] python "f:\WWW\test_py\compare_test.py"
Serial crawler time: 14.95523715019226
2-process crawler time: 11.39123272895813
4-process crawler time: 4.0303635597229

[Done] exited with code=0 in 32.827 seconds
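To read these timings as speedups, divide the serial time by each parallel time. The snippet below does that arithmetic with the three numbers from this run; the helper name `speedup` is just for illustration.

```python
# Timings copied from the run above, in seconds.
serial = 14.95523715019226
two_procs = 11.39123272895813
four_procs = 4.0303635597229


def speedup(baseline, measured):
    """How many times faster the measured run is than the baseline."""
    return baseline / measured


print(round(speedup(serial, two_procs), 2))   # 1.31
print(round(speedup(serial, four_procs), 2))  # 3.71
```

These figures come from a single run against a live site, so network conditions may account for part of the difference; repeating the measurement several times would give a more reliable comparison.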

 

