Web Crawlers: Starting with Performance

Posted by 零度大白

Performance

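When writing a crawler, most of the time is spent waiting on IO requests. In single-process, single-threaded mode, every URL request blocks until its response arrives, which slows the whole crawl down.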

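import requests


def fetch_async(url):
    # Blocks until the server responds; nothing else runs in the meantime.
    response = requests.get(url)
    return response


url_list = ["http://www.github.com", "http://www.bing.com"]

# Each request starts only after the previous one has finished.
for url in url_list:
    fetch_async(url)
1. Synchronous execution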
from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ["http://www.github.com", "http://www.bing.com"]
pool = ThreadPoolExecutor(5)  # up to 5 requests in flight at once
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)  # block until every submitted request has finished
2. Multithreaded execution
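
To make the gain from the thread pool concrete, here is a minimal sketch that times the sequential and pooled versions side by side. It does not touch the network: `fake_fetch` and `DELAY` are hypothetical stand-ins (not from the original post) that replace `requests.get` with a fixed `time.sleep`, so the numbers only illustrate the shape of the effect.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # assumed per-request latency in seconds (stand-in for network IO)


def fake_fetch(url):
    """Hypothetical stand-in for requests.get: blocks like a slow request."""
    time.sleep(DELAY)
    return url


def run_sequential(urls):
    # One request at a time: total time is roughly len(urls) * DELAY.
    start = time.perf_counter()
    results = [fake_fetch(u) for u in urls]
    return results, time.perf_counter() - start


def run_threaded(urls, workers=5):
    # Requests overlap their waits: total time is roughly one DELAY.
    start = time.perf_counter()
    with ThreadPoolExecutor(workers) as pool:
        results = list(pool.map(fake_fetch, urls))  # map keeps input order
    return results, time.perf_counter() - start


urls = ["http://www.github.com", "http://www.bing.com"]
seq_results, seq_time = run_sequential(urls)
thr_results, thr_time = run_threaded(urls)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The threads spend nearly all of their time blocked in the sleep, which is exactly the situation a real crawler is in while it waits for a response.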
from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    # Called once the request has finished; result() returns the response.
    print(future.result())


url_list = ["http://www.github.com", "http://www.bing.com"]
pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
2. Multithreaded execution with a callback
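
One detail of the callback style matters for a crawler: `future.result()` re-raises whatever exception the request raised, so a callback usually needs to check for failure first. The following is a rough sketch of one way to do that, not the author's code. `fake_fetch`, `make_callback`, and the `http://bad.invalid` URL are hypothetical and only simulate a failing request without any network access.

```python
from concurrent.futures import ThreadPoolExecutor

results = {}  # url -> page body for requests that succeeded
errors = {}   # url -> exception for requests that failed


def fake_fetch(url):
    """Hypothetical stand-in for requests.get; fails for one URL on purpose."""
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"<html from {url}>"


def make_callback(url):
    # Bind the URL so the callback knows which request it belongs to.
    def callback(future):
        exc = future.exception()  # None if the task finished normally
        if exc is not None:
            errors[url] = exc
        else:
            results[url] = future.result()
    return callback


url_list = ["http://www.github.com", "http://bad.invalid"]
with ThreadPoolExecutor(5) as pool:  # leaving the block waits for all tasks
    for url in url_list:
        future = pool.submit(fake_fetch, url)
        future.add_done_callback(make_callback(url))

print(results)
print(errors)
```

Checking `future.exception()` before `future.result()` keeps one failed URL from hiding behind a traceback that the executor only logs.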
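from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ["http://www.github.com", "http://www.bing.com"]

# The guard keeps worker processes from re-running this block on import,
# which matters on platforms that spawn rather than fork (e.g. Windows).
if __name__ == "__main__":
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    pool.shutdown(wait=True)
3. Multiprocess execution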
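from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    # Runs in the parent process once the worker has returned its response.
    print(future.result())


url_list = ["http://www.github.com", "http://www.bing.com"]

# The guard keeps worker processes from re-running this block on import.
if __name__ == "__main__":
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)
3. Multiprocess execution with a callback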

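All of the approaches above improve request throughput. The drawback of multithreading and multiprocessing is that each thread or process sits idle while it is blocked on IO, so that worker is wasted for the duration of the wait. For this reason asynchronous IO is the preferred choice.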
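
As an illustration of the asynchronous approach, here is a minimal sketch using the standard-library `asyncio` module. It is only an assumption about what such code might look like, not the author's implementation: `fake_fetch`, `crawl`, and `DELAY` are hypothetical, and `asyncio.sleep` stands in for a real non-blocking HTTP request so that the example runs without network access.

```python
import asyncio
import time

DELAY = 0.3  # assumed per-request latency in seconds (stand-in for network IO)


async def fake_fetch(url):
    """Hypothetical non-blocking request: yields to the event loop while waiting."""
    await asyncio.sleep(DELAY)
    return f"done: {url}"


async def crawl(urls):
    # gather() runs all coroutines concurrently on one thread and
    # returns their results in the same order as the inputs.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


url_list = ["http://www.github.com", "http://www.bing.com"]
start = time.perf_counter()
pages = asyncio.run(crawl(url_list))
elapsed = time.perf_counter() - start
print(pages, f"{elapsed:.2f}s")
```

A single thread handles both requests here: while one coroutine waits, the event loop switches to the other, so no worker is left idle during the wait.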

 
