How can I improve the aiohttp crawler speed?


【Title】: How can I improve the aiohttp crawler speed? 【Posted】: 2020-10-22 13:27:49 【Question】:
import asyncio
import time

import aiohttp
from bs4 import BeautifulSoup
from xlrd import open_workbook
from xlwt import Workbook

url_list = [
    "https://www.facebook.com",
    "https://www.baidu.com",
    "https://www.yahoo.com",
    # ...
]
# There are more than 20000 different websites in the list
# Some websites may not be accessible
keywords = ['xxx', 'xxx']  # ....
start = time.time()
localtime = time.asctime(time.localtime(time.time()))
print("start time :", localtime)
choose_url = []
url_title = []


async def get(url, session):
    try:
        # timeout=0: no request timeout is enforced
        async with session.get(url=url, timeout=0) as response:
            resp = await response.text()
            soup = BeautifulSoup(resp, "lxml")
            title = soup.find("title").text.strip()
            # Keep the URL if any keyword appears in the page title
            for keyword in keywords:
                if keyword in title:
                    choose_url.append(url)
                    url_title.append(title)
                    print("Successfully got url {} with resp's name {}.".format(url, title))
                    break
    except Exception as e:
        # Unreachable sites and pages without a <title> are silently skipped
        pass


async def main(urls):
    # limit=0 and limit_per_host=0 remove the connector's connection limits
    connector = aiohttp.TCPConnector(ssl=False, limit=0, limit_per_host=0)
    session = aiohttp.ClientSession(connector=connector)
    ret = await asyncio.gather(*[get(url, session) for url in urls])
    print("Finalized all. Return is a list of outputs.")
    await session.close()


def write_exccel(choose_url, url_title):
    # write choose_url, url_title to excel
    pass


asyncio.run(main(url_list))
write_exccel(choose_url, url_title)
localtime = time.asctime(time.localtime(time.time()))
print("now time is  :", localtime)
end = time.time()
print('time used:', end - start)

I have 20000 URLs to request, but it takes a long time (more than 4 or 5 hours). If I use requests + multiprocessing (Pool 4), it takes only 3 hours.

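I tried aiohttp + multiprocessing, but it doesn't seem to work. Can the code be made as fast as possible, either by optimizing this code or by using any other available technique? Thanks.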
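One thing that may be worth trying with code like this is to cap the number of requests in flight and give every URL a time budget, so that unreachable hosts cannot hold tasks open indefinitely. The sketch below is not from the original post. The helper name `crawl_bounded`, the `fetch` callback, and the default values `limit=200` and `timeout=15.0` are all assumptions chosen for illustration, and whether this helps depends on the network and on the sites being crawled.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Optional


async def crawl_bounded(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],  # hypothetical callback: url -> page text
    limit: int = 200,        # assumed cap on concurrent fetches; tune as needed
    timeout: float = 15.0,   # assumed per-URL budget in seconds
) -> List[Optional[str]]:
    """Run fetch(url) for every URL with at most `limit` running at once.

    Returns one entry per URL, in input order: the fetched text, or None
    if the fetch raised an exception or exceeded the time budget.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> Optional[str]:
        # The semaphore blocks here once `limit` workers are already fetching.
        async with semaphore:
            try:
                return await asyncio.wait_for(fetch(url), timeout)
            except Exception:
                # Covers timeouts and connection errors alike.
                return None

    return await asyncio.gather(*(worker(url) for url in urls))
```

With aiohttp, `fetch` could be a small coroutine that wraps `session.get(url)` and returns `await response.text()`. Keeping the HTTP call behind a callback also makes it possible to measure the scheduling logic separately from the network.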


【Answer 1】:

I'm not sure whether the approach below is any faster.

import time
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'demo_spider'
    start_urls = ["https://www.facebook.com","https://www.baidu.com","https://www.yahoo.com"]  # Entry page
    keywords = ['xxx','xxx']
    choose_url=[]
    url_title=[]
    concurrencyPer1s = 10
    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        title = doc.title
        if title.containsOr(self.keywords):
            self.choose_url.append(url.url)
            self.url_title.append(title.text)
            print("Successfully got url {} with resp's name {}.".format(url, title.text))
    def urlCount(self):
        count = Spider.urlCount(self)
        if count==0:
            SimplifiedMain.setRunFlag(False)
        return count

start = time.time()
localtime = time.asctime(time.localtime(time.time()))
print("start time :", localtime)
SimplifiedMain.startThread(MySpider(), {"concurrency":600, "concurrencyPer1S":100, "intervalTime":0.001, "max_workers":10})  # Start download
localtime = time.asctime(time.localtime(time.time()))
print("now time is  :", localtime)
end = time.time()
print('time used:', end - start)
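Both snippets above only look at the page's `<title>`, so a full parse tree may be more work than the task needs. As a rough sketch (not part of either post, and not benchmarked), the title can be collected with the standard library's `html.parser`. The names `TitleParser`, `extract_title`, and `matches_any` are made up for this example.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Optional


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._chunks: List[str] = []
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any <title> after the first.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._chunks).strip()


def extract_title(html: str) -> Optional[str]:
    """Return the stripped title text, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


def matches_any(title: Optional[str], keywords: Iterable[str]) -> bool:
    """True if the title exists and contains at least one keyword."""
    return title is not None and any(keyword in title for keyword in keywords)
```

A page without a `<title>` yields `None` here instead of raising, so the caller can skip it explicitly.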
