Asynchronous Crawling with a Single Thread and Multi-Task Async Coroutines

Posted 2020-12-09 hedger-lee

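This approach uses asyncio together with aiohttp.

Coroutine objects

Coroutine: an object. You can think of a coroutine function as a special kind of function: if a function's definition is modified by the async keyword, calling it does not execute the statements in its body immediately. Instead, the call returns a coroutine object.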

from time import sleep
import asyncio

async def get_request(url):
    print('Requesting:', url)
    sleep(2)
    print('Request finished:', url)

# Calling the function does not run its body; it returns a coroutine object.
c = get_request('www.1.com')
print(c)
# Output:
# <coroutine object get_request at 0x0000020BA9CB96C8>
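To make the "body is not executed on call" point observable, here is a minimal sketch. The `events` list and the use of `asyncio.run` (Python 3.7+) are assumptions added for illustration and are not part of the original example.

```python
import asyncio

events = []  # records what actually ran, in order (illustrative helper)

async def get_request(url):
    events.append(f"requesting {url}")
    return "done"

# Calling the async function only builds a coroutine object;
# nothing inside the body has run yet.
c = get_request("www.1.com")
events_before_run = list(events)

# Handing the coroutine to an event loop is what executes the body.
result = asyncio.run(c)

print(events_before_run)
print(events)
print(result)
```

The first print shows an empty list, because the body only runs once the coroutine is driven by an event loop.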

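Task objects

Task object (task): a task object is a further wrapper around a coroutine object. Through the task object you can inspect the running state of the coroutine.

A task object must ultimately be registered with an event loop object.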
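As a rough illustration of inspecting a coroutine's state through its task, the sketch below checks `task.done()` before and after the task is awaited. The `main` wrapper, the 0.1 second delay, and the use of `asyncio.run` are assumptions made for this example.

```python
import asyncio

async def get_request(url):
    await asyncio.sleep(0.1)  # stands in for a non-blocking request
    return "hello bobo"

async def main():
    # Wrap the coroutine in a task; it is scheduled on the running loop.
    task = asyncio.ensure_future(get_request("www.1.com"))
    state_before = task.done()  # the task has not finished yet
    await task
    state_after = task.done()   # the task is finished now
    return state_before, state_after, task.result()

state_before, state_after, value = asyncio.run(main())
print(state_before, state_after, value)
```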

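Binding a callback

Binding a callback: a callback function is bound to a task object. It runs only after the special function wrapped by that task has finished executing.

Event loop objects

Event loop object: an object that loops indefinitely. You can also think of it as a kind of container.

This container holds multiple task objects (that is, a set of code blocks waiting to be executed).

Where the asynchrony shows: once the event loop is started, it executes the task objects in order. When a task hits a blocking operation, the event loop does not wait; it moves straight on to the next task.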

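from time import sleep
import asyncio

# Callback function.
# Its only argument is the task object it was bound to.
def callback(task):
    print('i am callback!!1')
    # result() returns the return value of the special function wrapped by the task
    print(task.result())

async def get_request(url):
    print('Requesting:', url)
    sleep(2)
    print('Request finished:', url)
    return 'hello bobo'

# Create an event loop object
# (get_event_loop() without a running loop is deprecated in recent Python versions)
loop = asyncio.new_event_loop()

# Create a coroutine object
c = get_request('www.1.com')
# Wrap it in a task object
task = asyncio.ensure_future(c, loop=loop)

# Bind the callback function to the task object
task.add_done_callback(callback)

# Register the task with the event loop and start the loop
loop.run_until_complete(task)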

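await

await suspends the current coroutine and gives up the CPU so the event loop can run other tasks. You have to add await explicitly in front of each blocking operation.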
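The effect of await can be sketched by timing two tasks with and without it. The function names, the 0.2 second delays, and the use of `asyncio.gather` and `asyncio.run` are assumptions chosen for this illustration.

```python
import asyncio
import time

async def blocking_request(url):
    time.sleep(0.2)           # blocks the whole thread; the loop cannot switch tasks
    return url

async def awaited_request(url):
    await asyncio.sleep(0.2)  # suspends here; the loop runs other tasks meanwhile
    return url

async def run_two(func):
    # Run two requests together and measure the total wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(func("a"), func("b"))
    return time.perf_counter() - start

blocking_elapsed = asyncio.run(run_two(blocking_request))
awaited_elapsed = asyncio.run(run_two(awaited_request))
print(f"blocking: {blocking_elapsed:.2f}s, awaited: {awaited_elapsed:.2f}s")
```

The blocking version takes roughly the sum of both delays, while the awaited version takes roughly one delay, because the two waits overlap.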

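Multi-task async coroutines

Notes:

1. Store the task objects in a list and register that list with the event loop. When registering, the list has to be wrapped with the wait method (asyncio.wait).

2. The special function wrapped by a task must not contain code from modules that do not support async; otherwise the whole asynchronous effect is lost.

   In addition, every blocking operation inside that function must be prefixed with the await keyword.

3. Code that uses the requests module must not appear inside the special function, because requests does not support async.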

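import asyncio
import time

start = time.time()
urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo'
]

# The code to be executed must not use modules that do not support async.
# Every blocking operation inside this function must be prefixed with await.
async def get_request(url):
    print('Requesting:', url)
    await asyncio.sleep(2)
    print('Request finished:', url)
    return 'hello bobo'

# get_event_loop() without a running loop is deprecated in recent Python versions
loop = asyncio.new_event_loop()

tasks = []  # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c, loop=loop)
    tasks.append(task)

# Wrap the task list with asyncio.wait before registering it with the loop
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)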
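On Python 3.7 and later, the same multi-task pattern can also be written with `asyncio.run` and `asyncio.gather` instead of managing the loop and the task list by hand. This is a minimal sketch of that alternative, not the approach used in the listing above; the `main` function and the shortened 0.2 second delay are assumptions for the example.

```python
import asyncio
import time

urls = ["http://localhost:5000/bobo"] * 3

async def get_request(url):
    await asyncio.sleep(0.2)  # stands in for a non-blocking network call
    return "hello bobo"

async def main():
    # gather schedules every coroutine as a task, waits for all of them,
    # and returns the results in the same order as the inputs.
    return await asyncio.gather(*(get_request(url) for url in urls))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

Unlike `asyncio.wait`, `gather` hands back the return values directly, so no callback is needed just to collect results.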

Using the aiohttp module

Sending requests with the requests module alone cannot produce an asynchronous effect, because requests does not support async.

import asyncio
import requests
import time

start = time.time()
urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo'
]

# No asynchronous effect here: requests does not support async,
# so each requests.get call blocks the event loop until it returns.
async def req(url):
    page_text = requests.get(url).text
    return page_text

# get_event_loop() without a running loop is deprecated in recent Python versions
loop = asyncio.new_event_loop()

tasks = []
for url in urls:
    c = req(url)
    task = asyncio.ensure_future(c, loop=loop)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)

Sending requests with the aiohttp module

import asyncio
import time
import aiohttp
from lxml import etree

urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]

# This version does achieve the asynchronous effect.
# Detail: put async in front of every with, and await in front of every blocking step.
async def req(url):
    async with aiohttp.ClientSession() as s:
        # s.get(url) is itself an async context manager, so async with is enough here
        async with s.get(url) as response:
            # response.read() would return bytes instead
            page_text = await response.text()
            return page_text

# Callback: parse the page text returned by the finished task
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)

if __name__ == '__main__':
    start = time.time()
    # get_event_loop() without a running loop is deprecated in recent Python versions
    loop = asyncio.new_event_loop()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c, loop=loop)
        task.add_done_callback(parse)
        tasks.append(task)

    loop.run_until_complete(asyncio.wait(tasks))

    print(time.time() - start)

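Steps for making asynchronous network requests with aiohttp

1. Write the initial skeleton

# Draft only: this synchronous-looking form does not work yet;
# step 2 adds the async and await keywords it needs.
async def req(url):
    with aiohttp.ClientSession() as s:
        with s.get(url) as response:
            # response.read() would return bytes instead
            page_text = response.text()
            return page_text

2. Fill in the details

Add async in front of every with, and add await in front of every blocking operation.

async def req(url):
    async with aiohttp.ClientSession() as s:
        # s.get(url) is itself an async context manager, so async with is enough here
        async with s.get(url) as response:
            # response.read() would return bytes instead
            page_text = await response.text()
            return page_text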
