A Lightweight Asynchronous Micro-Crawler Framework Based on asyncio

Posted by 明柳梦少


Installation

aspider is an asynchronous web-scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

# For Linux & Mac
pip install -U aspider[uvloop]

# For Windows
pip install -U aspider

# New features
pip install git+https://github.com/howie6879/aspider

Usage

Request and Response

We provide an easy way to request a URL, which returns a friendly response:

import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO  <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
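Under the hood this is just an ordinary coroutine driven by the event loop. A minimal sketch of the same pattern, with a stub coroutine standing in for the real network call (the fetch function and the dict it returns are illustrative, not aspider's API):

```python
import asyncio

# Stub coroutine standing in for Request.fetch(); the dict it returns
# is illustrative, not aspider's actual Response object.
async def fetch(url):
    await asyncio.sleep(0)  # simulate waiting on asynchronous I/O
    return {"url": url, "status": 200}

# asyncio.run is the modern equivalent of the
# get_event_loop().run_until_complete call used above.
response = asyncio.run(fetch("https://news.ycombinator.com/"))
print(response["status"])  # 200
```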

JavaScript support:

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)

Note that the first time load_js is used, it will download a recent version of Chromium (~100 MB). This only happens once.

Item

Let's look at a quick example of using Item to extract target data. First, add the following to demo.py:

import asyncio

from aspider import AttrField, TextField, Item

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value

items = asyncio.get_event_loop().run_until_complete(
    HackerNewsItem.get_items(url="https://news.ycombinator.com/")
)
for item in items:
    print(item.title, item.url)

Run: python demo.py

Notorious ‘Hijack Factory’ Shunned from Web https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/
 .....
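The clean_title hook above is a pass-through, but any field can define a clean_&lt;field&gt; coroutine to post-process its extracted value. A rough sketch of the dispatch idea in plain Python, with stand-ins (FakeItem and apply_clean are illustrative, not part of aspider):

```python
import asyncio

# Stand-in class illustrating the clean_<field> naming convention;
# aspider's real Item performs the equivalent lookup internally.
class FakeItem:
    async def clean_title(self, value):
        return value.strip()

async def apply_clean(item, field, raw):
    # Look up a clean_<field> coroutine; fall back to the raw value.
    hook = getattr(item, f"clean_{field}", None)
    return await hook(raw) if hook else raw

cleaned = asyncio.run(apply_clean(FakeItem(), "title", "  Hacker News  "))
print(cleaned)  # Hacker News
```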

Spider

For crawling multiple pages, you can use Spider.

import aiofiles

from aspider import AttrField, TextField, Item, Spider

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value

class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.html)
        for item in items:
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')

if __name__ == '__main__':
    HackerNewsSpider.start()
Save the code as hacker_news_spider.py and run it:

[2018-07-11 17:50:12,430]-aspider-INFO  Spider started!
[2018-07-11 17:50:12,430]-Request-INFO  <GET: https://news.ycombinator.com/>
[2018-07-11 17:50:12,456]-Request-INFO  <GET: https://news.ycombinator.com/news?p=2>
[2018-07-11 17:50:14,785]-aspider-INFO  Time usage: 0:00:02.355062
[2018-07-11 17:50:14,785]-aspider-INFO  Spider finished!

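Both start_urls finish within the same ~2 seconds because they are fetched concurrently on one event loop. A sketch of that scheduling idea with asyncio.gather and stub fetches (fetch and crawl here are illustrative, not Spider internals):

```python
import asyncio

# Stub fetch standing in for a real HTTP request.
async def fetch(url):
    await asyncio.sleep(0.01)  # pretend network latency
    return f"<html>{url}</html>"

async def crawl(urls):
    # gather schedules all fetches concurrently on one event loop.
    return await asyncio.gather(*(fetch(u) for u in urls))

start_urls = ['https://news.ycombinator.com/',
              'https://news.ycombinator.com/news?p=2']
pages = asyncio.run(crawl(start_urls))
print(len(pages))  # 2
```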

TODO

  • Custom middleware
  • JavaScript support
  • Friendly response

Contributing

  • Pull Request
  • Open Issue

Thanks

demiurge

https://github.com/matiasb/demiurge
