异步爬虫圆我四大名著《西游记》之梦

Posted 2021-08-18 ZSYL

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了异步爬虫圆我四大名著《西游记》之梦相关的知识，希望对你有一定的参考价值。

名著《西游记》爬虫实战

前言
1. 获取所有章节cid
2. 获取单个章节内容
3. 完整代码

前言

一直都想找个机会好好读读四大名著，今天我下载了 《西游记》 全部章节，一起开始阅读吧！

1. 获取所有章节cid

名著所在网址：https://dushu.baidu.com/pc/detail?gid=4306063500

在这里插入图片描述

当我们检查网页源代码时，未发现章节信息，所以该数据可能是服务器二次发送的。

在这里插入图片描述

F12，打开开发者工具，Network，刷新进行Fetch/XHR数据抓包。

在这里插入图片描述

找到章节信息后，copy URL，可以看到请求小说的章节数据与book_id有关

在这里插入图片描述

# 获取所有章节的cid
async def getCatalog(url):
    resp = requests.get(url)
    dict = resp.json()
    # 创建任务列表
    tasks = []
    for item in dict['data']['novel']['items']:  # item就是对应每一个章节的名称和cid
        title = item['title']
        cid = item['cid']
        print(title, cid)
        tasks.append(aiodownload(book_id, cid, title))
    await asyncio.wait(tasks)

在这里用到异步协程，提高效率，方法前用async修饰，await，线程阻塞，释放CPU资源。

2. 获取单个章节内容

进入某一章节，进行分析，同样 F12，Network，数据抓包，进行缜密分析，终于找到文本内容。

在这里插入图片描述

进入Headers，copyURL，获取requests.get().json()的数据。

在这里插入图片描述

# 准备异步任务
async def aiodownload(book_id, cid, title):
    data = {
        "book_id": book_id,
        "cid": f"{book_id}|{cid}",
        "need_bookinfo": 1
    }
    # data需要为json()
    data = json.dumps(data)

    url = f"https://dushu.baidu.com/api/pc/getChapterContent?data={data}"
    # 获取协程请求对象
    async with aiohttp.ClientSession() as session:  # requests
        async with session.get(url) as resp:
            dic = await resp.json()  # 阻塞

            # 获取异步io文件流
            async with aiofiles.open('./novels/'+title+'.txt', mode='w', encoding='utf-8') as f:
                await f.write(dic['data']['novel']['content'])  # 把小说内容写出
                print(title, '下载完成!')

3. 完整代码

# 小说url
# url = 'https://dushu.baidu.com/api"/pc/getCatalog?data:{"book_id":"4306063500"}' =>所有章节名称链接
# 章节内部内容url
# url = https://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4306063500", "cid":"4306063500"|11348571","need_bookinfo":"1"}

import requests
import asyncio
import aiohttp
import aiofiles
import json
"""
1. 同步操作，访问getCatalog 拿到所有章节的cid和名称
2. 异步操作，访问getChapterContent 下载所有的文章内容
"""

# 准备异步任务
async def aiodownload(book_id, cid, title):
    data = {
        "book_id": book_id,
        "cid": f"{book_id}|{cid}",
        "need_bookinfo": 1
    }
    # data需要为json()
    data = json.dumps(data)

    url = f"https://dushu.baidu.com/api/pc/getChapterContent?data={data}"
    # 获取协程请求对象
    async with aiohttp.ClientSession() as session:  # requests
        async with session.get(url) as resp:
            dic = await resp.json()  # 阻塞

            # 获取异步io文件流
            async with aiofiles.open('./novels/'+title+'.txt', mode='w', encoding='utf-8') as f:
                await f.write(dic['data']['novel']['content'])  # 把小说内容写出
                print(title, '下载完成!')


# 获取所有章节的cid
async def getCatalog(url):
    resp = requests.get(url)
    dict = resp.json()
    # 创建任务列表
    tasks = []
    for item in dict['data']['novel']['items']:  # item就是对应每一个章节的名称和cid
        title = item['title']
        cid = item['cid']
        print(title, cid)
        tasks.append(aiodownload(book_id, cid, title))
    await asyncio.wait(tasks)


if __name__ == '__main__':
    # book_id
    book_id = "4306063500"  # 西游记
    url = 'https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"' + book_id + '"}'
    # 启动协程
    asyncio.run(getCatalog(url))

在这里插入图片描述

在这里插入图片描述
加油!

感谢!

努力!

以上是关于异步爬虫圆我四大名著《西游记》之梦的主要内容，如果未能解决你的问题，请参考以下文章

西游记

四大名著知识图谱可视化

java四大名著是哪些书

用python爬取去哪儿游记攻略为十月假期做准备。。。爬虫之路，永无止境！

从古代名著看领导与被领导的艺术

Python爬虫实战，Scrapy实战，爬取旅行家游记信息