教你用Python写一个爬虫，免费看小说

Posted 2023-03-31

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了教你用Python写一个爬虫，免费看小说相关的知识，希望对你有一定的参考价值。

参考技术A 这是一个练习作品。用python脚本爬取笔趣阁上面的免费小说。

环境：python3
类库：BeautifulSoup
数据源： http://www.biqukan.cc

原理就是伪装正常http请求，正常访问网页。然后通过bs4重新解析html结构来提取有效数据。

包含了伪装请求头部，数据源配置（如果不考虑扩展其他数据源，可以写死）。

config.py文件

fiction.py文件

summary.py文件

catalog.py文件

article.py文件

暂没有做数据保存模块。如果需要串起来做成一个完整的项目的话，只需要把小说数据结构保存即可（节省磁盘空间）。通过小说url可以很快速的提取出小说简介、目录、每一章的正文。

如果想要做的更好，可以把目录，介绍、正文等部分缓存起来，当然得有足够的空间。

教你用Python批量爬取小说！这年头了谁看小说还充钱啊!

爬取小说的思路

首先获取小说的地址。
分析目录地址结构。
进行地址的拼接。
分析章节内容结构。
获取并保存文本。
完整代码

1.获取小说地址

加载需要的包：

import re
from bs4 import BeautifulSoup as ds
import requests

获取小说目录文件，返回<Response [200]>，表示可正常爬取该网页

base_url='https://www.soshuw.com/XuLiangShangYouWangFei/'
chapter_html=requests.get(base_url)
print(chapter_html)

2.分析小说地址结构

解析目录网页 , 输出结果为目录网页的源代码

chapter_page_html=ds(chapter_page,'lxml')
print(chapter_page)

打开目录网页，发现在正文的目录前面有一个最新章节目录（这里有九个章节），再完整的目录中是包含最新章节的，所以这里最新章节是不需要的。在这里插入图片描述
在网页单击右键选择“检查”（或者“属性”，不同的浏览器的叫法不一致，我用的是IE）选择“元素”列，鼠标再右侧代码块上移动时。左侧网页会高亮显示其对应网页区域，找到完整目录对应的代码块。如下图：在这里插入图片描述
完整目录的锚有两个，分别是class="novel_list"和id=“novel108799”,仔细观察后发现class不唯一，所以我们选用id提取该块内容

将完整目录块提取出来

chapter_novel=chapter_page.find(id="novel108799")
print(chapter_novel)

结果如下（仅部分结果）：在这里插入图片描述
对比小说章节内容网址和目录网址（base_url）发现，我们只需要将base_url和章节内容网址的后半段拼接到一起就可以得到完整的章节内容网址

3.拼接地址

利用正则语言库将地址后半段提取出来

chapter_novel_str=str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
print(chapter_href_list)

拼接url:
定义一个列表chapter_url_list接收完整地址

chapter_url_list = []
for i in chapter_href_list:
    url=base_url+i
    chapter_url_list.append(url)
print(chapter_url_list)

4.分析章节内容结构

打开章节，右键→“属性”，查看内容结构，发现小说正文有class和id两个锚，class是不变的，id随着章节而变化，所以我们用class提取正文
在这里插入图片描述
提取正文段

chapter_novel=chapter_page.find(id="novel108799")
print(chapter_novel)

提取正文文本和标题

body_html=requests.get('https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html')
body_page=ds(body_html.content,'lxml')
body = body_page.find(class_='content')
body_content=str(body)
print(body_content)
body_regx='<br/>    (.*?)\\n'
content_list=re.findall(body_regx,body_content)
print(content_list)
title_regx = '<h1>(.*?)</h1>'
title = re.findall(title_regx, body_html.text)
print(title)

5.保存文本

with open('1.txt', 'a+') as f:
    f.write('\\n\\n')
    f.write(title[0] + '\\n')
    f.write('\\n\\n')
    for e in content_list:
        f.write(e + '\\n')
print('{} 爬取完毕'.format(title[0]))

6.完整代码

import re
from bs4 import BeautifulSoup as ds
import requests
base_url='https://www.soshuw.com/XuLiangShangYouWangFei'
chapter_html=requests.get(base_url)
chapter_page=ds(chapter_html.content,'lxml')
chapter_novel=chapter_page.find(id="novel108799")
#print(chapter_novel)
chapter_novel_str=str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
#print(chapter_href_list)
chapter_url_list = []
for i in chapter_href_list:
    url=base_url+i
    chapter_url_list.append(url)
#print(chapter_url_list)

for u in chapter_url_list:
    body_html=requests.get(u)
    body_page=ds(body_html.content,'lxml')
    body = body_page.find(class_='content')
    body_content=str(body)
   # print(body_content)
    body_regx='<br/>    (.*?)\\n'
    content_list=re.findall(body_regx,body_content)
    #print(content_list)
    title_regx = '<h1>(.*?)</h1>'
    title = re.findall(title_regx, body_html.text)
    #print(title)
    with open('1.txt', 'a+') as f:
        f.write('\\n\\n')
        f.write(title[0] + '\\n')
        f.write('\\n\\n')
        for e in content_list:
            f.write(e + '\\n')
    print('{} 爬取完毕'.format(title[0]))

最后说明下此文仅作学习交流，不可商用

如有疑问，欢迎在评论区一起讨论！

如有不正确的地方，欢迎指导！

以上是关于教你用Python写一个爬虫，免费看小说的主要内容，如果未能解决你的问题，请参考以下文章