Scraping novels from Qidian (起点中文网) with etree, XPath, and os

Posted 2021-02-10 fodalaoyao

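This article is mainly about using the lxml library's etree module to parse pages and XPath expressions to extract data from them. It also uses the os library when writing the results to files.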
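To see how etree and XPath work together before running anything against the live site, here is a minimal sketch that parses an inline HTML string. The snippet `SAMPLE_HTML`, the `book.example.com` links, and the helper name `extract_books` are made up for illustration. Only the `book-mid-info` class name and the two XPath expressions are taken from the spider in this article.

```python
from lxml import etree

# Hypothetical HTML that imitates the listing structure the spider expects.
SAMPLE_HTML = """
<div class="book-mid-info"><h4><a href="//book.example.com/info/1">Book One</a></h4></div>
<div class="book-mid-info"><h4><a href="//book.example.com/info/2">Book Two</a></h4></div>
"""


def extract_books(html_text):
    """Return (link, title) pairs for every book entry in the HTML text."""
    tree = etree.HTML(html_text)  # parse the string into an element tree
    # "//" matches elements anywhere in the document, "@href" selects an attribute
    links = tree.xpath('//div[@class="book-mid-info"]/h4/a/@href')
    # "text()" selects the text content of the matched <a> elements
    titles = tree.xpath('//div[@class="book-mid-info"]/h4/a/text()')
    return list(zip(links, titles))


books = extract_books(SAMPLE_HTML)
for link, title in books:
    print(title, link)
```

Both XPath calls return plain lists in document order, which is why the spider can pair links with titles using zip.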

import os
import requests
from lxml import etree  # lxml parses HTML/XML files so the wanted data can be extracted
# Design pattern: object-oriented
class Spider(object):
    def start_request(self):
        # 1. Request the listing page, extract each novel's title and link, and create a folder per novel
        response = requests.get('https://www.qidian.com/all')
        # print(response.text)
        html = etree.HTML(response.text)  # parse the HTML text into an element tree
        Bigsrc_list = html.xpath('//div[@class="book-mid-info"]/h4/a/@href')
        # //: selects matching elements anywhere in the document;  @: selects an attribute
        Bigtit_list = html.xpath('//div[@class="book-mid-info"]/h4/a/text()')
        # print(Bigsrc_list, Bigtit_list)
        for Bigsrc, Bigtit in zip(Bigsrc_list, Bigtit_list):
            if not os.path.exists(Bigtit):  # create a folder named after the novel if it does not exist yet
                os.mkdir(Bigtit)
            # print(Bigsrc, Bigtit)
            self.file_data(Bigsrc, Bigtit)  # call the next step

    def file_data(self, Bigsrc, Bigtit):
        # 2. Request the novel's page, then extract the chapter titles and chapter links
        response = requests.get("https:" + Bigsrc)  # the href has no scheme, so prepend "https:"
        html = etree.HTML(response.text)  # etree.HTML parses an HTML string into an element tree
        listsrc_list = html.xpath('//ul[@class="cf"]/li/a/@href')
        listtit_list = html.xpath('//ul[@class="cf"]/li/a/text()')
        for Listsrc, Listtit in zip(listsrc_list, listtit_list):
            # print(Listsrc, Listtit)
            self.finally_file(Listsrc, Listtit, Bigtit)

    def finally_file(self, Listsrc, Listtit, Bigtit):
        # 3. Request the chapter page, extract its text, and save it to a file in the novel's folder
        response = requests.get("https:" + Listsrc)
        html = etree.HTML(response.text)  # parse the HTML text into an element tree
        content = "\n".join(html.xpath('//div[@class="read-content j_readContent"]/p/text()'))
        # S.join() returns a single string with S as the separator between elements
        file_name = os.path.join(Bigtit, Listtit + ".txt")
        # the file Listtit.txt inside the Bigtit folder
        print("Saving file " + file_name)
        with open(file_name, "a", encoding="utf-8") as f:
            f.write(content)
if __name__ == '__main__':
    spider = Spider()
    spider.start_request()
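The spider builds URLs by prepending "https:" and builds file paths straight from the scraped titles. The sketch below shows one possible way to make both steps a little more robust. The helper names `absolute_url`, `safe_name`, and `chapter_path` are hypothetical and not part of the original script. It also assumes that titles may contain characters that are not allowed in file names, which the original does not check for.

```python
import os
import re
from urllib.parse import urljoin

BASE_URL = "https://www.qidian.com/all"  # listing page used by the spider


def absolute_url(href, base=BASE_URL):
    """Resolve a scheme-less href such as //host/path against the base URL."""
    return urljoin(base, href)


def safe_name(name):
    """Replace characters that are invalid in Windows file names (assumed policy)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    return cleaned or "untitled"


def chapter_path(book_title, chapter_title, root="."):
    """Build root/<book>/<chapter>.txt with the separator of the current OS."""
    return os.path.join(root, safe_name(book_title), safe_name(chapter_title) + ".txt")


print(absolute_url("//book.qidian.com/info/1"))  # https://book.qidian.com/info/1
print(chapter_path("A/B", "Ch 1: x"))
```

Before writing to the path, calling os.makedirs(os.path.dirname(path), exist_ok=True) would create the folder only if it is missing, which covers the same case as the os.path.exists check in start_request.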
