Python 3 -- Code for Downloading a Novel

Posted by 我是谁9



Preface

This article walks through the thinking behind writing a small novel-scraping crawler. The code is written in Python 3.7.

2020-03-20

天象独行

  Part 1: A procedural approach

  1. First, we pick the target: we will download the novel 《沧元图》 from the page https://www.biqukan.com/38_38836. For now the URL and request headers are hard-coded (later they will be turned into interactive input), together with the libraries the program needs:

from urllib import request
from bs4 import BeautifulSoup
import collections
import re
import os
'''
***********************************
Get the download links
***********************************
'''
url = 'https://www.biqukan.com/38_38836/'
head = {'User-Agent':'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19',}

  2. Looking at the page we see a long table of contents, so we press F12 to inspect the front-end code. We find that the href attribute of each entry holds the path to that chapter. So the plan is: send the request with Python 3's urllib module, then use BeautifulSoup and a regular expression to pull out what we need, namely the chapter titles and their resource paths. See the following code:

 

charter = re.compile(u'[第弟](.+)章', re.IGNORECASE)
target_req = request.Request(url=url, headers=head)          # build the request
#print("type of target_req:", type(target_req))

target_response = request.urlopen(target_req)                # send it and receive the response
#print("type of target_response:", type(target_response))

'''
read() the body and decode it as gbk
'''
target_html = target_response.read().decode('gbk', 'ignore')
#print("type of target_html:", type(target_html))

listmain_soup = BeautifulSoup(target_html, 'lxml')           # parse with the 'lxml' parser
#print(listmain_soup.ul.li)

#chapters is a list
chapters = listmain_soup.find_all('div', class_='listmain')  # find_all() pulls the table-of-contents block out of the page
#print(chapters)

download_soup = BeautifulSoup(str(chapters), 'lxml')         # re-parse that fragment with 'lxml'
#print(download_soup)
#a = str(download_soup.dl.dt).split("》")[0][5:]
#print(a)
novel_name = str(download_soup.dl.dt).split("》")[0][5:]     # extract the novel title
flag_name = "《" + novel_name + "》" + "正文卷"
#print(flag_name)
#print(len(download_soup.dl))
#number of chapters
numbers = (len(download_soup.dl) - 1) / 2 - 8
#print(numbers)

download_dict = collections.OrderedDict()
begin_flag = False
numbers = 1                                                  # reused below as the running chapter counter
#print(download_soup.dl.children)

#Beautiful Soup lets us iterate over a tag's child nodes with .children
#get() reads the value of an attribute
for child in download_soup.dl.children:
    #if a tag has only one NavigableString child, .string returns that child
    if child != '\n':
        if child.string == u"%s" % flag_name:
            begin_flag = True
        if begin_flag == True and child.a != None:
            download_url = "https://www.biqukan.com" + child.a.get('href')
            download_name = child.string
            #print(download_name)
            names = str(download_name).split('章')
            #print(names)
            name = charter.findall(names[0] + '章')
            #print(name)
            if name:
                download_dict['第' + str(numbers) + '章 ' + names[-1]] = download_url
                numbers += 1
            #print(download_url)
#print(novel_name + '.txt', numbers, download_dict)

  3. At this point we have successfully extracted the chapter titles and their download URLs into an ordered dictionary, illustrated below.
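The screenshot of the result is not preserved in this copy, so here is a small hedged sketch of how to peek at the dictionary at this point. The chapter title in the comment is the one quoted later in this article; the chapter id in the URL is a made-up placeholder, not a real address:

#Peek at the first few entries of download_dict (illustrative only)
for key, value in list(download_dict.items())[:3]:
    print(key, '->', value)
#e.g.  第1章 孟川和云青萍 -> https://www.biqukan.com/38_38836/<chapter-id>.html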

 

   4. Next we use a for loop to take each chapter name and download URL in turn and write each chapter into its own file. See the following code:

'''
***************************************
Fetch each download URL and save its content
***************************************
'''
#print(download_dict['第1章  孟川和云青萍'])
print("下载开始>>>>")
for key, value in download_dict.items():
    name_key = key
    path = '/home/aaron/桌面/novel' + '/' + name_key
    download_req_key = request.Request(url=value, headers=head)
    download_response_value = request.urlopen(download_req_key)
    download_html = download_response_value.read().decode('gbk', 'ignore')
    soup_texts_value = BeautifulSoup(download_html, 'lxml')
    texts = soup_texts_value.find_all(id='content', class_='showtxt')
    '''
    Notes:
        1. .text extracts the plain text
        2. replace() swaps every occurrence of the old substring for the new one; an optional third argument caps the number of replacements
    '''
    texts = BeautifulSoup(str(texts), 'lxml').div.text.replace('app2();', '')
    texts = texts.encode()
    #print(type(texts))
    file = os.open(path, os.O_CREAT | os.O_RDWR | os.O_APPEND)
    os.write(file, texts)
    os.close(file)
    print("%s 下载完成" % name_key)

  5. Program output:
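The run screenshot is likewise not preserved; judging from the print statements above, the console output looks roughly like this (only the first chapter title is taken from this article, the rest is elided):

下载开始>>>>
第1章 孟川和云青萍 下载完成
第2章 ... 下载完成
...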

 

   Part 2: An object-oriented approach

  1. The program above already does the job, but it has no user interaction and is not very readable, so let us restructure it a little. First, the necessary, reusable pieces of functionality are moved into methods so they can be called where needed. As follows:

# -*- coding:UTF-8 -*-
from urllib import request
from bs4 import BeautifulSoup
import collections
import re
import os
import sys

#chapter-title pattern, same as in Part 1
charter = re.compile(u'[第弟](.+)章', re.IGNORECASE)

'''
Class: downloads novels from the 笔趣看 site: https://www.biqukan.com/
'''
class download(object):
    def __init__(self,target):
        self.target_url = target
        self.head = {'User-Agent':'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19',}
    '''
    Method: collect the download links
    Parameters: none
    Returns:
        1. file name
        2. number of chapters
        3. ordered dict mapping chapter name to download URL
    '''
    def download_url(self):
        download_req = request.Request(url=self.target_url, headers = self.head)
        download_response = request.urlopen(download_req)
        download_html = download_response.read().decode('gbk','ignore')
        '''
        1. The HTML has been fetched with urllib; next we parse it with lxml so that the parts we need are easy to extract
        '''
        listmain_soup = BeautifulSoup(download_html,'lxml')
        '''
        2. With BeautifulSoup and the 'lxml' parser, pull out the table-of-contents block
        '''
        chapters = listmain_soup.find_all('div',class_ = 'listmain')
        download_soup = BeautifulSoup(str(chapters),'lxml')
        '''
        3. Extract the novel title
        '''
        novel_name = str(download_soup.dl.dt).split("》")[0][5:]
        '''
        4. Set up the variables the for loop needs
        '''
        flag_name = '《' + novel_name + '》' + '正文卷'
        begin_flag = False    #marks where the chapter list actually starts
        download_dict = collections.OrderedDict()  #ordered dict holding chapter names and download URLs
        numbers = 1 #running chapter counter

        '''
        5. Use a for loop to extract each chapter name and its download URL
        '''
        for child in download_soup.dl.children:
            if child != '\n':  #skip bare newlines
                if child.string == u"%s" % flag_name: #found the start of the chapter list
                    begin_flag = True
                if begin_flag == True and child.a != None: #only rows that carry a download link are kept
                    download_url = "https://www.biqukan.com" + child.a.get('href') # get() reads the attribute value
                    download_name = child.string
                    '''
                    6. Rebuild the chapter file name
                    '''
                    names = str(download_name).split('章')
                    name = charter.findall(names[0] + '章')
                    '''
                    7. Store the chapter name and its download URL in the dict
                    '''
                    if name:
                        download_dict['第' + str(numbers) + '章 ' + names[-1]] = download_url
                        numbers += 1
        return novel_name + '.txt' , numbers , download_dict

    '''
    Method: fetch the text of one chapter
    Parameters:
        url -- the chapter's download address
    Returns:
        soup_text -- the chapter text
    '''
    def Downloader(self,url):
        download_req = request.Request(url=url, headers = self.head)
        download_response = request.urlopen(download_req)
        download_html = download_response.read().decode('gbk','ignore')
        soup_texts = BeautifulSoup(download_html,'lxml')
        texts = soup_texts.find_all(id='content', class_='showtxt')
        soup_text = BeautifulSoup(str(texts), 'lxml').div.text.replace('app2();', '')
        return soup_text

    '''
    Method: write the downloaded chapter text into the output file
    '''
    def writer(self,name,path,text):
        write_flag = True  #controls which characters get written
        '''
        1. Open the file with a with statement, bound to "f"
        '''
        with open(path,'a',encoding='utf-8') as f:
            f.write(name + '\n\n')
            for each in text:
                if each == 'h':       #stop writing once an 'h' appears (filters out the trailing link text on the page)
                    write_flag = False
                if write_flag == True and each != ' ':
                    f.write(each)
                if write_flag == True and each == '\r':
                    f.write('\n')
            f.write('\n\n')

if __name__ == "__main__":
    target_url = str(input("请输入小说目录下载地址:\n"))
    d = download(target=target_url)
    name , numbers, url_dict = d.download_url()
    if name in os.listdir():
        os.remove(name)      #start from a clean file if an old copy exists
    index = 1
    print("《%s》下载中:" % name[:-4])

    for key,value in url_dict.items():
        d.writer(key,name,d.Downloader(value))
        sys.stdout.write("已下载:%.3f%%" % (index / numbers * 100) + '\r')
        sys.stdout.flush()
        index += 1

    print("《%s》下载完成!" % name[:-4])

 
