Scraping novels from a website with python3 + beautifulSoup4.6: basic function design

Posted 2020-10-30 姚毛毛-linuxido.com


In this chapter:
1. Reading a page and restoring its encoding
2. Function design

Step 1: Reading a page and restoring its encoding


Target page for this scrape:

http://www.cuiweijuxs.com/jingpinxiaoshuo/

Fetch it with the code from the first article:

# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    chapter_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    # Send a browser-like User-Agent with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chapter_url, headers=headers)
    response = request.urlopen(req)
    # read() returns raw bytes, not str
    html = response.read()
    print(html)

This prints:

b'<!doctype html>\r\n<html>\r\n<head>\r\n<title>\xbe\xab\xc6\xb7\xd0\xa1\xcb\xb5_………………

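The output is undecoded bytes, which is an encoding problem. I already covered this in the article about garbled file names when unzipping with zipfile. The first thing to do is look at the head of the HTML page, which shows that the encoding is gbk.

See http://www.cnblogs.com/yaoshen/p/8671344.html for details.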
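Looking at the page head can also be done in code. The following is a minimal sketch, not part of the original scraper: the helper name `sniff_meta_charset` and the sample bytes are made up for illustration, and the regular expression only handles the two common `<meta>` forms.

```python
import re

# Matches charset=... inside a <meta> tag, for example
#   <meta charset="gbk">
#   <meta http-equiv="Content-Type" content="text/html; charset=gbk">
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw_html, default=None):
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Only scan the first 2 KB; the declaration normally sits in <head>.
    match = META_CHARSET.search(raw_html[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# Hypothetical page head, invented for this example.
sample = b'<!doctype html>\r\n<html>\r\n<head>\r\n<meta charset="gbk">\r\n'
print(sniff_meta_charset(sample))
```

The function works on the raw bytes returned by `response.read()`, so it can run before any decoding is attempted.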

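Another way is to detect the encoding in code with chardet (a third-party library, so it has to be installed first):

import chardet

# html is the bytes object returned by response.read()
charset = chardet.detect(html)
print(charset)
Detected result: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}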

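Decoding with GB2312 runs into problems, though. After trying both, GBK works better because it covers more characters (GBK is a superset of GB2312).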
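A small sketch of why GBK is the safer choice. The title bytes are the ones from the printed output earlier; the character '镕' is picked for this example because GBK includes it and GB2312 does not.

```python
# The title bytes from the page decode identically under both codecs,
# because these characters exist in GB2312 as well as GBK.
title = b"\xbe\xab\xc6\xb7\xd0\xa1\xcb\xb5"
print(title.decode("gb2312"))
print(title.decode("gbk"))

# '镕' exists in GBK but not in GB2312 (character chosen for this example).
extra = "镕".encode("gbk")
print(extra.decode("gbk"))
try:
    extra.decode("gb2312")
except UnicodeDecodeError as exc:
    # The gb2312 codec rejects byte sequences outside its smaller table.
    print("gb2312 failed:", exc.reason)
```

So when chardet reports GB2312, decoding with GBK reads the same text and also copes with the extra characters.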

The decoding code becomes:

    # try:
    html = html.decode('GBK')
    # except:
    #     html = html.decode('utf-8')
    print(html)
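The commented-out try/except hints at a fallback: try one encoding and move on to the next if it fails. Below is one way that could look. `decode_html` is a hypothetical helper, and the final step that substitutes undecodable bytes is my own assumption rather than something the original code does.

```python
def decode_html(raw_html, encodings=("gbk", "utf-8")):
    """Decode bytes with the first encoding that succeeds (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw_html.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace bad bytes
    # with U+FFFD instead of raising.
    return raw_html.decode(encodings[0], errors="replace")


print(decode_html("精品小说".encode("gbk")))
```

One caveat: UTF-8 bytes can sometimes decode under GBK without raising an error and simply come out garbled, so when the encoding is genuinely unknown it may be safer to pass `encodings=("utf-8", "gbk")`.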

Full code:

# -*- coding: UTF-8 -*-
from urllib import request
import chardet


if __name__ == "__main__":
    chapter_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chapter_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)

    # Check the page encoding
    charset = chardet.detect(html)
    print(charset)

    # Decode and print the page content
    # try:
    html = html.decode('GBK')
    # except:
    #     html = html.decode('utf-8')
    print(html)


Step 2: Basic function design

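Create a class, Capture, and define its basic methods: initialization (__init__), reading (readHtml), and saving (saveHtml). Then add a run method that ties them together,

and finally start everything with Capture().run().
(1) The __init__ method (double underscores on each side) initializes the parameters: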

    def __init__(self):
        # URL to scrape
        self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        # Request headers
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'}

(2) Wrap the page read in a method that returns the decoded HTML string:

    def readHtml(self):
        # Some sites (CSDN, for example) cannot be accessed without changing the User-Agent
        # Create the Request object
        print(self.init_url)
        req = request.Request(self.init_url, headers=self.head)
        # Pass in the Request object created above
        response = request.urlopen(req)
        # Read the response and decode it
        html = response.read().decode('GBK')
        # Print the content
        print(html)
        return html

(3) Write the page that was read to a file as utf-8:

    def saveHtml(self, file_name, file_content):
        # The with block closes the file even if write() raises
        with open(file_name, 'w', encoding='utf-8') as file_object:
            file_object.write(file_content)

(4) The run method reads the page and then saves it:

    def run(self):
        try:
            html = self.readHtml()
            self.saveHtml('test.html', html)
        except BaseException as error:
            print(error)


Capture().run()
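The run method above catches BaseException, which also covers KeyboardInterrupt and SystemExit. As a variation, not what the original code does, the sketch below catches only the errors a fetch is likely to raise. `fetch_bytes` and its timeout value are hypothetical choices made for this example.

```python
from urllib import error, request


def fetch_bytes(url, headers, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        req = request.Request(url, headers=headers)
        with request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 403 or 404.
        print(f"HTTP {exc.code} for {url}")
    except (OSError, ValueError) as exc:
        # OSError covers URLError and socket timeouts; ValueError is raised
        # for a malformed URL.
        print(f"could not fetch {url}: {exc}")
    return None


# A malformed URL is rejected before any network traffic happens.
print(fetch_bytes("not-a-url", headers={}))
```

With this shape, Ctrl+C still interrupts a long scrape, while network failures are reported and the caller gets None back.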

Full code:

# -*- coding: UTF-8 -*-
from urllib import request


class Capture:

    def __init__(self):
        # URL to scrape
        self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        # Request headers
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'}

    def readHtml(self):
        # Some sites (CSDN, for example) cannot be accessed without changing the User-Agent
        # Create the Request object
        print(self.init_url)
        req = request.Request(self.init_url, headers=self.head)
        # Pass in the Request object created above
        response = request.urlopen(req)
        # Read the response and decode it
        html = response.read().decode('GBK')
        # Print the content
        print(html)
        return html

    def saveHtml(self, file_name, file_content):
        # The with block closes the file even if write() raises
        with open(file_name, 'w', encoding='utf-8') as file_object:
            file_object.write(file_content)

    def run(self):
        try:
            html = self.readHtml()
            self.saveHtml('test.html', html)
        except BaseException as error:
            print(error)


Capture().run()
