A Python Crawler Case Revisited, with Updates Guaranteed for 5 Years: Great Quotes Galore

Posted 2022-12-16 梦想橡皮擦


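A fellow CSDN user recently reported that the target image site used in the post "Crawler 120 Examples, No. 17: Collecting All Kinds of Great Sentences with Python, the Object-Oriented Way" no longer works. After checking, 橡皮擦 found that the site is no longer being operated, so this post updates the case.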

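If you find other sites in a similar state after subscribing, contact 橡皮擦 right away. Every crawler comes with a 5-year warranty.
Copyright notice: everything in this case is for learning purposes only. Do not use it commercially. If anything infringes your rights, please get in touch promptly.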

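⚡⚡ Study notes ⚡⚡

The article automatically omits the http and https protocol prefixes; add them to the addresses yourself when following along.
The target site's domain is qunzou.com. Below it is written as 橡皮擦 throughout; substitute the real domain yourself when following along.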

Table of contents

- ⛳️ Background
- ⛳️ Hands-on coding

⛳️ Background

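The owner of the site used in the original case has stopped operating it, and it has turned into an ad site, which makes it a poor learning target. So this case is updated here. All the techniques used are the same as in the original post, so you can compare the two as you study.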

The third-party Python libraries used in this case are requests and lxml, and the code is written in an object-oriented style.

The target site's pagination rule is shown below (for the site address, see the study notes above):

www.橡皮擦/xuexi/list_1_1.html
www.橡皮擦/xuexi/list_1_2.html
……
www.橡皮擦/xuexi/list_1_n.html
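The pagination rule above can be sketched as a small URL generator. This is only an illustration: the helper name `build_list_urls` is hypothetical, and the template assumes a `{num}` placeholder that `str.format` fills in.

```python
# Template for list pages; "{num}" is a named placeholder filled by str.format.
URL_TEMPLATE = "https://www.qunzou.com/xuexi/list_1_{num}.html"


def build_list_urls(last_page):
    """Return the list-page URLs for pages 1..last_page (inclusive)."""
    return [URL_TEMPLATE.format(num=i) for i in range(1, last_page + 1)]


# Example: generate the first three list pages.
for url in build_list_urls(3):
    print(url)
```

Keeping the template in one place means only a single string needs to change if the site alters its URL scheme.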

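Whether a page is the last one can be determined by checking whether the "next page" button exists. Note that the code below instead reads the total page count from the pagination info on the first page.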
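Here is a minimal sketch of the "next page" check, written with the standard library's `html.parser` so it runs without extra installs. The link text 下一页 and the sample markup are assumptions for illustration; the real site's pagination markup may differ, and the names `NextLinkFinder` and `has_next_page` are hypothetical.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Records whether the page contains an <a> whose text equals the label."""

    def __init__(self, label="下一页"):
        super().__init__()
        self.label = label
        self.in_anchor = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # Only text that sits inside an <a> element counts as a button label.
        if self.in_anchor and data.strip() == self.label:
            self.found = True


def has_next_page(html_text, label="下一页"):
    """Return True if html_text contains a link labelled with `label`."""
    finder = NextLinkFinder(label)
    finder.feed(html_text)
    finder.close()
    return finder.found


# Assumed sample markup, not taken from the real site.
print(has_next_page('<div><a href="list_1_2.html">下一页</a></div>'))
```

A crawler built on this check would keep requesting the following page until `has_next_page` returns False.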

⛳️ Hands-on coding

First, collect the URLs of all list pages. The code is below, with the key steps explained in the comments.

import requests
from lxml import etree
import random


class Spider16:
    def __init__(self):

        self.wait_urls = ["https://www.qunzou.com/xuexi/list_1_1.html"]
        self.url_template = "https://www.qunzou.com/xuexi/list_1_{num}.html"
        self.details = []

    # Pick a random User-Agent and build the request headers
    def get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
            "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
            "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
            "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
            "Sosospider+(+http://help.soso.com/webspider.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

    # Build the list of pages to crawl
    def create_urls(self):
        headers = self.get_headers()
        page_url = self.wait_urls[0]
        res = requests.get(url=page_url, headers=headers, timeout=5)
        html = etree.HTML(res.text)
        # Extract the total page count
        last_page = html.xpath("//span[@class='pageinfo']/strong[1]/text()")[0]
        # Page 1 is already in wait_urls, so start from page 2 to avoid crawling it twice
        for i in range(2, int(last_page) + 1):
            self.wait_urls.append(self.url_template.format(num=i))


    def run(self):
        self.create_urls()

if __name__ == '__main__':
    s = Spider16()
    s.run()

Once all the target pages have been generated, the detail-page URLs can be extracted from each list page. The code is as follows.

    def get_html(self):
        for url in self.wait_urls:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            if res:
                html = etree.HTML(res.text)
                # Extract the detail-page URLs
                detail_link_list = html.xpath("//div[@class='list']//h6/a/@href")
                for d in detail_link_list:
                    self.details.append(f"https://www.qunzou.com{d}")
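The method above builds absolute URLs by prepending the domain to each extracted href. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves hrefs against a base URL and also copes with hrefs that are already absolute. The helper name `to_absolute` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.qunzou.com"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against base; absolute hrefs are returned unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Assumed sample hrefs, not taken from the real site.
print(to_absolute(["/xuexi/123.html", "https://example.com/a.html"]))
```

Plain string concatenation is fine while every href starts with `/`; `urljoin` is simply more forgiving if the site ever mixes relative and absolute links.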

With the detail-page URLs collected, the last step is to fetch the data from the inner pages. Continue filling in the code.

    def get_detail(self):
        for url in self.details:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            res.encoding = "gb2312"
            if res:
                html = etree.HTML(res.text)
                # Collect the sentences
                sentences = html.xpath("//div[@id='content']//p/text()")
                # Join the sentences and print them
                long_str = "\n".join(sentences)

                print(long_str)
                # with open("sentences.txt", "a+", encoding="utf-8") as f:
                #     f.write(long_str)
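Text nodes returned by an XPath `text()` query often carry stray whitespace or are empty. Below is a small sketch of a cleanup step that could run before joining the sentences. The helper name `clean_sentences` is hypothetical, and the sample input is made up for illustration.

```python
def clean_sentences(raw_texts):
    """Strip surrounding whitespace from each text node and drop empty ones."""
    cleaned = []
    for text in raw_texts:
        # str.strip() also removes full-width spaces (\u3000) used for indentation.
        stripped = text.strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


# Assumed sample data standing in for the XPath result.
sample = ["\u3000\u3000first sentence ", "\r\n", "second sentence"]
print("\n".join(clean_sentences(sample)))
```

If the results are appended to a file, writing a trailing newline after each page keeps sentences from consecutive pages from running together.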

When testing, adjust the amount of data you crawl; there is no need to pull everything down to your machine in one go.
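One way to cap the amount of data during a test run is to take only the first few URLs with `itertools.islice`. This is a sketch under assumptions: the helper name `limited` is hypothetical, and the sample URLs merely follow the pagination pattern described earlier.

```python
from itertools import islice


def limited(urls, max_items):
    """Return an iterator over at most max_items URLs, in their original order."""
    return islice(urls, max_items)


# Ten sample list pages; only the first three are visited in this test run.
sample = [f"https://www.qunzou.com/xuexi/list_1_{i}.html" for i in range(1, 11)]
for url in limited(sample, 3):
    print(url)
```

Inside the spider, the same idea would mean looping over `limited(self.wait_urls, 3)` or `limited(self.details, 10)` instead of the full lists.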

As usual, the complete code is uploaded to: codechina.csdn.net/hihell/python120

📢📢📢📢📢📢
💗 You are reading the blog of 【梦想橡皮擦】
👍 If you enjoyed the read, give it a like
🌻 Spotted a mistake? Point it out in the comments
📆 This is 橡皮擦's 788th original blog post

Cases are guaranteed to be kept up to date for 5 years from the date of purchase.
