10994 部漫画信息大采集，竟然存在反爬！

Posted 2021-06-30 梦想橡皮擦

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了10994 部漫画信息大采集，竟然存在反爬！相关的知识，希望对你有一定的参考价值。

橡皮擦的周末时间，浏览互联网，畅游知识的海洋，寻找好看的动漫，然后就发现了本文的主角，一个来自台湾省的网站。

Kindle 漫畫|Kobo 漫畫|epub 漫畫大采集

数据源分析

爬取目标

本次要爬取的网站是 https://vol.moe/，该网站打开的第一眼，就给我呈现了一个大数，收录 10994 部漫画，必须拿下。

10994 部漫画信息大采集，竟然存在反爬！
为了降低博客的篇幅，还有大家练习的难度，本文只针对列表页抓取，里面涉及的目标数据结构如下：

漫画标题
漫画评分
漫画详情页链接
作者

同步保存漫画封面，文件名为漫画标题。

漫画详情页还存在大量的标签，因不涉及数据分析，故不再进行提取。

使用的 Python 模块

爬取模块 requests，数据提取模块 re，多线程模块 threading

重点学习内容

爬虫基本套路
数据提取，重点在行内 CSS 提取
CSV 格式文件存储
IP 限制反爬，没想到目标网站有反爬。

列表页分析

通过点击测试，得到的页码变化逻辑如下：

1. https://vol.moe/l/all,all,all,sortpoint,all,all,BL/1.htm
2. https://vol.moe/l/all,all,all,sortpoint,all,all,BL/2.htm
3. https://vol.moe/l/all,all,all,sortpoint,all,all,BL/524.htm

页面数据为服务器直接静态返回，即加载到 html 页面中。

10994 部漫画信息大采集，竟然存在反爬！
目标数据所在的 HTML DOM 结构如下所示。

在正式编写代码前，可以先针对性的处理一下正则表达式部分。

匹配封面图

编写该部分正则表达式时，出现如下图所示问题，折行+括号。
10994 部漫画信息大采集，竟然存在反爬！
基于正则表达式编写经验，正则表达式如下，重点学习 \\s 可匹配换行符。

<div class="img_book"[.\\s]*style="background:url\\((.*?)\\)

匹配文章标题、作者、详情页地址

由于该部分内容所在的 DOM 行格式比较特殊，可以直接匹配出结果，同时下述正则表达式使用了分组命名。

<a href='(?P<url>.*?)'>(?P<title>.*?)</a> <br /> (?P<author>.*?) <br />

对于动漫评分部分的匹配就非常简单了，直接编写如下正则表达式即可。

<p style=".*?"><b>(.*?)</b></p>

整理需求如下

开启 5 个线程对数据进行爬取；
保存所有的网页源码；
依次读取源码，提取元素到 CSV 文件中。

本案例将优先保存网页源码到本地，然后在对本地 HTML 文件进行处理。

编码时间

在编码测试的过程中，发现该网站存在反爬措施，即爬取过程中，如果网站发现是爬虫，直接就把 IP 给封杀掉了，程序还没有跑起来，就结束了。

再次测试，发现反爬技术使用重定向操作，即服务器发现是爬虫之后，直接返回状态码 302，并重定向谷歌网站，这波操作可以。

受限制的时间，经测试大概是 24 个小时，因此如果希望爬取到全部数据，需要通过不断切换 IP 和 UA ，将 HTML 静态文件保存到本地中。

下述代码通过判断目标网站返回状态码是否为 200，如果为 302，更换代理 IP，再次爬取。

import requests
import re
import threading
import time
import random

# 以下为UA列表，为节约博客篇幅，列表只保留两项，完整版请去CSDN代码频道下载
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
]


# 循环获取 URL
def get_image(base_url, index):
    headers = {
        "User-Agent": random.choice(USER_AGENTS)

    }

    print(f"正在抓取{index}")
    try:
        res = requests.get(url=base_url, headers=headers, allow_redirects=False, timeout=10)
        print(res.status_code)
        # 当服务器返回 302状态码之后
        while res.status_code == 302:
        	# 可直接访问 http://118.24.52.95:5010/get/ 获得一个代理IP
            ip_json = requests.get("http://118.24.52.95:5010/get/", headers=headers).json()
            ip = ip_json["proxy"]
            proxies = {
                "http": ip,
                "https": ip
            }
            print(proxies)
            # 使用代理IP继续爬取
            res = requests.get(url=base_url, headers=headers, proxies=proxies, allow_redirects=False, timeout=10)
            time.sleep(5)
            print(res.status_code)

        else:
            html = res.text
            # 读取成功，保存为 html 文件
            with open(f"html/{index}.html", "w+", encoding="utf-8") as f:
                f.write(html)

        semaphore.release()
    except Exception as e:
        print(e)
        print("睡眠 10s，再去抓取")
        time.sleep(10)
        get_image(base_url, index)


if __name__ == '__main__':
    num = 0
    # 最多开启5个线程
    semaphore = threading.BoundedSemaphore(5)
    lst_record_threads = []
    for index in range(1, 525):
        semaphore.acquire()
        t = threading.Thread(target=get_image, args=(
            f"https://vol.moe/l/all,all,all,sortpoint,all,all,BL/{index}.htm", index))
        t.start()
        lst_record_threads.append(t)

    for rt in lst_record_threads:
        rt.join()

只开启了 5 个线程，爬取过程如下所示。

10994 部漫画信息大采集，竟然存在反爬！
经过 20 多分钟的等待，524 页数据，全部以静态页形式保存在了本地 HTML 文件夹中。

10994 部漫画信息大采集，竟然存在反爬！
当文件全部存储到本地之后，在进行数据提取就非常简单了

编写如下提取代码，使用 os 模块。

import os
import re
import requests


def reade_html():
    path = r"E:\\pythonProject\\test\\html"
    # 读取文件
    files = os.listdir(path)

    for file in files:
    	# 拼接完整路径
        file_path = os.path.join(path, file)
        with open(file_path, "r", encoding="utf-8") as f:
            html = f.read()
            # 正则提取图片
            img_pattern = re.compile('<div class="img_book"[.\\s]*style="background:url\\((.*?)\\)')
            # 正则提取标题
            title_pattern = re.compile("<a href='(?P<url>.*?)'>(?P<title>.*?)</a> <br /> \\[(?P<author>.*?)\\] <br />")
            # 正则提取得到
            score_pattern = re.compile('<p style=".*?"><b>(.*?)</b></p>')
            img_urls = img_pattern.findall(html)
            details = title_pattern.findall(html)
            scores = score_pattern.findall(html)

            # save(details, scores)
            for index, url in enumerate(img_urls):
                save_img(details[index][1], url)

# 数据保存成 csv 文件
def save(details, scores):
    for index, detail in enumerate(details):
        my_str = "%s,%s,%s,%s\\n" % (detail[1].replace(",", "，"), detail[0], detail[2].replace(",", "，"), scores[index])
        with open("./comic.csv", "a+", encoding="utf-8") as f:
            f.write(my_str)

# 图片按照动漫标题命名
def save_img(title, url):
    print(f"正在抓取{title}--{url}")
    headers = {
        "User-Agent": "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
    }
    try:
        res = requests.get(url, headers=headers, allow_redirects=False, timeout=10)

        data = res.content
        with open(f"imgs/{title}.jpg", "wb+") as f:
            f.write(data)

    except Exception as e:
        print(e)


if __name__ == '__main__':
    reade_html()