Batch-downloading your own cnblogs (博客园) articles with Scrapy

Posted 2020-09-07 曾是土木人

First, define a few fields in items.py to hold the page data (URL, title, and HTML source), as follows:

import scrapy


class MycnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page_title = scrapy.Field()
    page_url = scrapy.Field()
    page_html = scrapy.Field()

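The most important part is the spider. It inherits from CrawlSpider, which lets us define regular-expression rules that tell the crawler which pages to fetch, for example the next-page links and the links to the individual articles.

In the spider, the parse_item method parses each target page and extracts the article's URL, title, and content.

Note: in parse_item we add a base tag to the HTML source we receive. That way, opening the downloaded HTML file does not produce a broken layout, because relative links resolve against cnblogs and the page uses the cnblogs CSS.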
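The base-tag step from the note can be tried in isolation. The sketch below is not the spider's own code: it is a slightly more tolerant variant that also matches a `<head>` tag written in upper case or carrying attributes, which a plain string replace on `"<head>"` would miss. The names `inject_base_tag`, `BASE_TAG`, and `HEAD_OPEN`, and the sample page, are made up for illustration.

```python
import re

# Base tag pointing at cnblogs, the same URL the spider injects.
BASE_TAG = "<base href='http://www.cnblogs.com/'>"

# Matches the opening <head> tag, with or without attributes, in any letter case.
# It does not match <header>, because "<head" must be followed by whitespace or ">".
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_base_tag(html, base_tag=BASE_TAG):
    """Insert base_tag right after the first opening <head> tag.

    The HTML is returned unchanged when no <head> tag is found.
    """
    return HEAD_OPEN.sub(lambda m: m.group(0) + base_tag, html, count=1)


# Hypothetical page, for illustration only
page = "<html><HEAD lang='zh'><link href='/css/blog.css'></HEAD><body></body></html>"
patched = inject_base_tag(page)
```

With the base tag in place, a relative link such as `/css/blog.css` is resolved by the browser against `http://www.cnblogs.com/` rather than against the local file path.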

The spider source is as follows:

# -*- coding: utf-8 -*-
from mycnblogs.items import MycnblogsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CnblogsSpider(CrawlSpider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/hongfei/']
    rules = (
        # Follow the next-page links; with no callback, follow defaults to True
        Rule(LinkExtractor(allow=(r'default\.html\?page=\d+',))),
        # Crawl every article and parse it with parse_item to get its URL, title, and content
        Rule(LinkExtractor(allow=(r'hongfei/p/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'hongfei/articles/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'hongfei/archive/\d+/\d+/\d+/\d+\.html',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = MycnblogsItem()
        item['page_url'] = response.url
        item['page_title'] = response.xpath("//title/text()").extract_first()
        html = response.body.decode("utf-8")
        # Add a base tag so that relative CSS and image links resolve against cnblogs
        html = html.replace("<head>", "<head><base href='http://www.cnblogs.com/'>")
        item['page_html'] = html
        yield item
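To see which URLs the rules would pick up, the patterns can be tested with the standard re module alone. LinkExtractor treats each allow entry as a regular expression searched for anywhere in the absolute URL, so `re.search` is a reasonable stand-in here. This is only a sketch: the `classify` helper is hypothetical, the patterns are the spider's with the literal dots escaped, and apart from the address of the original post the sample URLs are invented.

```python
import re

# Patterns taken from the spider's rules, written as raw strings
# with the literal dots escaped.
PAGINATION = r"default\.html\?page=\d+"
ARTICLE_PATTERNS = [
    r"hongfei/p/",
    r"hongfei/articles/",
    r"hongfei/archive/\d+/\d+/\d+/\d+\.html",
]


def classify(url):
    """Return 'pagination', 'article', or 'ignored' for a URL."""
    if re.search(PAGINATION, url):
        return "pagination"
    if any(re.search(pattern, url) for pattern in ARTICLE_PATTERNS):
        return "article"
    return "ignored"


# Sample URLs; all but the second are hypothetical
samples = [
    "http://www.cnblogs.com/hongfei/default.html?page=2",
    "http://www.cnblogs.com/hongfei/p/6659934.html",
    "http://www.cnblogs.com/hongfei/archive/2012/06/12/2546397.html",
    "http://www.cnblogs.com/hongfei/category/123.html",
]
results = [classify(url) for url in samples]
```

Checking the patterns this way before a crawl is cheaper than discovering afterwards that a rule matched nothing, or matched far more than intended.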

In pipelines.py, the process_item method handles each item the spider returns:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs

class MycnblogsPipeline(object):

    def process_item(self, item, spider):
        # Save each page as ./blogs/<title>.html; the blogs directory must already exist
        file_name = './blogs/' + item['page_title'] + '.html'
        with codecs.open(filename=file_name, mode='wb', encoding='utf-8') as f:
            f.write(item['page_html'])
        return item
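The pipeline uses the page title directly as a file name, which can fail when a title contains characters such as `/` or `?`, when the title is missing, or when the blogs directory does not exist yet. A minimal sketch of a more defensive save step, using only the standard library, might look like the following. The helpers `safe_file_name` and `save_page` are hypothetical and are not part of the original project.

```python
import os
import re

# Characters that are not allowed in Windows file names, plus control characters.
UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_file_name(title, fallback="untitled"):
    """Turn a page title into a string that is safe to use as a file name."""
    cleaned = UNSAFE.sub("_", (title or "").strip())
    return cleaned or fallback


def save_page(title, html, out_dir="./blogs"):
    """Write html to out_dir/<safe title>.html and return the path."""
    os.makedirs(out_dir, exist_ok=True)  # create the directory if it is missing
    path = os.path.join(out_dir, safe_file_name(title) + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```

Inside process_item, a call such as `save_page(item['page_title'], item['page_html'])` could then stand in for the codecs.open block. Note that two articles with the same title would still overwrite each other.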

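Typical uses of an item pipeline include:

Cleaning HTML data
Validating scraped data (checking that the item contains certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database

To enable an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in this example: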

ITEM_PIPELINES = {
   'mycnblogs.pipelines.MycnblogsPipeline': 300,
}
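The integer assigned to each class sets the order in which items pass through the pipelines: lower values run first, and values are conventionally kept in the 0-1000 range. The sketch below shows how a second pipeline could be slotted in ahead of the file writer. `DuplicatesPipeline` is a hypothetical class name, not something defined in this article, and the `run_order` line is there only to make the ordering visible.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    # Hypothetical pipeline that would drop pages whose URL was already seen
    "mycnblogs.pipelines.DuplicatesPipeline": 100,
    # The pipeline from this article, which writes the HTML file
    "mycnblogs.pipelines.MycnblogsPipeline": 300,
}

# Illustration only, not needed in a real settings file:
# the pipelines sorted by their order value, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Putting the duplicate check at a lower number means a dropped item never reaches the pipeline that writes to disk.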

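After the spider runs (scrapy crawl cnblogs from the project directory), all the articles are saved locally in the blogs directory.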

Original article: http://www.cnblogs.com/hongfei/p/6659934.html
