Using CrawlSpider

Posted 2023-04-19


Reference A

CrawlSpider is the spider commonly used for sites whose URLs follow regular patterns. It builds on Spider and adds a few attributes of its own.

rules is a collection of Rule objects.

Using Chinaz as an example:

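Because CrawlSpider inherits from Spider, it has all of Spider's methods.
First, start_requests issues a request for every URL in start_urls (through make_requests_from_url in older Scrapy releases), and the response is delivered to parse. In a plain Spider we write parse ourselves, but CrawlSpider defines parse itself and uses it to process the response ( self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True) ), so a CrawlSpider subclass should not override parse.
_parse_response then acts differently depending on whether a callback was passed and on the values of follow and self._follow_links.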

eg:
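The branching can be modelled in a few lines of plain Python. This is a simplified stand-in, not Scrapy's actual code: the function name parse_response, its flat argument list, and the two demo helpers are assumptions, and the real method also passes callback output through process_results.

```python
def parse_response(response, callback, cb_kwargs, follow, follow_links, requests_to_follow):
    """Simplified model of the control flow (not Scrapy's actual code)."""
    # 1. If a callback was given, run it and pass on whatever it produces.
    if callback:
        for result in callback(response, **cb_kwargs) or ():
            yield result
    # 2. Only extract and follow links when both flags are set.
    if follow and follow_links:
        for request in requests_to_follow(response):
            yield request


def demo_callback(response):
    # Stand-in for a rule callback that yields one item.
    yield {"item_from": response}


def demo_requests_to_follow(response):
    # Stand-in for the link-following step.
    yield "request for a link found in " + response


# Callback but follow=False: the page is parsed, no links are followed.
only_items = list(parse_response("page-1", demo_callback, {}, False, True, demo_requests_to_follow))
# No callback but follow=True: links are followed, nothing is parsed.
only_links = list(parse_response("page-1", None, {}, True, True, demo_requests_to_follow))
```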

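Here _requests_to_follow takes each rule's link_extractor (the LinkExtractor we passed in), collects the links it finds on the page ( link_extractor.extract_links(response) ), post-processes them ( process_links , which you define yourself if needed), and issues a Request for every link that remains. Each request is passed through process_request (also user-defined) before it is scheduled.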
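The sequence of steps can be sketched without Scrapy. In this rough model a "response" is just a list of URLs, a rule is a dict, and a request is a dict; all of these shapes, and the helper names, are assumptions chosen to keep the sketch self-contained.

```python
def requests_to_follow(response, rules):
    """Simplified model of CrawlSpider._requests_to_follow (not Scrapy's code)."""
    seen = set()
    for index, rule in enumerate(rules):
        # 1. Ask this rule's link extractor for links not already claimed.
        links = [link for link in rule["link_extractor"](response) if link not in seen]
        # 2. Optional hook: filter or rewrite the extracted links.
        if links and rule.get("process_links"):
            links = rule["process_links"](links)
        for link in links:
            seen.add(link)
            # 3. Build a request tagged with the rule it came from.
            request = {"url": link, "rule": index}
            # 4. Optional hook: adjust the request before it is scheduled.
            process_request = rule.get("process_request")
            yield process_request(request) if process_request else request


def extract_articles(response):
    # Stand-in for LinkExtractor.extract_links.
    return [url for url in response if "/article/" in url]


rules = [{
    "link_extractor": extract_articles,
    "process_links": lambda links: [l for l in links if "print=1" not in l],
    "process_request": lambda request: {**request, "priority": 1},
}]
page = ["/article/1", "/article/1?print=1", "/about"]
requests = list(requests_to_follow(page, rules))
```

In a real spider the two hooks are usually given to Rule as method names, for example process_links='drop_print_views'.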

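In its __init__ method, CrawlSpider calls _compile_rules , which makes a shallow copy of every Rule in rules and resolves, for each copy, the callback ( callback ), the link-processing hook ( process_links ) and the request-processing hook ( process_request ).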

eg:
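A simplified model of that compilation step, written with stand-in classes rather than Scrapy's own. The Rule and MiniCrawlSpider classes below are assumptions that keep only the fields discussed here.

```python
import copy


class Rule:
    """Stand-in for scrapy.spiders.Rule with just the fields discussed here."""

    def __init__(self, link_extractor, callback=None, process_links=None, process_request=None):
        self.link_extractor = link_extractor
        self.callback = callback
        self.process_links = process_links
        self.process_request = process_request


class MiniCrawlSpider:
    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        def get_method(method):
            # Strings name a method on the spider; callables pass through.
            if callable(method):
                return method
            if isinstance(method, str):
                return getattr(self, method, None)
            return None

        # Shallow copies, so the class-level rules keep their string names.
        self._rules = [copy.copy(rule) for rule in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)


class DemoSpider(MiniCrawlSpider):
    rules = (Rule("extractor-placeholder", callback="parse_item"),)

    def parse_item(self, response):
        return [response]


spider = DemoSpider()
```

Because the copy is shallow, each compiled rule shares its link extractor object with the rule declared on the class, while the string 'parse_item' is replaced by the bound method only on the copy.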

As a result, the LinkExtractor passed to a Rule is what ends up in that rule's link_extractor attribute.

Like Spider, CrawlSpider issues its first requests through start_requests.

Using Zhihu as an example:

This is only a personal study summary; if anything is wrong, corrections are welcome.

The Scrapy Framework: CrawlSpider

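The CrawlSpider class is a subclass of Spider
  - Ways to crawl a whole site
    - With Spider: issue the requests manually
    - With CrawlSpider
  - Using CrawlSpider:
    - Create a project
    - cd XXX
  - Create the spider file (CrawlSpider):
    - scrapy genspider -t crawl xxx www.xxxx.com
    - Link extractor:
      - Purpose: extract the links that match a given rule (allow)
    - Rule parser:
      - Purpose: parse the pages behind the extracted links according to the specified rule (callback)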

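Example: scrape the item number and news title from the list pages of the sun site, plus the ID and news content from the detail pages

Analysis: the data to scrape is not all on one page.
  - 1. Use a link extractor to extract all the page-number links
  - 2. Use a second link extractor to extract the links to every news detail page

Spider file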

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from Sun.items import SunItem, DetailItem


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&type=4']
    # Link extractors: pull out the links that match the given pattern.
    # \d matches a digit; the literal ? in the detail URL must be escaped as \?
    link = LinkExtractor(allow=r'id=1&page=\d+')
    link_detail = LinkExtractor(allow=r'index\?id=\d+')
    rules = (
        # Rule: request every extracted link and parse the response with the callback
        Rule(link, callback='parse_item', follow=False),
        # follow=True would re-apply the link extractor to the pages reached through the extracted links.
        # With that setting every page number gets crawled; the scheduler filters out duplicate requests.
        Rule(link_detail, callback='parse_detail', follow=False),
    )

    # The two callbacks below cannot pass data to each other through request arguments,
    # so each one stores its data in its own item type.
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            new_num = li.xpath('./span[1]/text()').extract_first()
            title = li.xpath('./span[3]/a/text()').extract_first()
            item = SunItem()
            item['new_num'] = new_num
            item['title'] = title
            yield item

    def parse_detail(self, response):
        print(111)  # debug marker: the detail callback was reached
        item = DetailItem()
        new_id = response.xpath('/html/body/div[3]/div[2]/div[2]/div[1]/span[4]/text()').extract_first()
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]//text()').extract_first()
        print(new_id, content)
        item['new_id'] = new_id
        item['content'] = content
        yield item

pipelines.py

class SunPipeline:

    def process_item(self, item, spider):
        # Both item types arrive here; tell them apart by class name
        if item.__class__.__name__ == 'SunItem':
            new_num = item['new_num']
            title = item['title']
        else:
            print(22)  # debug marker: a DetailItem was received
            new_id = item['new_id']
            content = item['content']
            print(new_id, content)
        return item

Note: when writing a regular expression, special characters must be escaped (for example, a literal ? is written \?).
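The effect is easy to check with the standard re module. In this small sketch the detail URL is a hypothetical one shaped like the index?id=... links the spider targets.

```python
import re

# Hypothetical detail-page URL; the path and ID are assumptions.
detail_url = "http://wz.sun0769.com/political/politics/index?id=449363"

# Unescaped: ? makes the preceding x optional, so this looks for "indeid=" or "indexid=".
unescaped = re.compile(r"index?id=\d+")
# Escaped: \? matches a literal question mark.
escaped = re.compile(r"index\?id=\d+")

unescaped_match = unescaped.search(detail_url)
escaped_match = escaped.search(detail_url)

# re.escape can build the literal part of a pattern automatically.
built = re.compile(re.escape("index?id=") + r"\d+")
built_match = built.search(detail_url)
```

Only the escaped pattern and the one built with re.escape find the URL; the unescaped pattern silently matches nothing, which in a CrawlSpider means the rule never fires.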
