Scrapy spider crawls only one link per page

I want to scrape all the event data from http://www.nyhistory.org/programs/upcoming-public-programs. The events are paginated, five per page. I created two rules: one that follows the next page, and one that follows each event's detail page. So I expect the spider to first enter each event's URL, collect all the data I need there, then move on to the next page, enter each event's URL again, and so on. However, for some reason my spider visits only one event per page, and it is always the first one. See my code below:

import scrapy
from nyhistory.items import EventItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from datetime import datetime
from w3lib.html import remove_tags
from scrapy.selector import Selector
import re

class NyhistorySpider(CrawlSpider):

    name = "events"

    start_urls = ['http://www.nyhistory.org/programs/upcoming-public-programs',]

    rules = [Rule(LinkExtractor(allow='.*?page=.*',restrict_xpaths='//li[@class="pager-next"]'), follow=True),
            Rule(LinkExtractor(restrict_xpaths='//div[@class="view-content"]/div[contains(@class,"views-row")]'), callback='parse_event_details',follow=True),
            ]

    def parse_event_details(self, response):

        base_url = 'http://www.nyhistory.org'

        item = EventItem()
        item['title'] = response.xpath('//div[@class="views-field-title"]//text()')[2].extract()
        item['eventWebsite'] = response.url

        details_area = response.xpath('//div[@class="body-programs"]')
        details_area_str = " ".join(details_area.extract())
        details_area_str_split = re.split('EVENT DETAILS|LOCATION|PURCHASING TICKETS', details_area_str)
        speakers_names_area = details_area_str_split[1]
        # Extract the speaker names from the <strong> tags in the details block
        speakersNames = Selector(text=speakers_names_area).xpath('//strong/text()').extract()
        try:
            item['speaker1FirstName'] = speakersNames[0].split()[0]
            item['speaker1LastName'] = speakersNames[0].split()[1]
        except:
            item['speaker1FirstName'] = ''
            item['speaker1LastName'] = ''

        description = remove_tags(details_area_str_split[1]).strip()
        item['description'] = description

        try:
            address_line = remove_tags(details_area_str_split[2]).strip()
            item['location'] = address_line.split(',')[0]
            item['city'] = address_line.split(',')[-2].strip()
            item['state'] = address_line.split(',')[-1].split()[0]
            item['zip'] = address_line.split(',')[-1].split()[1]
            item['street'] = address_line.split(',')[1].strip()
        except:
            item['location'] = ''
            item['city'] = ''
            item['state'] = ''
            item['zip'] = ''
            item['street'] = ''

        try:
            item['dateFrom'] = self.date_converter(response.xpath('//span[@class="date-display-single"]/text()').extract_first(default='').rstrip(' - '))
        except:
            try:
                item['dateFrom'] = response.xpath('//span[@class="date-display-single"]/text()').extract()[1].split('|')[0]
            except:
                item['dateFrom'] = ''
        try:
            item['startTime'] = self.time_converter(response.xpath('//span[@class="date-display-start"]/text()')[1].extract())
            # item['endTime'] = self.time_converter(response.xpath('//span[@class="date-display-end"]/text()')[1].extract())
        except:
            try:
                item['startTime'] = self.time_converter(response.xpath('//span[@class="date-display-single"]/text()').extract()[1].split(' | ')[1])
            except:
                item['startTime'] = ''
        item['In_group_id'] = ''
        try:
            item['ticketUrl'] = base_url + response.xpath('//a[contains(@class,"btn-buy-tickets")]/@href').extract_first()
        except:
            item['ticketUrl'] = ''
        item['eventImage'] = response.xpath('//div[@class="views-field-field-speaker-photo-1"]/div/div/img/@src').extract_first(default='')
        item['organization'] = "New York Historical Society"

        yield item

    @staticmethod
    def date_converter(raw_date):
        try:
            raw_date_datetime_object = datetime.strptime(raw_date.replace(',',''), '%a %m/%d/%Y')
            final_date = raw_date_datetime_object.strftime('%d/%m/%Y')
            return final_date
        except:
            raw_date_datetime_object = datetime.strptime(raw_date.replace(',','').replace('th','').strip(), '%a %B %d %Y')
            final_date = raw_date_datetime_object.strftime('%d/%m/%Y')
            return final_date
    @staticmethod
    def time_converter(raw_time):
        raw_time_datetime_object = datetime.strptime(raw_time, '%I:%M %p')
        final_time = raw_time_datetime_object.strftime('%I:%M %p')
        return final_time 
Answer

When using CrawlSpider, the rules follow the matching links, just as you described, "until" they reach the items you actually want to extract.

But how does the spider (or the rule) know when to stop? That is what the callback and follow attributes are for. If you set a callback, you usually don't need follow (the callback says that link should be downloaded and handed to your parsing method), and if you set follow, you don't need a callback, because it tells the spider to keep extracting new links from those pages. In fact, follow defaults to True when no callback is given, and to False otherwise.

You will have to define your rules more precisely, making explicit which links should only be followed and which should be passed to a callback.
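
For illustration, here is a minimal sketch of how the two rules could be made explicit, reusing the XPaths from the question (whether those XPaths still match the live page is an assumption):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NyhistorySpider(CrawlSpider):

        name = "events"
        start_urls = ['http://www.nyhistory.org/programs/upcoming-public-programs']

        rules = [
            # Pagination: follow the "next page" link, but don't parse it as an item.
            Rule(LinkExtractor(allow=r'page=',
                               restrict_xpaths='//li[@class="pager-next"]'),
                 follow=True),
            # Event details: every link inside a views-row div is a leaf page,
            # so hand it to the callback and stop following links from there.
            Rule(LinkExtractor(restrict_xpaths='//div[@class="view-content"]'
                                               '/div[contains(@class,"views-row")]'),
                 callback='parse_event_details',
                 follow=False),
        ]

        def parse_event_details(self, response):
            # Stub: build and yield the EventItem exactly as in the question.
            yield {'eventWebsite': response.url}

You can also check which links each LinkExtractor actually picks up from the listing page before running the full crawl, e.g. in scrapy shell: LinkExtractor(restrict_xpaths='//div[@class="view-content"]/div[contains(@class,"views-row")]').extract_links(response).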
