Scrapy: combining yields into the same package


I am trying to scrape many pages for different years and add them all to one package that should then go to the pipeline. This is what I have:

def parse_main_page(self, response):
    dropdown = response.xpath(".//form[@name='somenname']")
    if dropdown:
        options = dropdown.xpath("./select//option")
        for option in options:
            text = option.xpath("./text()").get()
            value = option.xpath("./@value").get()
            url = self.compose_fin_link(value, pkg)
            req = scrapy.Request(url, callback=self.parse_option)
            req.meta['method'] = 'FINSIT'
            req.meta['item'] = pkg
            req.meta['year'] = text
            yield req

def parse_option(self, response):

    pkg = response.meta['item']
    year = response.meta['year']
    pkg[year] = dict()
    main = response.xpath(".//div[@id = 'main']//table//tr")
    for row in main:
        texts = row.xpath(".//font/text()").getall()
        texts = [x.replace('.','') for x in texts]
        if len(texts) > 1:
            pkg[year][texts[0]] = texts[1]
        else:
            pass
    yield pkg

For example, if the dropdown has 3 options, I will get 3 packages in my pipeline, each one yielded by parse_option. Instead of 3 separate packages, I need to yield only 1 package containing all three options.

Outside of Scrapy I would do something like this:

def parse_main_page(self, response):
    pkg = dict()
    for option in options:
        pkg[year] = self.parse_option(url)
    yield pkg

def parse_option(self, response):

    ##Do something here
    return option_content
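That synchronous version can be made concrete with a small, self-contained sketch. The option URLs and the `fetch_option` helper below are hypothetical stand-ins for real HTTP requests; the point is just that all years accumulate into one dict before anything is emitted:

```python
# Minimal synchronous sketch of the goal: fetch every dropdown option,
# store each result under its year, and emit ONE combined package.
# `fetch_option` is a hypothetical stand-in for a real HTTP request.

def fetch_option(url):
    # Pretend the page at `url` parses into a {label: value} mapping.
    return {"revenue": "for-" + url}

def build_package(options):
    pkg = {}
    for year, url in options:
        pkg[year] = fetch_option(url)   # one sub-dict per year
    return pkg                          # a single package for the pipeline

options = [("2019", "u2019"), ("2020", "u2020"), ("2021", "u2021")]
package = build_package(options)
print(len(package))  # -> 3 years collected in one package
```

This is trivial synchronously; the difficulty in Scrapy is that each request resolves in its own callback, which is what the answers below work around.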
Answer

There are a couple of ways to do this. The first is to collect the URLs into a meta attribute in the parse_main_page method, then pop one URL at a time in the parse_option method, accumulating the results into another meta key, e.g. item. Something along these lines:

def parse_main_page(self, response):
    dropdown = response.xpath(".//form[@name='somenname']")
    if dropdown:
        options = []
        for option in dropdown.xpath("./select//option"):
            text = option.xpath("./text()").get()
            value = option.xpath("./@value").get()
            url = self.compose_fin_link(value, pkg)
            options.append({
                'url': url,
                'method': 'FINSIT',
                'item': pkg,
                'year': text
            })
        next_option = options.pop()
        url = next_option.pop('url')
        meta = {
            'data': next_option,
            'options': options,
            'item': {}  # we will collect the data for each individual option to this
        }
        yield scrapy.Request(url, callback=self.parse_option, meta=meta)

def parse_option(self, response):
    # parsing logic omitted as it stays mostly the same
    # ...
    item = response.meta['item']
    current_option = response.meta['data']
    # `current_option` contains the `item` and `year` values; use it to populate the intermediate `item` dict
    options = response.meta['options']  # other options that need to get processed
    if options:
        next_option = options.pop()
        url = next_option.pop('url')
        meta = {
            'data': next_option,
            'options': options,
            'item': item  # contains data from all options processed so far
        }
        yield scrapy.Request(url, callback=self.parse_option, meta=meta)
    else:
        # no more options to process, our `item` is finalized
        yield item

It's untested, since you haven't provided all the necessary information, but in principle it should work.
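The chaining-and-accumulating logic itself can be verified outside Scrapy. In this plain-Python simulation, each call to a hypothetical `parse_option` receives the meta state, records one option's (faked) payload, and either returns the state for the next "request" or the finalized item; in Scrapy the engine would drive this loop by downloading each yielded Request:

```python
# Plain-Python simulation of the request-chaining pattern above:
# each "response" carries meta with the remaining options, and the
# accumulated `item` travels along until no options are left.
# The page parsing is faked via `fake_payload`.

def parse_option(meta, fake_payload):
    item = meta['item']
    current_option = meta['data']
    item[current_option['year']] = fake_payload  # accumulate this option's data
    options = meta['options']
    if options:                                  # more options: "chain" the next request
        next_option = options.pop()
        next_option.pop('url')                   # the url would go into scrapy.Request
        return {'data': next_option, 'options': options, 'item': item}
    return item                                  # no options left: finalized item

options = [{'url': 'u1', 'year': '2019'},
           {'url': 'u2', 'year': '2020'},
           {'url': 'u3', 'year': '2021'}]
first = options.pop()
first.pop('url')
state = {'data': first, 'options': options, 'item': {}}

while True:
    out = parse_option(state, fake_payload={'total': 42})
    if 'options' not in out:    # a finalized item has only year keys
        result = out
        break
    state = out
print(sorted(result))  # -> ['2019', '2020', '2021']
```

All three years end up in a single dict, which is exactly what gets yielded to the pipeline in the Scrapy version.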

Another option is to use scrapy-inline-requests, which lets you extract the data from all the dropdown URLs inline, within a single callback. Check out the example provided in the repo. Basically, in your case it would look like this:

@inline_requests
def parse_main_page(self, response):
    dropdown = response.xpath(".//form[@name='somenname']")
    if dropdown:
        item = {}
        for option in dropdown.xpath("./select//option"):
            text = option.xpath("./text()").get()
            value = option.xpath("./@value").get()
            url = self.compose_fin_link(value, pkg)
            response = yield scrapy.Request(url)
            # ...
            # process the response as in `parse_option` method and collect results in `item`
            # ...
        yield item
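Under the hood, inline_requests works by driving the decorated generator: each yielded Request is downloaded, and the response is sent back into the generator via `.send()`. A dependency-free sketch of that mechanism, where `fake_fetch` and plain string "URLs" stand in for the real downloader and Request objects:

```python
# Dependency-free sketch of how an `inline_requests`-style driver works:
# advance the generator, download each yielded "request", and send the
# response back in; the first non-request yield is the finished item.

def fake_fetch(url):
    return {'url': url, 'body': f'data-for-{url}'}  # pretend HTTP response

def drive(generator_fn, *args):
    gen = generator_fn(*args)
    value = next(gen)                       # run up to the first yield
    while isinstance(value, str):           # treat yielded strings as Requests
        value = gen.send(fake_fetch(value))  # resume with the "response"
    return value                            # the yielded item

def parse_main_page(urls):
    item = {}
    for year, url in urls:
        response = yield url       # inline "request"; resumes with the response
        item[year] = response['body']
    yield item                     # one item containing every year

urls = [('2019', 'u2019'), ('2020', 'u2020')]
item = drive(parse_main_page, urls)
print(item)  # -> {'2019': 'data-for-u2019', '2020': 'data-for-u2020'}
```

This is why the decorated callback can treat requests as if they were synchronous calls while still yielding only one final item.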
