Scrapy: combining yields into the same package
I am trying to scrape a number of pages for different years and add them all to the same package, which should then go to the pipeline. This is what I have:
def parse_main_page(self, response):
    dropdown = response.xpath(".//form[@name='somenname']")
    if dropdown:
        options = dropdown.xpath("./select//option")
        for option in options:
            text = option.xpath("./text()").get()
            value = option.xpath("./@value").get()
            url = self.compose_fin_link(value, pkg)
            req = scrapy.Request(url, callback=self.parse_option)
            req.meta['method'] = 'FINSIT'
            req.meta['item'] = pkg
            req.meta['year'] = text
            yield req
def parse_option(self, response):
    pkg = response.meta['item']
    year = response.meta['year']
    pkg[year] = dict()
    main = response.xpath(".//div[@id = 'main']//table//tr")
    for row in main:
        texts = row.xpath(".//font/text()").getall()
        texts = [x.replace('.', '') for x in texts]
        if len(texts) > 1:
            pkg[year][texts[0]] = texts[1]
    yield pkg
For example, if the dropdown has 3 options, my pipeline receives 3 packages, one yielded by each parse_option call. Instead of 3 packages, I need to yield only 1 package containing all three options.
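This is the expected behaviour: the engine sends every object yielded from a callback to the pipeline on its own. A tiny plain-Python sketch (no Scrapy, all names invented for illustration) of why three callback invocations mean three packages:

```python
def parse_option(year):
    # each callback invocation yields its own package
    yield {year: {"revenue": "100"}}

pipeline = []
for year in ("2018", "2019", "2020"):
    # the engine drains every callback separately
    pipeline.extend(parse_option(year))
```

The pipeline ends up with three independent one-year packages, which is exactly the behaviour the question wants to avoid.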
Outside of scrapy I would do something like this:
def parse_main_page(self, response):
    pkg = dict()
    for option in options:
        pkg[year] = self.parse_option(url)
    yield pkg

def parse_option(self, response):
    # do something here
    return option_content
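Outside a crawler, this sequential version works because each call blocks until the page is fetched and parsed. A runnable sketch of the same idea, where `fetch_option` is a hypothetical stand-in for a real HTTP request:

```python
def fetch_option(url):
    # hypothetical stand-in for a blocking HTTP request plus parsing;
    # here we just pretend the page content is the last URL segment
    return {"total": url.rsplit("/", 1)[-1]}

def parse_main_page(option_urls):
    pkg = {}
    for year, url in option_urls.items():
        pkg[year] = fetch_option(url)  # blocks until the page is parsed
    return pkg

pkg = parse_main_page({"2019": "https://example.com/2019",
                       "2020": "https://example.com/2020"})
```

Scrapy's callbacks are asynchronous, so this simple accumulation loop does not translate directly; hence the patterns described in the answer below.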
Answer
There are a couple of options for how to do this. The first is to collect the URLs into the meta attribute in the parse_main_page method, then pop one URL at a time in the parse_option method and accumulate the results under another meta key, e.g. item. Something along these lines:
def parse_main_page(self, response):
    dropdown = response.xpath(".//form[@name='somenname']")
    if dropdown:
        options = []
        for option in dropdown.xpath("./select//option"):
            text = option.xpath("./text()").get()
            value = option.xpath("./@value").get()
            url = self.compose_fin_link(value, pkg)
            options.append({
                'url': url,
                'method': 'FINSIT',
                'item': pkg,
                'year': text,
            })
        next_option = options.pop()
        url = next_option.pop('url')
        meta = {
            'data': next_option,
            'options': options,
            'item': {},  # we will collect the data for each individual option here
        }
        yield scrapy.Request(url, callback=self.parse_option, meta=meta)
def parse_option(self, response):
    # parsing logic omitted as it stays mostly the same
    # ...
    item = response.meta['item']
    current_option = response.meta['data']
    # `current_option` contains the `item` and `year` values; use it to populate the intermediate `item` dict
    options = response.meta['options']  # other options that still need to be processed
    if options:
        next_option = options.pop()
        url = next_option.pop('url')
        meta = {
            'data': next_option,
            'options': options,
            'item': item,  # contains data from all options processed so far
        }
        yield scrapy.Request(url, callback=self.parse_option, meta=meta)
    else:
        # no more options to process, our `item` is finalized
        yield item
It is untested since you did not provide all the necessary information, but in principle it should work.
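The core of the pattern — pop one pending option per response and carry the growing item forward — can be exercised without Scrapy at all. A minimal sketch under invented names, where `process_response` plays the role of `parse_option` and the returned dict plays the role of the next request's `meta`:

```python
def process_response(meta, parse_fn):
    """One 'callback' invocation: parse the current option, then either
    build the next request's meta or finish."""
    data = meta["data"]
    item = meta["item"]
    item[data["year"]] = parse_fn(data)  # merge this option's data into the shared item
    if meta["options"]:
        # chain the next "request", carrying the same item forward
        return {"data": meta["options"].pop(),
                "options": meta["options"],
                "item": item}
    return None  # no more options: item is complete

options = [{"year": y} for y in ("2018", "2019", "2020")]
meta = {"data": options.pop(), "options": options, "item": {}}
final = meta["item"]
while meta is not None:
    meta = process_response(meta, lambda d: {"parsed": d["year"]})
```

After the chain drains, `final` holds the data for all three options in a single dict — one package instead of three.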
Another option is to use scrapy-inline-requests and extract the data from the dropdown URLs as the last stage. Check out the example provided in the repo. Basically, in your case you would do something like this:
@inline_requests
def parse_main_page(self, response):
    dropdown = response.xpath(".//form[@name='somenname']")
    if dropdown:
        item = {}
        for option in dropdown.xpath("./select//option"):
            text = option.xpath("./text()").get()
            value = option.xpath("./@value").get()
            url = self.compose_fin_link(value, pkg)
            response = yield scrapy.Request(url)
            # ...
            # process the response as in the `parse_option` method and collect the results in `item`
            # ...
        yield item
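What makes `response = yield scrapy.Request(url)` possible is that the decorator drives the generator itself: it intercepts each yielded request, downloads it, and sends the response back in via `generator.send()`. A dependency-free sketch of that mechanism, with `drive` as a hypothetical stand-in for the decorator and `fetch` for the downloader:

```python
def parse_main_page(urls):
    """Generator-style callback: each `yield url` suspends until the
    driver sends the 'response' back in."""
    item = {}
    for year, url in urls.items():
        response = yield url   # stands in for `yield scrapy.Request(url)`
        item[year] = response["body"]
    yield item                 # one combined item at the end

def drive(gen, fetch):
    """Minimal stand-in for what the @inline_requests decorator does."""
    yielded = next(gen)
    while isinstance(yielded, str):          # a URL: "download" it and resume
        yielded = gen.send(fetch(yielded))
    return yielded                           # the finished item

item = drive(parse_main_page({"2019": "/fin/2019", "2020": "/fin/2020"}),
             fetch=lambda url: {"body": url.rsplit("/", 1)[-1]})
```

The callback stays a single sequential function, yet only one finished item comes out, which is exactly the behaviour the question asks for.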