Scrapy suddenly creates multiple items
[Posted]: 2022-01-21 17:42:31
[Question]: Scrapy randomly returns more nested JSON objects than expected.
Here is a shortened version of my code:
import scrapy
from scrapy import Selector
from eventSpider.items import EventspiderItem
import urllib.parse


class EventsSpider(scrapy.Spider):
    name = 'eventSpider'
    # base url that the relative links we receive are joined onto
    baseUrl = "http://www.olympedia.org"

    def start_requests(self):
        start_urls = [
            'http://www.olympedia.org/editions'
        ]
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_urls)

    def parse_urls(self, response):
        """
        Go through the table of winter olympics
        Get all the urls to those olympics events
        Send the urls down to parse_items to get the items of interest
        """
        # remove the last 2 as the events haven't happened yet
        for tr in response.xpath("//table[2]//tr")[:-2]:
            url = tr.xpath('td[1]//a//@href').extract_first()
            # check for None. In this case, we eliminate the 2 events that were cancelled
            if url is None:
                continue
            else:
                url_to_check = urllib.parse.urljoin(self.baseUrl, url)
                yield scrapy.Request(url=url_to_check, callback=self.parse_items)

    def parse_items(self, response):
        """
        Get the items of interest
        Extract the list of disciplines and their url
        pass the url
        """
        item = EventspiderItem()
        selector = Selector(response)
        table1_rows = selector.xpath("//table[1]//tr")
        item['event_title'] = table1_rows[1].xpath('td//text()').extract_first()
        item['event_place'] = table1_rows[2].xpath('td//text()').extract_first()
        table2 = selector.xpath("//table[3]//tr")
        discipline_list = []
        url_list = []
        for tr in table2:
            urls = tr.xpath('td//a//@href').extract()
            disciplines = tr.xpath('td//a//text()').extract()
            for url in urls:
                # # check if we get empty list
                # if not url:
                #     continue
                # else:
                url_list.append(url)
            for discipline in disciplines:
                discipline_list.append(discipline)
        for i, url in enumerate(url_list):
            final_url = urllib.parse.urljoin(self.baseUrl, url)
            event_name = item['event_title'] + " " + discipline_list[i]
            # meta must be a dict; the same item object is attached to every request
            yield scrapy.Request(url=final_url, callback=self.parse_sports,
                                 meta={'event_item': item, 'discipline': event_name})
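Up to this point, if I simply use return item instead of the yield on the last line, everything works fine. If I return item here, I get 23 nested JSON objects, which is exactly what I expect.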
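The problem appears when I try to yield requests for the URLs I build in final_url (there are also 23 of them): for some reason the number of nested JSON objects jumps to 248.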
    def parse_sports(self, response):
        selector = Selector(response)
        item = response.meta.get('event_item')
        return item
I don't know why this happens. Any help would be much appreciated.
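One plausible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: Scrapy records one scraped item for every item a callback returns. If parse_sports returns the shared event item once per discipline request, the total would equal the number of discipline links across all editions (which could well be 248) rather than the number of editions (23). The sketch below imitates that flow in plain Python with no Scrapy involved; the edition data and the two crawl_* function names are made up for illustration.

```python
# Plain-Python imitation of the callback flow; no Scrapy involved.
# The editions and disciplines below are made-up sample data.
EDITIONS = {
    "1924 Winter Olympics": ["Bobsleigh", "Curling", "Ice Hockey"],
    "1928 Winter Olympics": ["Bobsleigh", "Skeleton"],
}


def parse_sports(meta):
    """Stands in for the spider's parse_sports: hands back the shared item."""
    return meta["event_item"]


def crawl_one_request_per_discipline(editions):
    """Mirrors the question: one follow-up request per discipline link."""
    scraped = []
    for title, disciplines in editions.items():
        item = {"event_title": title}  # one item per edition page
        for discipline in disciplines:
            meta = {"event_item": item, "discipline": f"{title} {discipline}"}
            scraped.append(parse_sports(meta))  # every callback emits the item
    return scraped


def crawl_one_item_per_edition(editions):
    """Hypothetical alternative: emit each edition's item exactly once."""
    return [
        {"event_title": title, "disciplines": list(disciplines)}
        for title, disciplines in editions.items()
    ]


per_discipline = crawl_one_request_per_discipline(EDITIONS)
per_edition = crawl_one_item_per_edition(EDITIONS)
print(len(per_discipline))  # 5: one item per discipline link
print(len(per_edition))     # 2: one item per edition
```

If one item per edition is the goal, the item would have to be emitted once, for example after the last discipline page has been processed, instead of from every parse_sports call.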
[Answer 1]: To run an XPath query relative to the rows selected into table2, you have to prefix it with .//
Try this:
table2 = selector.xpath("//table[3]//tr")
discipline_list = []
url_list = []
for tr in table2:
    urls = tr.xpath('.//td//a//@href').extract()
    disciplines = tr.xpath('.//td//a//text()').extract()
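For context, in XPath a path with no leading slash, such as td//a, is already evaluated relative to the current row; it is a leading // that makes a query search the whole document. With rows whose td elements are direct children, both spellings should therefore select the same links. The sketch below illustrates this with the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption made to keep the example dependency-free; ElementTree supports only a subset of XPath, and the markup is made up).

```python
import xml.etree.ElementTree as ET

# Made-up table markup; ElementTree stands in for Scrapy's selectors here.
TABLE = """
<table>
  <tr><td><a href="/editions/1/sports/BOB">Bobsleigh</a></td></tr>
  <tr><td><a href="/editions/1/sports/CUR">Curling</a></td></tr>
</table>
"""

root = ET.fromstring(TABLE)
rows = []
for tr in root.findall(".//tr"):
    # No leading slash: evaluated relative to the current row.
    plain = [a.get("href") for a in tr.findall("td//a")]
    # Explicit ".//" prefix: also relative to the current row.
    dotted = [a.get("href") for a in tr.findall(".//td//a")]
    rows.append((plain, dotted))

for plain, dotted in rows:
    print(plain == dotted, plain)
```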
[Comments]:
My XPath for table2 works fine. As I said, the problem is that when I try to yield the 23 URLs, the number of items goes from 23 to 248 for some reason.
Add 1 as follows: for i, url in enumerate(url_list, 1)
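A caution about enumerate(url_list, 1): starting the counter at 1 shifts every index used in discipline_list[i], so each URL is paired with the next discipline's name and the final lookup runs past the end of the list. Pairing the two lists with zip avoids index arithmetic altogether. A minimal sketch with made-up paths and names:

```python
# Made-up sample data standing in for the scraped lists.
url_list = ["/editions/1/sports/BOB", "/editions/1/sports/CUR"]
discipline_list = ["Bobsleigh", "Curling"]

# enumerate(..., 1) starts counting at 1, so discipline_list[i] is shifted.
shifted = []
try:
    for i, url in enumerate(url_list, 1):
        shifted.append((url, discipline_list[i]))
except IndexError:
    # the last index (2) is past the end of a two-element list
    pass

# zip pairs the entries positionally without any index arithmetic.
paired = list(zip(url_list, discipline_list))

print(shifted)  # [('/editions/1/sports/BOB', 'Curling')]
print(paired)   # [('/editions/1/sports/BOB', 'Bobsleigh'), ('/editions/1/sports/CUR', 'Curling')]
```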