Scrapy: links crawled but not scraped
I made a scraper for the e-commerce site Cdiscount to crawl all the categories related to "au-quotidien". The bot is supposed to start from the top menu, then visit the second layer deep, then the third, and scrape the items. Here is my code, as a test:
import re

import scrapy


class CdiscountSpider(scrapy.Spider):
    name = "cdis_bot"  # how we have to call the bot
    start_urls = ["https://www.cdiscount.com/au-quotidien/v-127-0.html"]

    def parse(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            # only follow links containing "au-quotidien" as a whole word
            regex_top_category = r"(?=\w)" + re.escape("au-quotidien") + r"(?!\w)"
            if re.search(regex_top_category, link):
                yield response.follow(link, callback=self.parse_on_categories)  # going one layer deep from landing page

    def parse_on_categories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_on_subcategories)  # going two layers deep from landing page

    def parse_on_subcategories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_data)  # going three layers deep from landing page

    def parse_data(self, response):
        links_list = response.css("div.prdtBILDetails a::attr(href)").extract()
        regex_ean = re.compile(r'(\d+)\.html')  # the EAN is the digit run just before ".html"
        eans_list = [regex_ean.search(link).group(1) for link in links_list if regex_ean.search(link)]
        desc_list = response.css("div.prdtBILTit::text").extract()
        price_euros = response.css("span.price::text").extract()
        price_cents = response.css("span.price sup::text").extract()
        for euro, cent, ean, desc in zip(price_euros, price_cents, eans_list, desc_list):
            if len(ean) > 6:
                yield {'ean': ean, 'price': euro + cent, 'desc': desc, 'company': "cdiscount", 'url': response.url}
My problem is that only the links get retrieved.
For example:
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/legumes-secs/l-127015303.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/semoules/l-127015302.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
But I get only very few scraped items, always from the same category, like this:
2018-12-18 14:40:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html>
{'ean': '2009818241269', 'price': '96€00', 'desc': 'Heidsieck & Co Monopole 75cl x6', 'company': 'cdiscount', 'url': 'https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html'}
Whereas it seems to me that the other categories share the same item selectors.
I would be grateful if you could help me figure out where I'm wrong :) Thank you
Answer
It looks like the responses your parse_data() method receives vary quite a bit. For example, these are the first three URLs it parsed on a sample run:
https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-millesime/l-1293404.html
https://www.cdiscount.com/vin-champagne/coffrets-cadeaux/v-12960-12960.html
https://www.cdiscount.com/au-quotidien/alimentaire/bio/boisson-bio/jus-de-tomates-bio/l-12701271315.html
It's evident (even from a quick glance) that the structure of each of these pages is different. In most cases, your eans_list and desc_list end up empty, so the zip() call produces no results.
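As a first debugging step, you can make those empty pages visible instead of letting zip() swallow them silently. Below is a minimal sketch of a revised parse_data() (it reuses the original selectors; the early-return guard and the log message are illustrative additions, not part of the original spider):

    def parse_data(self, response):
        links_list = response.css("div.prdtBILDetails a::attr(href)").extract()
        desc_list = response.css("div.prdtBILTit::text").extract()

        # If either selector matched nothing, this page does not use the
        # product-listing layout the spider expects; log it rather than
        # failing silently in zip().
        if not links_list or not desc_list:
            self.logger.debug("No product data found on %s", response.url)
            return

        regex_ean = re.compile(r'(\d+)\.html')
        eans_list = [regex_ean.search(link).group(1) for link in links_list if regex_ean.search(link)]
        price_euros = response.css("span.price::text").extract()
        price_cents = response.css("span.price sup::text").extract()
        for euro, cent, ean, desc in zip(price_euros, price_cents, eans_list, desc_list):
            if len(ean) > 6:
                yield {'ean': ean, 'price': euro + cent, 'desc': desc,
                       'company': "cdiscount", 'url': response.url}

Running the crawl with this guard in place shows exactly which followed URLs lack the prdtBILDetails markup, and therefore where the link extraction needs a tighter filter, for example applying the same regex_top_category check from parse() in parse_on_categories() and parse_on_subcategories() so the spider stays inside the au-quotidien tree.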