Scraping e-commerce using scrapy
【Title】: Scraping e-commerce using scrapy 【Posted】: 2020-07-02 20:08:55 【Question】: I am using scrapy to scrape the Amazon website, just to learn. When shopping by category you get a list of products, and clicking a product opens its detail page. I have finished the basic part of scraping details from the product list, such as the product name, price, and link. But I want those scraped links to be followed on the spot, so that each product's detail page is scraped within the same program.
import scrapy

from ..items import AmazonscrapyItem  # adjust to your project's items module


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = [
        'https://www.amazon.co.uk/s?me=A1NZU6VUR85CVU&marketplaceID=A1F83G8C2ARO7P'
    ]

    def parse(self, response):
        items = AmazonscrapyItem()
        all_div_quotes = response.css('body')
        for quotes in all_div_quotes:
            product = quotes.css('.a-color-base.a-text-normal').css('::text').extract()
            price = quotes.css('.a-offscreen').css('::text').extract()
            brand = quotes.css('.s-image::attr(src)').extract()
            asin = quotes.css(
                '.sg-col-20-of-24.s-result-item.sg-col-0-of-12.sg-col-28-of-32.sg-col-16-of-20.sg-col.sg-col-32-of-36.sg-col-12-of-16.sg-col-24-of-28::attr(data-asin)').extract()
            productlink = quotes.css('.a-link-normal.a-text-normal').css('::attr(href)').extract()
            items['product'] = product
            items['price'] = price
            items['brand'] = brand
            items['asin'] = asin
            items['productlink'] = productlink
            yield items
        next_page_link = response.css('.a-last a::attr(href)').extract_first()
        if next_page_link:  # stop paginating when there is no "next" link
            next_page_link = response.urljoin(next_page_link)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
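One detail worth noting about the code above: hrefs extracted from a listing page are usually relative, and `response.urljoin` (which wraps the standard library's `urljoin`) resolves them against the page URL. A standalone sketch of that resolution (the product href below is made up for illustration):

```python
from urllib.parse import urljoin

listing_url = "https://www.amazon.co.uk/s?me=A1NZU6VUR85CVU&marketplaceID=A1F83G8C2ARO7P"
relative_href = "/gp/product/B000EXAMPLE/ref=sr_1_1"  # hypothetical href from '::attr(href)'

# An absolute-path href replaces the base URL's path and query entirely
product_url = urljoin(listing_url, relative_href)
print(product_url)  # https://www.amazon.co.uk/gp/product/B000EXAMPLE/ref=sr_1_1
```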
【Question discussion】:
【Answer 1】: Be careful, Amazon can detect crawlers and will block you.
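Given that warning, it is worth throttling the spider before running anything. A conservative sketch of Scrapy project settings (the values and the contact string are illustrative, not tuned for Amazon):

```python
# settings.py -- illustrative values for polite crawling
AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed server latency
DOWNLOAD_DELAY = 2.0                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain
ROBOTSTXT_OBEY = True                 # respect the site's robots.txt
USER_AGENT = "learning-bot (contact@example.com)"  # hypothetical identifier
```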
import scrapy

from ..items import AmazonscrapyItem  # adjust to your project's items module


class AmazonSpiderSpider(scrapy.Spider):
    name = "amazon_spider"

    def start_requests(self):
        page_links = ['https://www.amazon.co.uk/s?me=A1NZU6VUR85CVU&marketplaceID=A1F83G8C2ARO7P', ]
        pages = 2
        page_domain = "https://www.amazon.co.uk/s?i=merchant-items&me=A1NZU6VUR85CVU&page=2&marketplaceID=A1F83G8C2ARO7P&qid=1584935116&ref=sr_pg_"
        while pages != 4:  # 3 is the last page, so stop building links when pages reaches 4
            link = page_domain + str(pages)
            page_links.append(link)
            pages += 1
        # request all the pages
        for page in page_links:
            yield scrapy.Request(url=page, callback=self.parse)

    def parse(self, response):
        # scrape all product links
        domain = "https://www.amazon.co.uk"
        link_products = response.xpath('//h2/a/@href').extract()
        for link in link_products:
            product_link = domain + link
            yield scrapy.Request(url=product_link, callback=self.parse_contents)

    def parse_contents(self, response):
        # scrape the needed information
        productlink = response.url
        product = response.xpath('//span[@id="productTitle"]/text()').extract()[0].strip()
        price = response.xpath('//span[@id="priceblock_ourprice"]/text()').extract()[0]
        # try/except because Amazon does not use a fixed element for the brand
        try:
            brand = response.xpath('//a[@id="bylineInfo"]/text()').extract()[0]
        except IndexError:
            brand = response.xpath('//a[@id="brand"]/text()').extract()[0]
        items = AmazonscrapyItem()
        items['product'] = product
        items['price'] = price
        items['brand'] = brand
        # items['asin'] = asin  # not sure what you are trying to crawl here, sorry
        items['productlink'] = productlink
        yield items
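The page-link construction in `start_requests` can be pulled out into a small pure function, which keeps the counter logic easy to test in isolation (the helper name and example URLs are illustrative):

```python
def build_page_links(first_url, numbered_prefix, last_page):
    """Return the first listing URL plus one URL per numbered page 2..last_page."""
    links = [first_url]
    for page in range(2, last_page + 1):
        links.append(numbered_prefix + str(page))
    return links

links = build_page_links(
    "https://example.com/s?me=SELLER",
    "https://example.com/s?me=SELLER&ref=sr_pg_",
    3,
)
print(links[-1])  # https://example.com/s?me=SELLER&ref=sr_pg_3
```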
【Discussion】:
What I want is: once we get the product links, I want to open each of them and extract data from the table that holds further product details, such as the weight and rank for each product link. @bonifacio_kid I would create a new function in which the product link is requested and the data from it is saved together with those 5 fields.