Scraping e-commerce with Scrapy

Posted: 2020-07-02 20:08:55

Question:

I am using Scrapy to scrape the Amazon site, purely for learning. When you shop by category you get a list of products, and clicking a product opens its detail page. I have finished the basic part: scraping details such as the product name, price, and link from the product list. But I want those scraped links to be used right away, so that each product's detail page is also scraped within the same program.

import scrapy
from ..items import AmazonscrapyItem  # assumes the default Scrapy project layout


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = [
        'https://www.amazon.co.uk/s?me=A1NZU6VUR85CVU&marketplaceID=A1F83G8C2ARO7P'
    ]

    def parse(self, response):
        items = AmazonscrapyItem()
        all_div_quotes = response.css('body')
        for quotes in all_div_quotes:
            product = quotes.css('.a-color-base.a-text-normal').css('::text').extract()
            price = quotes.css('.a-offscreen').css('::text').extract()
            brand = quotes.css('.s-image::attr(src)').extract()
            asin = quotes.css(
                '.sg-col-20-of-24.s-result-item.sg-col-0-of-12.sg-col-28-of-32.sg-col-16-of-20.sg-col.sg-col-32-of-36.sg-col-12-of-16.sg-col-24-of-28::attr(data-asin)').extract()
            productlink = quotes.css('.a-link-normal.a-text-normal').css('::attr(href)').extract()

            items['product'] = product
            items['price'] = price
            items['brand'] = brand
            items['asin'] = asin
            items['productlink'] = productlink

            yield items

        next_page_link = response.css('.a-last a::attr(href)').extract_first()
        if next_page_link:  # extract_first() returns None on the last page
            yield scrapy.Request(url=response.urljoin(next_page_link), callback=self.parse)

Comments:

Answer 1:

Careful: Amazon can detect crawlers, and it will block you.
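One way to reduce the chance of being blocked is Scrapy's built-in throttling. This is a sketch of a `settings.py` fragment, not part of the original answer; the values are illustrative and worth tuning per site:

```python
# settings.py fragment (illustrative values, not from the original answer)
DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x)
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep concurrency low to look less like a bot
ROBOTSTXT_OBEY = True
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
```

Even with these settings, heavy scraping of Amazon can still trigger CAPTCHAs or blocks.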

import scrapy
from ..items import AmazonscrapyItem  # assumes the default Scrapy project layout


class AmazonSpiderSpider(scrapy.Spider):
    name = "amazon_spider"

    def start_requests(self):
        page_links = ['https://www.amazon.co.uk/s?me=A1NZU6VUR85CVU&marketplaceID=A1F83G8C2ARO7P']
        page = 2
        page_domain = "https://www.amazon.co.uk/s?i=merchant-items&me=A1NZU6VUR85CVU&page=2&marketplaceID=A1F83G8C2ARO7P&qid=1584935116&ref=sr_pg_"
        while page != 4:  # 3 is the last page, so stop building links once page reaches 4
            link = page_domain + str(page)
            page_links.append(link)
            page += 1

        # request all the pages
        for link in page_links:
            yield scrapy.Request(url=link, callback=self.parse)


    def parse(self, response):
        #scraped all product links
        domain = "https://www.amazon.co.uk"
        link_products = response.xpath('//h2/a/@href').extract()
        for link in link_products:
            product_link = domain + link
            yield scrapy.Request(url=product_link, callback=self.parse_contents)


    def parse_contents(self, response):
        #scrape needed information
        productlink = response.url
        product = response.xpath('//span[@id="productTitle"]/text()').extract()[0].strip()
        price = response.xpath('//span[@id="priceblock_ourprice"]/text()').extract()[0]

        # try/except because Amazon does not use one fixed selector for the brand
        try:
            brand = response.xpath('//a[@id="bylineInfo"]/text()').extract()[0]
        except IndexError:
            brand = response.xpath('//a[@id="brand"]/text()').extract()[0]

        items = AmazonscrapyItem()
        items['product'] = product
        items['price'] = price
        items['brand'] = brand
        # items['asin'] = asin  # not sure what you are trying to crawl here, sorry
        items['productlink'] = productlink

        yield items
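A side note on the answer's `product_link = domain + link`: product hrefs on the result page are usually relative, but plain string concatenation produces a malformed URL if the href is ever absolute. `urllib.parse.urljoin` (or `response.urljoin`, which the question's code already uses) handles both cases. A small stdlib sketch:

```python
from urllib.parse import urljoin

BASE = "https://www.amazon.co.uk"

# A relative href resolves against the base...
print(urljoin(BASE, "/gp/product/B000000000"))
# -> https://www.amazon.co.uk/gp/product/B000000000

# ...and an absolute href passes through unchanged, where
# BASE + href would have glued two full URLs together.
print(urljoin(BASE, "https://www.amazon.co.uk/dp/B000000001"))
# -> https://www.amazon.co.uk/dp/B000000001
```

Inside a spider, `yield response.follow(link, ...)` does this resolution automatically.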

Comments:

What I want is: once we have the product links, open each one and extract data from the table that holds further product details, such as weight and rank, for every product link. @bonifacio_kid I want to create a new function that requests the product link and saves the data from it together with those 5 fields.
