Passing requests with Scrapy

Posted 2023-02-23

[Title]: Passing requests with Scrapy [Posted]: 2022-01-22 20:26:49 [Question]:

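I am trying to pass requests with Scrapy based on the brand number in the URL, extract the IDs from the response (which also carries the next-page information), and then iterate over the following pages to collect the product IDs.

I can send the request and parse the product data, but I am not sure how to define the function that picks up the cursor for the next page.

Here is my code: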

class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())
    ID = Field(output_processor=TakeFirst())
    brand = Field(output_processor=TakeFirst())

class DepopSpider(scrapy.Spider):
    name = 'depop'
    start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']

    brands = [1596]

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }
    
    def start_requests(self, cursor=''):
        for brand in self.brands:
            for item in self.create_product_request(brand):
                yield item
    
        yield scrapy.FormRequest(
            url='https://webapi.depop.com/api/v2/search/products/',
            method='GET',
            formdata={
                'brands': str(brand),
                'cursor': cursor,
                'itemsPerPage': '24',
                'country': 'gb',
                'currency': 'GBP',
                'sort': 'relevance'
            },
            cb_kwargs={'brand': brand}
        )

    def parse(self, response, brand):

        # load stuff
        for item in response.json().get('products'):
            loader = ItemLoader(DepopItem())
            loader.add_value('brand', brand)
            loader.add_value('ID', item.get('id'))
            
            yield loader.load_item()

        cursor = response.json()['meta'].get('cursor')
        if cursor:
            for item in self.create_product_request(brand, cursor):
                yield item

    def create_product_request(self, response):
        test = response.json()['meta'].get('cursor')
        yield test

I get the following error:

AttributeError: 'int' object has no attribute 'json'

Expected output:

{"brand": 1596, "ID": 273027529}
{"brand": 1596, "ID": 274115361}
{"brand": 1596, "ID": 270641301}
{"brand": 1596, "ID": 274505678}
{"brand": 1596, "ID": 262857014}
{"brand": 1596, "ID": 270088589}
{"brand": 1596, "ID": 208498028}
{"brand": 1596, "ID": 270426792}
{"brand": 1596, "ID": 274483351}
{"brand": 1596, "ID": 274109923}
{"brand": 1596, "ID": 273424157}
..
..
..

[Answer 1]:

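start_requests runs before any request has been sent, so there is no response there to read a cursor from. In your code, create_product_request(brand) receives the brand number (an int) where it expects a response, which is why the call to .json() fails.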

You can handle the pagination recursively.

import scrapy
from scrapy.loader import ItemLoader
from scrapy import Field
# scrapy.loader.processors is deprecated in newer Scrapy releases
from itemloaders.processors import TakeFirst


class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())
    ID = Field(output_processor=TakeFirst())
    brand = Field(output_processor=TakeFirst())


class DepopSpider(scrapy.Spider):
    name = 'depop'

    start_urls = ['https://webapi.depop.com/api/v2/search/products/']

    brands = [1596]

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }

    def parse(self, response):
        json_data = response.json()

        # pagination
        cursor = json_data['meta']['cursor']
        if json_data['meta']['hasMore']:
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={'cursor': cursor}
            )

        for brand in self.brands:
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={
                    'brands': str(brand),
                    'cursor': cursor,
                    'itemsPerPage': '24',
                    'country': 'gb',
                    'currency': 'GBP',
                    'sort': 'relevance'
                },
                cb_kwargs={'brand': brand},
                callback=self.parse_brand
            )

    def parse_brand(self, response, brand):
        # load stuff
        for item in response.json().get('products'):
            loader = ItemLoader(DepopItem())
            loader.add_value('brand', brand)
            loader.add_value('ID', item.get('id'))
            yield loader.load_item()

Output:

{'ID': 245137362, 'brand': 1596}
{'ID': 244263081, 'brand': 1596}
{'ID': 242128472, 'brand': 1596}
{'ID': 239929000, 'brand': 1596}
...
...
...
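
To make the recursive idea easier to see in isolation, here is a minimal sketch that leaves Scrapy out entirely. The fields 'products', 'meta', 'cursor' and 'hasMore' are the ones the spider above reads; FAKE_PAGES, fetch_page and iter_products are hypothetical names, and the IDs and cursor values are made up for illustration.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for the API: maps a cursor to one page of results.
# The IDs and cursor strings are invented; only the field names come from
# the spider above.
FAKE_PAGES: Dict[str, dict] = {
    '': {'products': [{'id': 1}, {'id': 2}],
         'meta': {'cursor': 'abc', 'hasMore': True}},
    'abc': {'products': [{'id': 3}],
            'meta': {'cursor': 'def', 'hasMore': True}},
    'def': {'products': [{'id': 4}],
            'meta': {'cursor': None, 'hasMore': False}},
}


def fetch_page(cursor: str) -> dict:
    """Pretend to request one page; a real spider would yield a Request here."""
    return FAKE_PAGES[cursor]


def iter_products(brand: int, cursor: str = '') -> Iterator[dict]:
    """Yield every product for a brand, following cursors page by page."""
    page = fetch_page(cursor)
    for product in page.get('products', []):
        yield {'brand': brand, 'ID': product.get('id')}

    meta = page.get('meta', {})
    if meta.get('hasMore') and meta.get('cursor'):
        # Recursive step: call the same function with the next cursor.
        yield from iter_products(brand, meta['cursor'])


items: List[dict] = list(iter_products(1596))
print(items)
```

In a spider the same shape appears as a callback that yields its items and then yields one more request for the next cursor, with itself as the callback.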

By the way, use rotating proxies or something similar: I got blocked for 10 minutes with "Too Many Requests".
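
Slowing the crawl down may also help before reaching for proxies. The following is only a sketch of throttling settings: the setting names are standard Scrapy settings, but POLITE_SETTINGS is a hypothetical name and every value is a guess that would need tuning against what the site tolerates.

```python
# Hypothetical values; tune them to what the site actually tolerates.
POLITE_SETTINGS = {
    'DOWNLOAD_DELAY': 2,                  # base delay in seconds between requests
    'RANDOMIZE_DOWNLOAD_DELAY': True,     # wait 0.5x to 1.5x of DOWNLOAD_DELAY
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to observed latency
    'AUTOTHROTTLE_START_DELAY': 2,        # initial AutoThrottle delay in seconds
    'AUTOTHROTTLE_MAX_DELAY': 30,         # upper bound for the adaptive delay
}

print(POLITE_SETTINGS)
```

These keys could be merged into the spider's custom_settings next to 'USER_AGENT'.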

[Discussion]:

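This works like a charm! I also managed to get it working by defining def create_product_request(self, brand, cursor=''): above the yield scrapy.FormRequest. I have no experience with rotating proxies; do you have an example or a link? For now I have just set DOWNLOAD_DELAY in the custom settings. There is a middleware for this.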
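
The helper mentioned in the comment is not shown in the thread, so the following is a guess at its core rather than the commenter's actual code: a function that builds the search URL for one brand and one cursor. The query parameters are the ones used in the answer; build_search_url and the cursor value 'abc123' are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://webapi.depop.com/api/v2/search/products/'


def build_search_url(brand: int, cursor: str = '') -> str:
    """Return the search URL for one brand and one page (cursor)."""
    params = {
        'brands': str(brand),
        'itemsPerPage': '24',
        'country': 'gb',
        'currency': 'GBP',
        'sort': 'relevance',
    }
    if cursor:
        # Only send the cursor once the API has handed one back.
        params['cursor'] = cursor
    return SEARCH_URL + '?' + urlencode(params)


first_page = build_search_url(1596)
next_page = build_search_url(1596, cursor='abc123')  # made-up cursor value
print(first_page)
print(next_page)
```

Inside the spider, create_product_request(self, brand, cursor='') could then yield scrapy.Request(build_search_url(brand, cursor), callback=self.parse, cb_kwargs={'brand': brand}), and parse could call it again with the cursor read from response.json()['meta'].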
