Passing requests with Scrapy
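【Posted】: 2022-01-22 20:26:49
【Question】:
I am trying to pass requests with scrapy based on the brand number in the URL, then extract the IDs from the page that provides the next-page information, and then iterate through the following pages to collect the product IDs.
I can pass the request, parse the product data, and send it on, but I am not sure how to define a function that lets me grab the cursor for the next page.
Here is my code: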
class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())
    ID = Field(output_processor=TakeFirst())
    brand = Field(output_processor=TakeFirst())

class DepopSpider(scrapy.Spider):
    name = 'depop'
    start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']
    brands = [1596]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }

    def start_requests(self, cursor=''):
        for brand in self.brands:
            for item in self.create_product_request(brand):
                yield item
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={
                    'brands': str(brand),
                    'cursor': cursor,
                    'itemsPerPage': '24',
                    'country': 'gb',
                    'currency': 'GBP',
                    'sort': 'relevance'
                },
                cb_kwargs={'brand': brand}
            )

    def parse(self, response, brand):
        # load stuff
        for item in response.json().get('products'):
            loader = ItemLoader(DepopItem())
            loader.add_value('brand', brand)
            loader.add_value('ID', item.get('id'))
            yield loader.load_item()
        cursor = response.json()['meta'].get('cursor')
        if cursor:
            for item in self.create_product_request(brand, cursor):
                yield item

    def create_product_request(self, response):
        test = response.json()['meta'].get('cursor')
        yield test
I get the following error:
AttributeError: 'int' object has no attribute 'json'
Expected output:
{"brand": 1596, "ID": 273027529}
{"brand": 1596, "ID": 274115361}
{"brand": 1596, "ID": 270641301}
{"brand": 1596, "ID": 274505678}
{"brand": 1596, "ID": 262857014}
{"brand": 1596, "ID": 270088589}
{"brand": 1596, "ID": 208498028}
{"brand": 1596, "ID": 270426792}
{"brand": 1596, "ID": 274483351}
{"brand": 1596, "ID": 274109923}
{"brand": 1596, "ID": 273424157}
..
..
..
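【Answer 1】:
start_requests runs before any request has been made, so there is no response for it to read a cursor from. In the question's code, create_product_request(brand) is called with the brand number (an int) where the method expects a response, which is what raises the AttributeError.
You can handle the pagination recursively instead: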
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
# scrapy.loader.processors is deprecated in recent Scrapy releases;
# the processors now live in the itemloaders package.
from itemloaders.processors import TakeFirst


class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())
    ID = Field(output_processor=TakeFirst())
    brand = Field(output_processor=TakeFirst())


class DepopSpider(scrapy.Spider):
    name = 'depop'
    start_urls = ['https://webapi.depop.com/api/v2/search/products/']
    brands = [1596]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }

    def parse(self, response):
        json_data = response.json()

        # pagination: request the next page and let parse() handle it again
        cursor = json_data['meta']['cursor']
        if json_data['meta']['hasMore']:
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={'cursor': cursor}
            )

        # one product request per brand, using the cursor of the current page
        for brand in self.brands:
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={
                    'brands': str(brand),
                    'cursor': cursor,
                    'itemsPerPage': '24',
                    'country': 'gb',
                    'currency': 'GBP',
                    'sort': 'relevance'
                },
                cb_kwargs={'brand': brand},
                callback=self.parse_brand
            )

    def parse_brand(self, response, brand):
        # load one item per product in the response
        for item in response.json().get('products'):
            loader = ItemLoader(DepopItem())
            loader.add_value('brand', brand)
            loader.add_value('ID', item.get('id'))
            yield loader.load_item()
Output:
{'ID': 245137362, 'brand': 1596}
{'ID': 244263081, 'brand': 1596}
{'ID': 242128472, 'brand': 1596}
{'ID': 239929000, 'brand': 1596}
...
...
...
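The cursor-following logic can be tried offline before pointing it at the API. The following is a minimal sketch, assuming each response carries a `meta` object with `cursor` and `hasMore` keys, as the spider above reads them. `FAKE_PAGES`, `fetch_page`, and `iter_product_ids` are hypothetical names, and the canned pages are made up for illustration.

```python
from typing import Dict, Iterator

# Hypothetical canned responses keyed by cursor; '' stands for the first page.
FAKE_PAGES: Dict[str, dict] = {
    '': {'products': [{'id': 1}, {'id': 2}],
         'meta': {'cursor': 'abc', 'hasMore': True}},
    'abc': {'products': [{'id': 3}],
            'meta': {'cursor': 'def', 'hasMore': True}},
    'def': {'products': [{'id': 4}],
            'meta': {'hasMore': False}},
}


def fetch_page(cursor: str) -> dict:
    """Stand-in for the HTTP request: return the page stored under `cursor`."""
    return FAKE_PAGES[cursor]


def iter_product_ids(brand: int, cursor: str = '') -> Iterator[dict]:
    """Yield one item per product, following the cursor while hasMore is true."""
    page = fetch_page(cursor)
    for product in page.get('products', []):
        yield {'brand': brand, 'ID': product.get('id')}

    meta = page.get('meta', {})
    next_cursor = meta.get('cursor')
    if meta.get('hasMore') and next_cursor:
        # Recurse with the cursor taken from this response, much as a Scrapy
        # callback would yield a new request that points back at itself.
        yield from iter_product_ids(brand, next_cursor)


items = list(iter_product_ids(1596))
```

In a spider, the recursive call would become a yielded request whose callback is the same method, with `brand` carried along in `cb_kwargs`.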
By the way, use rotating proxies or something similar: I got blocked for 10 minutes for "too many requests".
【Comments】:
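This works like a charm! I also managed to get it working by defining def create_product_request(self, brand, cursor=''): above the yield scrapy.FormRequest. I have no experience with rotating proxies; do you have an example or a link? For now I have just set DOWNLOAD_DELAY in the custom settings.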
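A helper along the lines this comment describes could look like the sketch below. It only builds the request URL, so it runs without Scrapy; in a spider the result would be passed to scrapy.Request together with cb_kwargs={'brand': brand}. The names `build_product_url` and `CUSTOM_SETTINGS` and the 2-second delay are assumptions for illustration; DOWNLOAD_DELAY itself is a standard Scrapy setting.

```python
from urllib.parse import urlencode

API_URL = 'https://webapi.depop.com/api/v2/search/products/'


def build_product_url(brand: int, cursor: str = '') -> str:
    """Build the product-search URL for one brand and one page cursor."""
    params = {
        'brands': str(brand),
        'cursor': cursor,  # empty string for the first page, as in the question
        'itemsPerPage': '24',
        'country': 'gb',
        'currency': 'GBP',
        'sort': 'relevance',
    }
    return f'{API_URL}?{urlencode(params)}'


# Settings a spider could place in custom_settings; the value is illustrative.
CUSTOM_SETTINGS = {
    'DOWNLOAD_DELAY': 2,  # seconds to wait between requests to the same site
}

first_page = build_product_url(1596)
next_page = build_product_url(1596, cursor='abc')
```

Giving `cursor` a default of `''` lets the same helper serve both the initial request and every follow-up request issued from the callback.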
There is a middleware for this.
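For readers curious about the shape of such a middleware, here is a minimal sketch of a rotating-proxy downloader middleware, assuming a hand-maintained proxy list. It is not the middleware the commenter linked to. The class name `RotatingProxyMiddleware`, the `PROXY_LIST` setting, and the proxy URLs are hypothetical; setting `request.meta['proxy']` is the mechanism Scrapy's built-in HttpProxyMiddleware reads.

```python
from itertools import cycle


class RotatingProxyMiddleware:
    """Assign the next proxy from a fixed list to every outgoing request."""

    def __init__(self, proxies):
        self._proxies = cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, not a Scrapy built-in.
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        request.meta['proxy'] = next(self._proxies)
        return None  # let Scrapy continue processing the request


# Offline check with a stand-in request object; a real scrapy.Request
# also exposes a .meta dict.
class FakeRequest:
    def __init__(self):
        self.meta = {}


middleware = RotatingProxyMiddleware([
    'http://proxy-a.example:8080',  # placeholder addresses
    'http://proxy-b.example:8080',
])
fake_requests = [FakeRequest() for _ in range(3)]
for req in fake_requests:
    middleware.process_request(req, spider=None)
assigned = [req.meta['proxy'] for req in fake_requests]
```

To enable a class like this, it would be registered under DOWNLOADER_MIDDLEWARES in the project settings, with a priority that runs it before the built-in HttpProxyMiddleware; the module path depends on the project layout.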