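Get values within keys with item loader scrapy
Posted: 2022-01-22 07:30:56
Question: I'm trying to extract some values from within the keys of a web page's JSON response. Unfortunately, when I do so, only the keys are returned and I can't seem to get the values. Each key belongs to a long, numbered list, and I can't figure out how to grab the values for all of the keys.
For example, here is my working code: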
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst


class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())


class DepopSpider(scrapy.Spider):
    name = 'depop'
    allowed_domains = ["depop.com"]
    start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
            )

    def parse(self, response):
        resp = response.json()['brands']
        for item in resp:
            loader = ItemLoader(DepopItem(), selector=item)
            loader.add_value('brands', item)
            yield loader.load_item()
This returns a list of keys:
{"brands": "1"}
{"brands": "2"}
{"brands": "3"}
{"brands": "4"}
{"brands": "5"}
{"brands": "7"}
{"brands": "9"}
Instead, I want the values that correspond to those keys:
{"brands": 946}
{"brands": 2376}
{"brands": 1286}
{"brands": 2774}
{"brands": 489}
{"brands": 11572}
{"brands": 1212}
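Answer 1: Use values() or resp[item]. Iterating over a dict directly yields only its keys, so either iterate over the values or look each key up in the dict.
Example: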
import scrapy
from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst


class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())


class DepopSpider(scrapy.Spider):
    name = 'depop'
    allowed_domains = ["depop.com"]
    start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }

    def parse(self, response):
        resp = response.json()['brands']
        # Iterate over the dict's values rather than its keys
        for item in resp.values():
            loader = ItemLoader(DepopItem(), selector=item)
            # Each value is a dict; its 'count' entry holds the number
            loader.add_value('brands', item['count'])
            yield loader.load_item()
Output:
{'brands': 888}
{'brands': 1}
{'brands': 52}
{'brands': 138}
{'brands': 148}
...
...
...
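The behaviour behind this answer is plain Python: looping over a dict walks its keys. Below is a minimal sketch that can be run without Scrapy or network access. It assumes the `brands` object maps numbered string keys to dicts with a `count` field, as the answer's `item['count']` suggests; the sample data is made up to mirror the numbers in the question and is not fetched from the API.

```python
# Hypothetical sample shaped like response.json()['brands']
brands = {
    "1": {"count": 946},
    "2": {"count": 2376},
    "3": {"count": 1286},
}

# Iterating the dict directly yields only the keys
keys = [item for item in brands]

# Option 1: iterate over the values
counts_from_values = [value["count"] for value in brands.values()]

# Option 2: keep iterating over keys, but index back into the dict
counts_from_lookup = [brands[key]["count"] for key in brands]

print(keys)                # ['1', '2', '3']
print(counts_from_values)  # [946, 2376, 1286]
print(counts_from_lookup)  # [946, 2376, 1286]
```

Both options give the same result; `values()` is the shorter choice when the key itself is not needed.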
Comments:
Ah, so simple! I never would have gotten it, though. Thanks!
Answer 2: I'm not sure how to do it in scrapy, but you can do it like this:
import requests
import json
from itertools import starmap
from requests.models import Response
from typing import Dict, List

url = "https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance"
resp: Response = requests.get(url)
data: Dict = json.loads(resp.text).get("brands")
# Build one {"brands": count} dict per (key, value) pair
values: List[Dict] = list(starmap(lambda k, v: {"brands": v["count"]}, data.items()))
Output:
[{'brands': 989},
 {'brands': 1838},
 {'brands': 2415},
 {'brands': 1344},
 ...]
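For readers unfamiliar with `starmap`, here is a small offline sketch of what that line does. It assumes the same response shape as above, and the `data` dict is a hypothetical stand-in for the `brands` object, not real API output. `starmap` unpacks each `(key, value)` pair from `items()` into the lambda's two arguments. Since the key is never used, a list comprehension over `values()` should give the same list.

```python
from itertools import starmap

# Hypothetical sample standing in for the "brands" object of the JSON response
data = {"1": {"count": 989}, "2": {"count": 1838}}

# starmap unpacks each (key, value) pair from items() into the lambda's arguments
with_starmap = list(starmap(lambda k, v: {"brands": v["count"]}, data.items()))

# Equivalent list comprehension; the key is unused, so values() is enough
with_comprehension = [{"brands": v["count"]} for v in data.values()]

print(with_starmap == with_comprehension)  # True
```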
Comments:
I know this approach; it is what I'm currently doing, but I specifically wanted to use item loaders to improve my skills. Thanks for trying!