Scrapy - Request Payload format and types

Posted: 2019-11-10 18:14:02

Question:

This is the starting point of my scraping process.

https://www.storiaimoveis.com.br/alugar/brasil

This is the AJAX call, which returns the data for each page as JSON.

https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e

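My POST request fails with a 404 error. Requests that need a payload have given me trouble in the past. I have always worked around the problem somehow, but now I am trying to understand what I am doing wrong with them.

My questions are:

Do request payloads sent with Scrapy requests need to be of a specific type or format?
Do I need to call json.dumps(payload) before sending, or can I send them as a dict?
Do I need to convert each key-value pair to a string before sending the payload?
Is there any other reason my request might be failing?

Here is the relevant part of my code.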

import json

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    name = 'myspider'

    start_urls = [
        'https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e'
    ]

    page = 1
    # Search filters sent as the JSON request body
    payload = {
        "locations": [{
            "geo": {
                "top_left": {"lat": 5.2717863, "lon": -73.982817},
                "bottom_right": {"lat": -34.0891, "lon": -28.650543},
            },
            "placeId": "ChIJzyjM68dZnAARYz4p8gYVWik",
            "keywords": "Brasil",
            "address": {"label": "Brasil", "country": "BR"},
        }],
        "operation": ["RENT"],
        "bathrooms": [],
        "bedrooms": [],
        "garage": [],
        "features": [],
    }
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Referer': 'https://www.storiaimoveis.com.br/alugar/brasil',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }


    def parse(self, response):
        for url in self.start_urls:
            yield scrapy.Request(url=url,
                                 method='POST',
                                 headers=self.headers,
                                 body=json.dumps(self.payload),
                                 callback=self.parse_items)

    def parse_items(self, response):
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        print(response.text)

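Comments:

Try to explain the steps for manually creating a search starting from the initial URL, and how you tried to build the URL for use in your script.

Answer 1: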

Yes, you need to call json.dumps(payload), because the request body must be a str or unicode, not a dict, as stated in the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html#request-objects
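As a rough illustration of the point above, the sketch below serializes a trimmed-down copy of the question's payload using only the standard library. The shortened payload is an assumption made for brevity; the resulting string is the kind of value that would be passed as body= to scrapy.Request. It also suggests that individual values do not need to be converted to strings first, since JSON keeps numbers and lists as they are.

```python
import json

# Trimmed-down copy of the payload from the question (illustrative subset only).
payload = {
    "locations": [{
        "geo": {"top_left": {"lat": 5.2717863, "lon": -73.982817}},
        "keywords": "Brasil",
    }],
    "operation": ["RENT"],
    "bathrooms": [],
}

# json.dumps turns the dict into a single JSON string. Numbers and lists
# keep their JSON types, so no per-value str() conversion is needed.
body = json.dumps(payload)

print(type(body).__name__)
# Decoding the string gives back an equal dict, so nothing was lost.
print(json.loads(body) == payload)
```

Passing the dict itself instead of this string is the mistake the answer warns about.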

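However, in your case the request fails because the following 2 headers are missing: Content-Type and Referer.

To get the right request headers, I usually do this:

    Inspect the headers in Chrome DevTools.

    Make the request with curl or Postman until I find the right headers. In this case, Content-Type and Referer appear to be enough to get an HTTP 200 response status.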
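The same trial-and-error step can also be sketched in Python instead of curl or Postman. The example below is a minimal sketch using the standard library's urllib rather than Scrapy; build_post is a hypothetical helper, the payload is cut down to one key, and API_URL leaves out the long query string from the question for brevity. It only builds the request, so it runs without network access.

```python
import json
import urllib.request

# Assumption: query string from the question omitted here for brevity.
API_URL = "https://www.storiaimoveis.com.br/api/search"


def build_post(url, payload, headers):
    """Build (but do not send) a JSON POST request with the given headers."""
    data = json.dumps(payload).encode("utf-8")  # urllib expects bytes
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# The two headers the answer found to be sufficient.
headers = {
    "Content-Type": "application/json",
    "Referer": "https://www.storiaimoveis.com.br/alugar/brasil",
}
request = build_post(API_URL, {"operation": ["RENT"]}, headers)

print(request.get_method())
print(request.data)

# To actually send it (needs network; urlopen raises HTTPError on 4xx/5xx):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Dropping one header at a time from the dict and re-sending is one way to find the smallest set the server accepts, which can then be copied into the spider's headers.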
