XHR request fails in Scrapy but works in python-requests

【Title】: XHR request fails in Scrapy but works in python-requests 【Posted】: 2016-10-27 03:13:00 【Question】:

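I am trying to fetch data from a site that loads it via Ajax. I simulate the XHR request with the same headers and body, but I receive a 400 response telling me the request is not allowed. Here is my code: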

from scrapy import Spider
from scrapy import Request, FormRequest
import json

class jsonSpider(Spider):
    name = 'json'

    start_urls = [
        'http://m.ctrip.com/restapi/soa2/10932/hotel/Product/domestichotelget']

    def start_requests(self):
        headers = {
            "Host": "m.ctrip.com",
            "User-Agent": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16",
            "Accept": "application/json",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Content-Type": "application/json",
            "cookieOrigin": "http://wap.ctrip.com",
            "Cache-Control": "no-cache",
            "Referer": "http://wap.ctrip.com/webapp/hotel/hoteldetail/426638.html?days=1&atime=20160623&contrl=2&num=1&biz=1",
            "Content-Length": "455",
            "Origin": "http://wap.ctrip.com",
            "Connection": "keep-alive"
        }
        data = '{"biz":1,"contrl":3,"facility":0,"faclist":[],"key":"","keytp":0,"pay":0,"querys":[],"couponlist":[],"setInfo":{"cityId":2,"dstId":0,"inDay":"2016-06-24","outDay":"2016-06-25"},"sort":{"dir":1,"idx":70,"ordby":0,"size":100},"qbitmap":0,"alliance":{"ishybrid":0},"head":{"ctok":"","cver":"1.0","lang":"01","sid":"8888","syscode":"09","auth":null,"extension":[{"name":"pageid","value":"212093"},{"name":"webp","value":0},{"name":"protocal","value":"http"}]},"contentType":"json"}'
        for url in self.start_urls:
            yield Request(
                url,
                self.parse,
                method='POST',
                headers=headers,
                body=data
            )

    def parse(self, response):
        page = response.body
        print(page)

But when I simulate the same XHR with python-requests, it works and returns the JSON response. Here is the code using requests:

import requests

url = 'http://m.ctrip.com/restapi/soa2/10932/hotel/Product/domestichotelget'
headers = {
    "Host": "m.ctrip.com",
    "User-Agent": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16",
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Content-Type": "application/json",
    "cookieOrigin": "http://wap.ctrip.com",
    "Cache-Control": "no-cache",
    "Referer": "http://wap.ctrip.com/webapp/hotel/hoteldetail/426638.html?days=1&atime=20160623&contrl=2&num=1&biz=1",
    "Content-Length": "455",
    "Origin": "http://wap.ctrip.com",
    "Connection": "keep-alive"
}
body = '{"biz":1,"contrl":3,"facility":0,"faclist":[],"key":"","keytp":0,"pay":0,"querys":[],"couponlist":[],"setInfo":{"cityId":2,"dstId":0,"inDay":"2016-06-24","outDay":"2016-06-25"},"sort":{"dir":1,"idx":70,"ordby":0,"size":100},"qbitmap":0,"alliance":{"ishybrid":0},"head":{"ctok":"","cver":"1.0","lang":"01","sid":"8888","syscode":"09","auth":null,"extension":[{"name":"pageid","value":"212093"},{"name":"webp","value":0},{"name":"protocal","value":"http"}]},"contentType":"json"}'


response = requests.post(url, headers=headers, data=body).content
print(response)

What is wrong with my Scrapy code?

【Answer 1】:

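Remove "Content-Length": "455" from the headers and let Scrapy compute it. Your data is 477 bytes long, so my guess is that the server reads only the first 455 bytes of the incoming data, fails to parse them as JSON because the document is incomplete, and returns 400, which means Bad Request.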
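The length mismatch described in this answer can be checked offline. The sketch below is only an illustration: it assumes the question's body with its JSON braces in place, and it assumes (as the answer itself only guesses) that the server stops reading after the declared number of bytes. The helper name `try_parse` is hypothetical.

```python
import json

# Request body from the question (assumption: JSON braces as reconstructed here).
body = '{"biz":1,"contrl":3,"facility":0,"faclist":[],"key":"","keytp":0,"pay":0,"querys":[],"couponlist":[],"setInfo":{"cityId":2,"dstId":0,"inDay":"2016-06-24","outDay":"2016-06-25"},"sort":{"dir":1,"idx":70,"ordby":0,"size":100},"qbitmap":0,"alliance":{"ishybrid":0},"head":{"ctok":"","cver":"1.0","lang":"01","sid":"8888","syscode":"09","auth":null,"extension":[{"name":"pageid","value":"212093"},{"name":"webp","value":0},{"name":"protocal","value":"http"}]},"contentType":"json"}'

declared_length = 455  # value hard-coded in the question's Content-Length header
actual_length = len(body.encode("utf-8"))  # size of the body actually sent


def try_parse(payload):
    """Return the parsed JSON value, or None if the payload is not valid JSON."""
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


full = try_parse(body)
# Simulate a server that honours the declared length and ignores the rest.
truncated = try_parse(body[:declared_length])

print("declared:", declared_length, "actual:", actual_length)
print("full body parses:", full is not None)
print("truncated body parses:", truncated is not None)
```

If the declared and actual lengths differ, the truncated prefix is missing the closing part of the JSON document, which is consistent with the 400 response the asker sees.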

【Comments】:

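Could you explain why the same headers work with python-requests but not with Scrapy?

Because python-requests apparently recalculates the size and overwrites the existing Content-Length in the headers, whereas Scrapy simply adds a second Content-Length. This is what Wireshark captures when the Scrapy request is sent: Content-Length: 477\r\n Origin: http://wap.ctrip.com\r\nContent-Length: 455\r\n

【Answer 2】: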

This should work for you; the following code returns a 200 response:

import json

from scrapy import Request, Spider


class jsonSpider(Spider):
    name = 'json_spider'

    start_urls = [
        'http://m.ctrip.com/restapi/soa2/10932/hotel/Product/domestichotelget']

    def start_requests(self):
        # No Content-Length here: it is computed from the body when the request is sent.
        headers = {
            "Accept": "application/json",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive"
        }
        # Build the payload as a Python dict and serialize it with json.dumps below.
        data = {"biz":1,"contrl":3,"facility":0,"faclist":[],"key":"","keytp":0,"pay":0,"querys":[],"couponlist":[],"setInfo":{"cityId":2,"dstId":0,"inDay":"2016-06-24","outDay":"2016-06-25"},"sort":{"dir":1,"idx":70,"ordby":0,"size":100},"qbitmap":0,"alliance":{"ishybrid":0},"head":{"ctok":"","cver":"1.0","lang":"01","sid":"8888","syscode":"09","auth":None,"extension":[{"name":"pageid","value":"212093"},{"name":"webp","value":0},{"name":"protocal","value":"http"}]},"contentType":"json"}
        for url in self.start_urls:
            yield Request(
                url,
                self.parse,
                method='POST',
                headers=headers,
                body=json.dumps(data)
            )

    def parse(self, response):
        page = response.body
        print(page)

【Comments】:

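Could you explain why the same headers work with python-requests but not with Scrapy?

@Metalloy In the Scrapy headers we only describe Accept-Encoding, Content-Type, Accept, and so on. If we supply a Content-Length ourselves, it may differ from the real body size, and that can lead to a Bad Request.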
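One way to follow this advice when headers are copied from a browser's network panel is to filter out the entries the HTTP client derives on its own before building the request. The sketch below is a minimal illustration, not part of Scrapy: the helper `drop_computed_headers` and the `COMPUTED_HEADERS` set are hypothetical names, and the choice of which headers to drop is an assumption.

```python
# Headers the HTTP client normally derives from the request itself.
# This set is an illustrative assumption, not a list taken from Scrapy.
COMPUTED_HEADERS = {"content-length", "host"}


def drop_computed_headers(headers):
    """Return a copy of headers without the entries named in COMPUTED_HEADERS.

    Header names are compared case-insensitively.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in COMPUTED_HEADERS
    }


# A subset of the headers from the question, as copied from the browser.
copied_from_browser = {
    "Host": "m.ctrip.com",
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Content-Length": "455",
    "Origin": "http://wap.ctrip.com",
}

safe_headers = drop_computed_headers(copied_from_browser)
print(safe_headers)
```

The resulting dict could then be passed as headers= to the Request in the spider, leaving the body length to be calculated at send time.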
