Need help simulating an XHR request

Posted 2023-02-19

【Title】: Needing help to simulate an XHR request 【Posted】: 2019-06-07 20:44:21 【Question】:

I need to scrape a website that uses a "load more" button. Here is my spider code, written in Python:

import scrapy
import json
import requests
import re
from parsel import Selector
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

headers = {
    'origin': 'https://www.tayara.tn',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'content-type': 'application/json',
    'accept': '*/*',
    'referer': 'https://www.tayara.tn/sc/immobilier/bureaux-et-plateaux',
    'authority': 'www.tayara.tn',
    'dnt': '1',
}

data = '{"query":"query ListingsPage($page: Page, $filter: SearchFilter, $sortBy: SortOrder) {\\n  listings: searchAds(page: $page, filter: $filter, sortBy: $sortBy) {\\n    items {\\n      uuid\\n      title\\n      price\\n      currency\\n      thumbnail\\n      createdAt\\n      category {\\n        id\\n        name\\n        engName\\n        __typename\\n      }\\n      user {\\n        uuid\\n        displayName\\n        avatar(width: 96, height: 96) {\\n          url\\n          __typename\\n        }\\n        __typename\\n      }\\n      __typename\\n    }\\n    trackingInfo {\\n      transactionId\\n      listName\\n      recommenderId\\n      experimentId\\n      variantId\\n      __typename\\n    }\\n    totalCount\\n    pageInfo {\\n      startCursor\\n      hasPreviousPage\\n      endCursor\\n      hasNextPage\\n      __typename\\n    }\\n    __typename\\n  }\\n}\\n","variables":{"page":{"count":36,"offset":"cDEwbg==.MjAxOC0xMi0wMlQxMzo1MDoxMlo=.MzY="},"filter":{"queryString":null,"category":"140","regionId":null,"attributeFilters":[]},"sortBy":"CREATED_DESC"},"operationName":"ListingsPage"}'

class Tun(scrapy.Spider):

    name = "tayaracommercial"
    start_urls = [
        'https://www.tayara.tn/sc/immobilier/bureaux-et-plateaux'
    ]

    def parse(self, response):
        yield Request('https://www.tayara.tn/graphql', method='post', headers=headers, body=data, self.parse_item)

    def parse_item(self, response):

        source = 'Tayara'
        reference = response.url.split('//')[1].split('/')[3]
        titre = response.xpath('//h1[@data-name="adview_title"]/text()').extract()

        yield {'Source': source, 'Reference': reference, 'Titre': titre}

This is my humble attempt. I know it is wrong. Can you correct it?
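
A few things in the spider above would stop it from running: `Request` is used without being imported, the callback is passed positionally after keyword arguments (which Python rejects as a syntax error), and the GraphQL endpoint presumably returns JSON, so the XPath in `parse_item` would find nothing. The sketch below is one possible way to build the request body and read the reply using only the standard library. The query is a trimmed version of the one in the question, and the helper names `build_body` and `parse_listings`, the reply shape (`data` > `listings` > `items`), the use of `uuid` as the reference, and the idea that `endCursor` can be fed back as the next `offset` are all assumptions, not confirmed behavior of the site.

```python
import json

GRAPHQL_URL = "https://www.tayara.tn/graphql"  # endpoint taken from the question

# Trimmed version of the question's query, shortened for illustration only.
QUERY = """
query ListingsPage($page: Page, $filter: SearchFilter, $sortBy: SortOrder) {
  listings: searchAds(page: $page, filter: $filter, sortBy: $sortBy) {
    items { uuid title price currency }
    pageInfo { endCursor hasNextPage }
  }
}
"""


def build_body(offset, count=36, category="140"):
    """Serialize the GraphQL payload; json.dumps takes care of quoting and braces."""
    payload = {
        "operationName": "ListingsPage",
        "query": QUERY,
        "variables": {
            "page": {"count": count, "offset": offset},
            "filter": {
                "queryString": None,
                "category": category,
                "regionId": None,
                "attributeFilters": [],
            },
            "sortBy": "CREATED_DESC",
        },
    }
    return json.dumps(payload)


def parse_listings(text):
    """Extract rows and the next cursor from a JSON reply (assumed reply shape)."""
    listings = json.loads(text)["data"]["listings"]
    rows = [
        # Assumption: the ad's uuid stands in for the question's 'Reference'.
        {"Source": "Tayara", "Reference": item["uuid"], "Titre": item["title"]}
        for item in listings["items"]
    ]
    page_info = listings["pageInfo"]
    # Assumption: endCursor is the value to send as the next 'offset'.
    next_offset = page_info["endCursor"] if page_info["hasNextPage"] else None
    return rows, next_offset
```

In a Scrapy spider, the body could then be sent with `scrapy.Request(GRAPHQL_URL, method='POST', headers=headers, body=build_body(offset), callback=self.parse_item)`, and the callback could call `parse_listings(response.text)` instead of using XPath.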

【Discussion】:

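An XHR request is an ordinary HTTP request; many JavaScript libraries send it with the header X-Requested-With: XMLHttpRequest (Wikipedia: en.wikipedia.org/wiki/XMLHttpRequest). Some servers do not check that header, so a plain request works too. All you need is the URL of the request. You can find it in the HTML with XPath, or you can use DevTools in Chrome/Firefox to see every request the browser sends to the server. 【Answer 1】: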

You can scrape the data along the lines of the following example:

# Importing the dependencies
# lxml is needed to build an element tree that can be queried with CSS selectors
from lxml.etree import fromstring

# The requests library
import requests


class WholeFoodsScraper:

    API_url = 'http://www.wholefoodsmarket.com/views/ajax'
    scraped_stores = []

    def get_stores_info(self, page):

        # This is the only data the API requires
        # in order to send back the stores info
        data = {
            'view_name': 'store_locations_by_state',
            'view_display_id': 'state',
            'page': page,
        }
        # Making the POST request
        response = requests.post(self.API_url, data=data)

        # The data we are looking for is in the second
        # element of the response, under the key 'data',
        # so that is what gets returned
        return response.json()[1]['data']
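
To tie this back to the "load more" button and to the X-Requested-With header mentioned in the discussion, here is a minimal sketch that requests one page after another until a page comes back empty. It uses `urllib` from the standard library rather than `requests`, purely so that it runs without extra dependencies. The helper names, the zero-based page numbering, the `max_pages` cap, and the "empty fragment means no more pages" stop condition are assumptions made for illustration, not documented behavior of the endpoint.

```python
import json
from urllib import parse, request

API_URL = "http://www.wholefoodsmarket.com/views/ajax"  # endpoint from the answer


def build_page_request(page):
    """Build (but do not send) the form-encoded POST for one page."""
    form = {
        "view_name": "store_locations_by_state",
        "view_display_id": "state",
        "page": page,
    }
    return request.Request(
        API_URL,
        data=parse.urlencode(form).encode("ascii"),
        # Header that many AJAX libraries add; some servers check for it.
        headers={"X-Requested-With": "XMLHttpRequest"},
        method="POST",
    )


def send(req):
    """Send the request and decode the JSON reply (performs network I/O)."""
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_pages(fetch=send, max_pages=50):
    """Yield the 'data' fragment of each page until one comes back empty."""
    for page in range(max_pages):  # assumption: pages are numbered from 0
        payload = fetch(build_page_request(page))
        fragment = payload[1]["data"]
        if not fragment:  # assumption: an empty fragment marks the end
            break
        yield fragment
```

Passing the sender in as `fetch` keeps the paging logic separate from the network call, so the loop can be exercised with a stub before it is pointed at the real site.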
