Can't scrape customized property links from zillow using requests

Posted: 2021-07-11 18:31:30

Question:

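I'm trying to parse the different property links that get populated when I select options from two dropdowns on zillow. Once the options are chosen, I can see the results in json format in dev tools. However, when I do the same thing with the script below, I get some weird text.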

Manual steps:

    1. Navigate to the site
    2. Choose an option from the first dropdown
    3. Choose an option from the second dropdown

This is how I've tried to automate it:

import json
import requests
from pprint import pprint

link = 'https://www.zillow.com/search/GetSearchPageState.htm?'

params = {
    'searchQueryState': {"pagination":{},"usersSearchTerm":"Vista, CA","mapBounds":{"west":-117.44051346728516,"east":-116.99488053271484,"south":33.126944633035116,"north":33.27919773006566},"regionSelection":[{"regionId":41517,"regionType":6}],"isMapVisible":True,"filterState":{"doz":{"value":"6m"},"isForSaleByAgent":{"value":False},"isForSaleByOwner":{"value":False},"isNewConstruction":{"value":False},"isForSaleForeclosure":{"value":False},"isComingSoon":{"value":False},"isAuction":{"value":False},"isPreMarketForeclosure":{"value":False},"isPreMarketPreForeclosure":{"value":False},"isRecentlySold":{"value":True},"isAllHomes":{"value":True},"hasPool":{"value":True},"hasAirConditioning":{"value":True},"isApartmentOrCondo":{"value":False}},"isListVisible":True,"mapZoom":11},
    'wants': {"cat1":["listResults","mapResults"]},
    'requestId': 2
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link,params=json.dumps(params))
    pprint(res.content)

This is the output it produces:

b'<!-- This page outputs JSON instead of anything written here. -->'

How can I parse the customized property links from zillow using requests?

Answer 1:

You have to URL-encode the query string, just as it appears in the request URL.

For that, you need:

urllib.parse.urlencode()
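To make the difference concrete, here is a minimal sketch using only the standard library. The shortened `params` dict and the `encode_query` helper are stand-ins made up for illustration, not part of the original script. Note that `urlencode()` calls `str()` on nested dicts, so they end up as Python reprs (single quotes, `True`). Serializing each nested value with `json.dumps` first gives strict JSON instead; whether the endpoint cares about that difference is an assumption on my part.

```python
import json
import urllib.parse

# Shortened stand-in for the real parameters (values trimmed for illustration).
params = {
    "searchQueryState": {"usersSearchTerm": "Vista, CA", "isMapVisible": True},
    "wants": {"cat1": ["listResults"]},
    "requestId": 2,
}

# json.dumps(params) gives a single JSON blob with no key=value pairs in it.
blob = json.dumps(params)

# urlencode() builds key=value pairs joined by "&" and percent-encodes them.
# Nested dicts are passed through str(), i.e. they become Python reprs.
as_repr = urllib.parse.urlencode(params)


def encode_query(p):
    """Hypothetical helper: JSON-serialize nested values, then urlencode."""
    flat = {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in p.items()
    }
    return urllib.parse.urlencode(flat)


as_json = encode_query(params)

print(as_repr)
print(as_json)
```

Either string can be appended to the link after the `?`.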

Here's a working example:

import json
import urllib.parse

import requests

link = 'https://www.zillow.com/search/GetSearchPageState.htm?'

params = {
    'searchQueryState': {
        "pagination": {},
        "usersSearchTerm": "Vista, CA",
        "mapBounds": {
            "west": -117.44051346728516,
            "east": -116.99488053271484,
            "south": 33.126944633035116,
            "north": 33.27919773006566
        },
        "regionSelection": [{"regionId": 41517, "regionType": 6}],
        "isMapVisible": True,
        "filterState": {
            "doz": {"value": "6m"}, "isForSaleByAgent": {"value": False},
            "isForSaleByOwner": {"value": False}, "isNewConstruction": {"value": False},
            "isForSaleForeclosure": {"value": False}, "isComingSoon": {"value": False},
            "isAuction": {"value": False}, "isPreMarketForeclosure": {"value": False},
            "isPreMarketPreForeclosure": {"value": False},
            "isRecentlySold": {"value": True}, "isAllHomes": {"value": True},
            "hasPool": {"value": True}, "hasAirConditioning": {"value": True},
            "isApartmentOrCondo": {"value": False}
        },
        "isListVisible": True,
        "mapZoom": 11
    },
    'wants': {"cat1": ["listResults"]},
    'requestId': 2
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    s.headers["x-requested-session"] = "BE6D8DA620E60010D84B55EB18DC9DC8"
    s.headers["cookie"] = f"JSESSIONID={s.headers['x-requested-session']}"
    data = json.dumps(
        json.loads(s.get(f"{link}{urllib.parse.urlencode(params)}").content),
        indent=2
    )
    print(data)

Output:

{
  "user": {
    "isLoggedIn": false,
    "hasHousingConnectorPermission": false,
    "savedSearchCount": 0,
    "savedHomesCount": 0,
    "personalizedSearchGaDataTag": null,
    "personalizedSearchTraceID": "607a9ecb5aabe489c361c1d91f368b37",
    "searchPageRenderedCount": 0,
    "guid": "33b7add3-bfd3-4d85-a88a-d9d99256d2a2",
    "zuid": "",
    "isBot": false,
    "userSpecializedSEORegion": false
  },
  "mapState": {
    "customRegionPolygonWkt": null,
    "schoolPolygonWkt": null,
    "isCurrentLocationSearch": false,
    "userPosition": {
      "lat": null,
      "lon": null
    },
    "regionBounds": {
      "north": 33.275284,
      "east": -117.145153,
      "south": 33.130865,
      "west": -117.290241
    }
  },

and much much more ...
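Since the question is about property links, here is a rough sketch of pulling them out of the parsed JSON. The output above is truncated, so the key name `detailUrl` and the nested layout of `sample` are assumptions for illustration only; check the real response in dev tools and adjust the key. `collect_values` is a made-up helper that walks any nested dict/list structure and gathers every value stored under a given key.

```python
import json


def collect_values(node, key):
    """Recursively gather every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: matching keys may also sit deeper in the tree.
            found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical response fragment; the real structure may differ.
sample = json.loads("""
{
  "cat1": {
    "searchResults": {
      "listResults": [
        {"detailUrl": "https://example.com/homedetails/1"},
        {"detailUrl": "https://example.com/homedetails/2"}
      ]
    }
  }
}
""")

links = collect_values(sample, "detailUrl")
print(links)
```

With the working example above, you would pass the result of `json.loads(...)` instead of `sample`.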

Note: Go easy on the site, because they have very sensitive anti-bot measures and will throw a CAPTCHA at you if you keep requesting data too fast.
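One way to go easy on the site is to enforce a pause between requests. The sketch below is only an illustration: the `PoliteThrottle` class, the 5 second delay, the 2 second jitter, and the `looks_like_captcha` check are all assumptions of mine, not values or behavior documented by zillow.

```python
import random
import time


class PoliteThrottle:
    """Keep consecutive requests at least `min_delay` (+ random jitter) apart."""

    def __init__(self, min_delay=5.0, jitter=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # assumed value, tune for your own use
        self.jitter = jitter        # assumed value, tune for your own use
        self._clock = clock
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        pause = 0.0
        if self._last is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            elapsed = self._clock() - self._last
            pause = max(0.0, target - elapsed)
            if pause:
                self._sleep(pause)
        self._last = self._clock()
        return pause


def looks_like_captcha(body):
    """Crude heuristic (an assumption): the body mentions a captcha."""
    return "captcha" in body.lower()
```

Call `throttle.wait()` right before each `s.get(...)`, and stop or back off for a long while if `looks_like_captcha(res.text)` returns true.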
