BeautifulSoup - Amazon and Google identify me as a robot; how can I fix it?

【Title】: BeautifulSoup - Amazon and Google identify me as a robot; how can I fix it? 【Posted】: 2021-12-20 19:04:11 【Question】:

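When I scrape Amazon and Google Shopping with BeautifulSoup, after a while (after analyzing roughly 100 to 200 products) they identify me as a robot. How can I prevent this?

By changing my IP I can start again, but after a while they block me again.

This is my code: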

from bs4 import BeautifulSoup
import requests

cookies_goo = {
    "NID": "511=ktkACo_ZFBfZiD_DvYTKQFmYYX7R3Esh1ZtJ6A3F87KG_YzkbqlHc0NmQsGPyc78KIOXyCtVuYE9QmX-ixl-HzpbE9N9K67sGQCTZ2CFZ1oZAhe-iSFKtCcsUCsY8CHmbDu9YtxaEs7prgZqRID19DI6bqN2lxQZjog8HY6ur_M",
    "1P_JAR": "2021-11-05-13",
    "CONSENT": "YES+cb.20211102-08-p0.it+FX+548"
}

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36",
    "Accept-Language": "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7"
}

response = requests.get(url, headers=header, cookies=cookies_goo)
soup = BeautifulSoup(response.content, "lxml")

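【Comments】:

These sites are simply enforcing their terms of service. There is no magic bullet for getting around their filters, because they more likely than not use advanced heuristics to detect automated scraping that violates their policies. Use their APIs instead: that is what they would prefer you do, rather than hitting the relatively expensive front-end pages.

【Answer 1】:

You are a robot, so their algorithm is entirely correct. Try using their API instead.

【Discussion】: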

【Answer 2】:

- Rotate proxies
- Add delays
- Avoid using the same pattern
- IP rate limit (probably your issue)
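The first and third points above can be sketched roughly as follows. This is only an illustration, not a guaranteed way around any filter: the proxy addresses are hypothetical placeholders, the second User-Agent string is an assumed example, and `next_request_kwargs` is a helper name invented here.

```python
import itertools
import random

# Hypothetical placeholders: substitute proxies you are actually allowed to use.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]

# The first string is the one from the question; the second is an assumed example.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0",
]

# Endless round-robin iterator over the proxy list.
proxy_pool = itertools.cycle(PROXIES)


def next_request_kwargs():
    """Build keyword arguments for requests.get() using the next proxy in the cycle."""
    proxy = next(proxy_pool)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "timeout": 10,
    }
```

Each call would then look like `requests.get(url, **next_request_kwargs())`, so consecutive requests leave from different proxies and do not always carry the same headers.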

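IP rate limit. This is a basic security mechanism that can ban or block incoming requests from the same IP. The reasoning is that a regular user would not make 100 requests to the same domain within a few seconds using exactly the same pattern (for example: scroll, click, scroll, click, open).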
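A minimal sketch of spacing requests out, assuming the site signals rate limiting with HTTP 429 or 503 (it may instead return a CAPTCHA page with status 200, which this does not detect). The function names and the delay values are illustrative choices, not recommendations from the sites themselves.

```python
import random
import time


def jittered_delay(base=2.0, spread=3.0):
    """Return a pause between base and base + spread seconds, so requests are not evenly spaced."""
    return base + random.uniform(0, spread)


def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff: 5 s, 10 s, 20 s, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def fetch_politely(urls, fetch, sleep=time.sleep, max_retries=3):
    """Call fetch(url) for each URL, pausing between calls and backing off on 429/503.

    `fetch` is any callable returning an object with a `status_code` attribute,
    for example: lambda u: requests.get(u, headers=header, cookies=cookies_goo)
    """
    responses = []
    for url in urls:
        for attempt in range(max_retries + 1):
            response = fetch(url)
            # Stop retrying on success-like statuses or when retries are used up.
            if response.status_code not in (429, 503) or attempt == max_retries:
                break
            sleep(backoff_delay(attempt))
        responses.append(response)
        # Random pause before moving on to the next URL.
        sleep(jittered_delay())
    return responses
```

Passing `fetch` and `sleep` in as arguments keeps the pacing logic separate from `requests`, which also makes it easy to try out without touching the network.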

How to reduce the chance of being blocked while web scraping search engines.


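Alternatively, you can use the Google Shopping Results API from SerpApi. It is a paid API with a free plan.

The difference in your case is that you don't have to spend time figuring out how to bypass Google's blocking, since that is already handled for the end user.

Example code for parsing data from Google Shopping, with an example in the online IDE: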

import os
from serpapi import GoogleSearch


params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_product",
    "product_id": "14506091995175728218", # can be iterated over multiple product ids
    "gl": "us",                           # country to search from
    "hl": "en"                            # language
}

search = GoogleSearch(params)
results = search.get_dict()

title = results['product_results']['title']
prices = results['product_results']['prices']
reviews = results['product_results']['reviews']
rating = results['product_results']['rating']
extensions = results['product_results']['extensions']
description = results['product_results']['description']
user_reviews = results['product_results']['reviews']
reviews_results = results['reviews_results']['ratings']

print(f'{title}\n'
    f'{prices}\n'
    f'{reviews}\n'
    f'{rating}\n'
    f'{extensions}\n'
    f'{description}\n'
    f'{user_reviews}\n'
    f'{reviews_results}')


'''
Google Pixel 4 White 64 GB, Unlocked
['$247.79', '$245.00', '$439.00']
526
3.7
['October 2019', 'Google', 'Pixel Family', 'Pixel 4', 'android', '5.7″', 'Facial Recognition', '8 MP front camera', 'Smartphone', 'With Wireless Charging']
Point and shoot for the perfect photo. Capture brilliant color and control the exposure balance of different parts of your photos. Get the shot without the flash. Night Sight is now faster and easier to use it can even take photos of the Milky Way. Get more done with your voice. The new Google Assistant is the easiest way to send texts, share photos, and more. A new way to control your phone. Quick Gestures let you skip songs and silence calls – just by waving your hand above the screen. End the robocalls. With Call Screen, the Google Assistant helps you proactively filter our spam before your phone ever rings.
526
[{'stars': 1, 'amount': 101}, {'stars': 2, 'amount': 43}, {'stars': 3, 'amount': 39}, {'stars': 4, 'amount': 73}, {'stars': 5, 'amount': 270}]
'''
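As a quick sanity check on the output above, the per-star counts in the ratings list can be combined into a review total and a weighted average. This is a small sketch that hardcodes the values printed above and does not call the API.

```python
# Ratings histogram copied from the sample output above.
ratings = [
    {"stars": 1, "amount": 101},
    {"stars": 2, "amount": 43},
    {"stars": 3, "amount": 39},
    {"stars": 4, "amount": 73},
    {"stars": 5, "amount": 270},
]

# Total number of reviews across all star levels.
total = sum(r["amount"] for r in ratings)

# Weighted average: each star value counts once per review.
average = sum(r["stars"] * r["amount"] for r in ratings) / total

print(total)              # 526
print(round(average, 1))  # 3.7
```

Both numbers agree with the `reviews` (526) and `rating` (3.7) fields in the same response.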

Example of iterating over multiple product IDs:

# import os
# from serpapi import GoogleSearch


# random numbers except the first one
products = ['14506091995175728218', '1450609199517512118', '145129895175728218']


for product in products:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_product",
        "product_id": product,
        "gl": "us",
        "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    title = results['product_results']['title']

    print(title, sep='\n')  # prints 3 titles from 3 different products
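Since two of the three IDs in the loop are random numbers, some responses may not contain a `product_results` key, in which case `results['product_results']['title']` would raise `KeyError`. A defensive lookup could look like the sketch below. `extract_title` is a helper name invented here, and checking for a top-level `"error"` key is an assumption about how the API reports failures.

```python
def extract_title(results):
    """Return the product title, or None if the response carries no product data."""
    # Assumption: a failed search is reported under a top-level "error" key.
    if "error" in results:
        return None
    # .get() with a default avoids KeyError when a key is absent.
    return results.get("product_results", {}).get("title")
```

Inside the loop this would replace the direct indexing, for example `title = extract_title(search.get_dict())`, followed by skipping the product when the result is `None`.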

Disclaimer: I work for SerpApi.

【Discussion】:
