Beautiful Soup CSS selector not finding anything

Posted 2023-03-06


【Title】: Beautiful Soup CSS selector not finding anything 【Posted】: 2020-03-15 21:11:09 【Question】:

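I am using Python 3. The code below is supposed to let the user enter a search term on the command line, then search Google and go through the HTML of the results page to find tags that match the CSS selector ('.r a').

Say we search for the term "cats". I know the tags I am looking for exist on the "cats" results page, because I looked through the page source myself.

But when I run my code, the linkElems list is empty. What is going wrong?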

    import requests, sys, bs4

    print('Googling...')
    res = requests.get('http://google.com/search?q='  +' '.join(sys.argv[1:]))
    print(res.raise_for_status())

    soup = bs4.BeautifulSoup(res.text, 'html5lib')
    linkElems = soup.select(".r a")
    print(linkElems)

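【Comments】:

Someone on the forum below has the same problem as me. One reply there says it might be related to JavaScript, but I don't understand the posted solution. python-forum.io/Thread-I-m-Feeling-Lucky-script-problem-again

【Answer 1】: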

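The ".r" class is rendered by JavaScript, so it is not available in the HTML you receive. You can render the JavaScript with Selenium or something similar, or you can try a more creative solution to extract the links from the tags. First check whether the tags exist at all by looking for them without the ".r" class: soup.find_all("a"). Then, as an example, you can use a regex to extract all URLs that start with "/url?q=":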

import re
# Match tags whose href starts with "/url?q=" (raw string so that \? is a valid regex escape)
linkelems = soup.find_all(href=re.compile(r"^/url\?q="))
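
To turn the matched hrefs into the actual destination URLs, something like the following might work. This is only a sketch: `extract_target` and `sample_hrefs` are made-up names, the sample values are invented, and it assumes the hrefs have the shape "/url?q=<destination>&..." described above.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the destination URL from a "/url?q=..." href, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    # parse_qs percent-decodes the values and returns a list per key
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None

# Hypothetical hrefs, shaped like tag["href"] for the tags found by find_all()
sample_hrefs = [
    "/url?q=https://en.wikipedia.org/wiki/Cat&sa=U&ved=abc",
    "/search?q=cats&tbm=isch",
    "/url?q=https://example.com/page%3Fid%3D1&sa=U",
]

targets = [t for t in map(extract_target, sample_hrefs) if t]
print(targets)
```

With real results you would pass `tag["href"]` for each tag in `linkelems` instead of the sample list.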

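【Discussion】:

Thanks for the answer, I will try Selenium and report back. Also, could you point me to a resource for learning which classes are rendered by JavaScript and which are plain HTML? (I am trying to better understand the relationship between JS and the limitations of the requests module in Python. In particular, if Requests cannot get JS-rendered classes, I would like to know what other limitations the Requests module has.)

JS is executed in the browser, and since Selenium drives a browser, it is able to render it. Other than inspecting the response, I don't know of any method or resource for determining which classes are rendered. Google may have some advanced methods to prevent its content from being scraped.

【Answer 2】: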

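The parts you want to extract are not rendered by JavaScript, contrary to what Matts suggested, and you don't need regex for a task like this.

Make sure you are sending a user-agent, otherwise Google will eventually block your requests. That could be the reason you get empty output: you are receiving completely different HTML. Check what your user-agent is. I have already answered what a user-agent and HTTP headers are.

Pass the user-agent in the HTTP headers: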

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
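
If you want to see the difference for yourself, a rough sketch like the one below prints the user-agent that requests sends by default and then builds (without sending) a request that carries the custom header. The query "cats" and the variable names are just for illustration, and nothing here touches the network.

```python
import requests

# What requests identifies itself as when no User-Agent header is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Prepare the request without sending it, to inspect what would go out
prepared = requests.Request(
    "GET",
    "https://www.google.com/search",
    headers=headers,
    params={"q": "cats"},  # illustrative query
).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
```

The first line printed starts with "python-requests/", which is easy for a site to tell apart from a browser.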

html5lib is the slowest parser; try lxml instead, which is faster. If you want an even faster parser, take a look at selectolax.

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests

# Send a browser-like user-agent along with the request
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# requests URL-encodes the query string built from this dict
params = {
  "q": "selena gomez"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# Go through the result containers and print the link of each one
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''

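Alternatively, you can use the Google Organic Results API from SerpApi to achieve the same thing. It is a paid API with a free plan.

The difference in your case is that you don't have to deal with the parsing part. Instead, you only need to iterate over structured JSON and pick out the data you want, and you don't have to maintain the parser over time.

Code to integrate: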

import os
from serpapi import GoogleSearch

# The API key is read from the API_KEY environment variable
params = {
    "engine": "google",
    "q": "selena gomez",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# Print the link of every organic result
for result in results["organic_results"]:
  link = result['link']
  print(link)

----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
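
If the response might lack "organic_results", or an entry might lack "link", a slightly more defensive loop could look like this. It is only a sketch: the sample dict is hypothetical (it just mimics the shape of what search.get_dict() returns above) and `collect_links` is a made-up helper name.

```python
# Hypothetical data, shaped like the dict returned by search.get_dict()
results = {
    "organic_results": [
        {"position": 1, "link": "https://www.selenagomez.com/"},
        {"position": 2, "title": "entry without a link"},
        {"position": 3, "link": "https://en.wikipedia.org/wiki/Selena_Gomez"},
    ]
}

def collect_links(results):
    """Collect the 'link' of each organic result, skipping entries without one."""
    links = []
    # .get() with a default avoids a KeyError when the key is absent
    for result in results.get("organic_results", []):
        link = result.get("link")
        if link:
            links.append(link)
    return links

for link in collect_links(results):
    print(link)
```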

P.S. I wrote a blog post about how to scrape Google Organic Search Results.

Disclaimer: I work for SerpApi.
