Losing information when using BeautifulSoup

Posted 2023-03-06

[Title]: Losing information when using BeautifulSoup [Posted]: 2019-12-22 17:19:51 [Question]:

I am following the guide "Automate the Boring Stuff with Python" and practicing a project called 'Project: "I'm Feeling Lucky" Google Search'.

But the CSS selector does not return anything.

import requests, sys, webbrowser, bs4, pyperclip

# Take the search terms from the command line, or fall back to the clipboard.
if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

I have tested the same code in the IDLE shell.

It seems that

linkElems = soup.select('.r')

returns nothing.

After that I checked the value returned by Beautiful Soup:

soup = bs4.BeautifulSoup(res.text,"html.parser")

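I found that all the class='r' and class='rc' elements had disappeared for no apparent reason, even though they are present in the original HTML file.

Please tell me why this happens and how to avoid this kind of problem.

[Question comments]:

[Answer 1]: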

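Google blocks your request because the default user-agent of the requests library is python-requests. Google recognizes that value and responds with completely different HTML, with different elements and selectors, which is why the .r and .rc classes are missing. Note that even when you do send a browser user-agent, you can sometimes still receive different HTML with different selectors.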

Read more about the user-agent and HTTP request headers.
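To see what the script identifies itself as, one option is to ask requests for its default value. This is only a small sketch: `requests.utils.default_user_agent()` is part of requests, while the variable name `default_ua` is just illustrative. No network request is made.

```python
import requests

# The User-Agent string requests sends when no custom header is supplied.
default_ua = requests.utils.default_user_agent()

# The exact version suffix depends on the installed requests release.
print(default_ua)
```

The printed value starts with `python-requests/`, which is easy for a server to tell apart from a browser.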

Pass the user-agent in the request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
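If you want to confirm that the custom header is really what will be sent, a request can be prepared without sending it. The sketch below assumes the same Google search URL and query used elsewhere in this answer; `BROWSER_UA`, `session` and `prepared` are illustrative names, and nothing goes over the network.

```python
import requests

# Illustrative browser-like User-Agent (same string as in the answer).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)

session = requests.Session()

# Build the request object but do not send it.
request = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "My query goes here"},
    headers={"User-Agent": BROWSER_UA},
)

# prepare_request merges the session defaults with the per-request headers;
# the per-request User-Agent overrides the python-requests default.
prepared = session.prepare_request(request)

print(session.headers["User-Agent"])   # the default that would be sent otherwise
print(prepared.headers["User-Agent"])  # the header that will actually be sent
print(prepared.url)                    # the URL with the encoded query string
```

Passing the query through `params` rather than string concatenation also lets requests handle the URL encoding.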

Try using the lxml parser instead; it's faster.

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

-----

'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://***.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
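Because the selectors can differ from one response to the next, it may be worth guarding against a missing element instead of indexing straight into the result. The following is a minimal sketch, not Google's actual markup: `SAMPLE_HTML` is a made-up fragment that only borrows the `.tF2Cxc` and `.yuRUbf` class names from the code above, and `extract_links` is a hypothetical helper. It uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration; real Google markup differs and changes.
SAMPLE_HTML = """
<div class="tF2Cxc"><div class="yuRUbf"><a href="https://example.com/a">A</a></div></div>
<div class="tF2Cxc"><div class="other">no link in this block</div></div>
<div class="tF2Cxc"><div class="yuRUbf"><a href="https://example.com/b">B</a></div></div>
"""


def extract_links(html):
    """Return the href of each result block, skipping blocks without a link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for result in soup.select(".tF2Cxc"):
        anchor = result.select_one(".yuRUbf a")
        # select_one returns None when nothing matches, so check before indexing.
        if anchor is None or not anchor.get("href"):
            continue
        links.append(anchor["href"])
    return links


print(extract_links(SAMPLE_HTML))
```

If the function returns an empty list for a real response, that is a hint that the page came back with different selectors rather than that the parser dropped anything.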

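Alternatively, you can use the Google Organic Results API from SerpApi to do the same thing. It is a paid API with a free plan.

The difference in your case is that you only need to extract the data you want from a JSON string, rather than figuring out how to extract it from Google, maintain the parser, or bypass blocks.

Code to integrate: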

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://***.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''

Disclaimer: I work for SerpApi.

[Comments]:

[Answer 2]:

To get the version of the HTML in which the class r is defined, you need to set the User-Agent in the headers:

import requests
from bs4 import BeautifulSoup

address = 'linux'

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text,"html.parser")

linkElems = soup.select('.r a')

for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)

Prints:

Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux

...and so on.

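[Comments]:

Thank you very much, it works! But I still don't know the reason.

@Tritium Some websites return different versions of the HTML depending on the User-Agent, and Google is one of them. But yes, sometimes that is hard to find out, and it can change unpredictably.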
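One way to check which version of the page a script actually received is to list the class names present in the raw response text and compare them with what the browser shows. This is a rough sketch using only the standard library; `ClassCollector` and `classes_in` are hypothetical helpers, and both HTML strings are made-up stand-ins (in practice you would pass `res.text`).

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def classes_in(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Made-up stand-ins: what a browser might see vs. what the script received.
browser_html = '<div class="rc"><div class="r"><a href="/url?q=x">x</a></div></div>'
script_html = '<div class="basic-result"><a href="/url?q=x">x</a></div>'

print("r" in classes_in(browser_html))
print("r" in classes_in(script_html))
```

If the class is absent from the raw text itself, the server sent different HTML, and Beautiful Soup did not lose anything.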
