Extracting href from <a> with Beautiful Soup

Posted 2023-02-23

[Original title]: extracting href from <a> beautiful soup [Asked]: 2018-07-31 13:01:36 [Question]:

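I am trying to extract links from Google search results. Inspect Element tells me that the part I am interested in has class="r". The first result looks like this: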

<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
    <a href="https://en.wikipedia.org/wiki/Chocolate" 
       ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/Chocolate&amp;ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM" 
       saprocessedanchor="true">
        Chocolate - Wikipedia
    </a>
</h3>

To extract the href, I did:

import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements= googleSoup.select(".r a")
elements[0].get("href")

But unexpectedly I get:

'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'

whereas I wanted:

"https://en.wikipedia.org/wiki/Chocolate"

The "ping" attribute seems to be confusing it. Any ideas?

[Comments]:

Maybe look at the raw source, since Google may have thousands of lines of JavaScript that make the response look different in the browser.

[Answer 1]:

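As the other answer mentioned, this happens because no user-agent is specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it comes from a bot and not a "real" user visit.

Passing a User-agent makes the request look like a real user visit by adding that information to the HTTP request headers. This can be done by passing custom headers (check what your own user-agent is):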

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
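
To see what requests sends when no header is set, the default can be inspected without making any network call. This is a minimal sketch, assuming the requests library is installed; the browser string is the one from the answer above, and the names default_ua, browser_ua and session are only illustrative.

```python
import requests

# The User-Agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# A Session carries the same default until it is overridden.
session = requests.Session()
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)
session.headers.update({"User-Agent": browser_ua})
print(session.headers["User-Agent"])
```

Setting the header on a Session is one option when several requests should share it; passing headers=headers per call, as above, works just as well for a single request.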

Also, to get more accurate results you can pass URL parameters:

params = {
  "q": "samurai cop, what does katana mean",  # query
  "gl": "in",                                 # country to search from
  "hl": "en"                                  # language
  # other parameters
}

requests.get("YOUR_URL", params=params)
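
The params dictionary ends up as the query string of the request URL. The sketch below uses the standard library's urlencode to show roughly what that encoding looks like for the dictionary above; it only illustrates the result and makes no request.

```python
from urllib.parse import urlencode

params = {
    "q": "samurai cop, what does katana mean",  # query
    "gl": "in",                                 # country to search from
    "hl": "en",                                 # language
}

# Spaces become "+" and the comma is percent-encoded as "%2C".
query_string = urlencode(params)
url = "https://www.google.com/search?" + query_string
print(url)
# https://www.google.com/search?q=samurai+cop%2C+what+does+katana+mean&gl=in&hl=en
```

Letting the library build the query string avoids hand-escaping special characters in the search terms.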

Code and full example in the online IDE (the code from the other answer will throw an error because the CSS selectors have changed):

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samurai cop what does katana mean",
  "gl": "in",
  "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']
  print(f'{title}\n{link}\n')

-------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw

Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647

Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481

...
'''
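
Because the class names in Google's markup change over time, select_one can return None, and calling .text or ['href'] on it then raises an error. The sketch below shows one way to guard against that. It runs on a small hand-written HTML string that only mimics the class names used above (the markup and example.com URLs are hypothetical), so no request is made.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above; the second
# result has no title element, as could happen after a layout change.
html = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/one"><h3 class="DKV0Md">First</h3></a></div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/two"></a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for result in soup.select(".tF2Cxc"):
    title_tag = result.select_one(".DKV0Md")
    link_tag = result.select_one(".yuRUbf a")
    # select_one returns None when nothing matches, so check before using it.
    title = title_tag.get_text(strip=True) if title_tag else None
    link = link_tag.get("href") if link_tag else None
    results.append((title, link))

print(results)
# [('First', 'https://example.com/one'), (None, 'https://example.com/two')]
```

With the guards in place, a missing element shows up as None in the output instead of stopping the loop.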

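Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to iterate over structured JSON and quickly get the data you want, rather than figuring out why something doesn't work and then maintaining the parser over time.

Code to integrate: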

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "in",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['title'])
  print(result['link'])
  print()

------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw

Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
...
'''
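
Iterating over the structured result can be made tolerant of missing keys. The sketch below uses a hand-written dictionary shaped like the keys used above (organic_results, title, link); the sample data is hypothetical and far smaller than a real response, and no API call is made.

```python
# Hypothetical sample shaped like the keys used above; real responses
# contain many more fields.
results = {
    "organic_results": [
        {
            "title": "Samurai Cop - He speaks fluent Japanese - YouTube",
            "link": "https://www.youtube.com/watch?v=paTW3wOyIYw",
        },
        {"title": "Entry without a link"},
    ]
}

# .get() with a default avoids a KeyError when the list is absent,
# and the filter skips entries that carry no link.
links = [
    result["link"]
    for result in results.get("organic_results", [])
    if "link" in result
]
print(links)
# ['https://www.youtube.com/watch?v=paTW3wOyIYw']
```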

Disclaimer: I work for SerpApi.

[Answer 2]:

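What's happening?

If you print the response content (i.e. googleSoup.text), you'll see that you are getting completely different HTML. The page source and the response content don't match.

This is not happening because the content is loaded dynamically; even in that case, the page source and the response content would be the same. (The HTML you see when inspecting an element, however, is different.)

The basic explanation is that Google recognizes the Python script and changes its response.

Solution:

You can pass a fake User-Agent to make the script look like a real browser request.

Code: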

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

elements = soup.select('.r a')
print(elements[0]['href'])

Output:

https://en.wikipedia.org/wiki/Chocolate
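
If a response still contains redirect-style hrefs like the '/url?q=...' value from the question, the target can be recovered from the q query parameter with the standard library. This is a minimal sketch; unwrap_google_href is a hypothetical helper name, and it assumes the target always sits in the q parameter of a '/url' path, as in the question's output.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_href(href):
    """Return the target URL from a '/url?q=...' href, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


wrapped = (
    "/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U"
    "&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd"
)
print(unwrap_google_href(wrapped))
# https://en.wikipedia.org/wiki/Chocolate
```

Hrefs that are already direct links pass through unchanged, so the helper can be applied to every result regardless of which HTML variant was returned.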

Resources:

Sending "User-agent" using Requests library in Python
How to use Python requests to fake a browser visit?
Using headers with the Python requests library's get method

[Comments]:

Thanks! Now the result has the same format as in the browser.
