How can I get the href of an anchor tag using Beautiful Soup?
【Posted】: 2019-05-30 23:50:48
【Question】: I am trying to use Beautiful Soup to get the href of the anchor tag for the first video in a YouTube search. I am searching for it with "a" and class_="yt-simple-endpoint style-scope ytd-video-renderer", but the result comes back empty:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source,'lxml')
# print(soup.prettify())
a = soup.findAll("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
a_fin = soup.find("a", class_="compact-media-item-image")
print(a)
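【Comments】:
Possible duplicate of "BeautifulSoup getting href".
The HTML source you get from requests.get() does not contain class="yt-simple-endpoint style-scope ytd-video-renderer". That is why you get no matches (findAll returns an empty list and find returns None).
Possible duplicate of "retrieve links from web page using python and BeautifulSoup".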
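【Answer 1】:
The class you are searching for does not exist in the scraped HTML. You can confirm this by printing the soup variable. A class that does exist in the static page works, for example: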
a =soup.findAll("a", class_="sign-in-link")
The output is:
[<a class="sign-in-link" href="https://accounts.google.com/ServiceLogin?passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26feature%3Dplaylist%26hl%3Den%26next%3D%252Fresults%253Fsearch_query%253DMP%252Belection%252Bresults%252B2018%25253A%252BBJP%252Bminister%252Bblames%252Bconspiracy%252Bas%252Breason%252Bwhile%252Blosing&uilel=3&hl=en&service=youtube">Sign in</a>]
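To see which classes are actually available before choosing a selector, something like the following may help. It is only a sketch: `sample_html` is a made-up stand-in for the text returned by requests.get(), and `anchor_class_counts` is a hypothetical helper name, not part of Beautiful Soup.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by requests.get(...).text
sample_html = """
<a class="sign-in-link" href="/signin">Sign in</a>
<a class="yt-uix-tile-link spf-link" href="/watch?v=abc">Video</a>
<a href="/about">About</a>
"""


def anchor_class_counts(html):
    """Count how often each CSS class appears on <a> tags in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for a in soup.find_all("a"):
        # The class attribute comes back as a list; anchors without one yield []
        counts.update(a.get("class", []))
    return counts


class_counts = anchor_class_counts(sample_html)
print(class_counts.most_common())
```

If the class from the browser's inspector is missing from these counts, the markup was most likely added by JavaScript after the page loaded, and a plain requests.get() will not see it.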
【Answer 2】:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source,'lxml')
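# attrs must be a dict; [0] picks the first match and raises IndexError if there is none
first_search_result_link = soup.findAll('a', attrs={'class': 'yt-uix-tile-link'})[0]['href']
Heavily inspired by this answer.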
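Indexing with [0] fails with an IndexError when nothing matches, which is exactly the situation in the question. A more defensive variant could look like this; `first_href` and `sample_html` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup


def first_href(html, css_class):
    """Return the href of the first <a> with the given class, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # find returns the first matching tag, or None when nothing matches;
    # href=True additionally requires the tag to have an href attribute
    tag = soup.find("a", class_=css_class, href=True)
    return tag["href"] if tag is not None else None


# Hypothetical static markup standing in for a fetched results page
sample_html = (
    '<a class="yt-uix-tile-link" href="/watch?v=abc">First</a>'
    '<a class="yt-uix-tile-link" href="/watch?v=def">Second</a>'
)

print(first_href(sample_html, "yt-uix-tile-link"))
print(first_href(sample_html, "no-such-class"))
```

Returning None lets the caller decide what to do when the class is not present, instead of crashing inside the lookup.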
【Answer 3】: Another option is to render the page with Selenium first.
import bs4
from selenium import webdriver
url = 'https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing'
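# Raw string so the backslashes in the Windows path are not treated as escapes.
# Newer Selenium releases expect the driver path to be passed via a Service object instead.
browser = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')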
browser.get(url)
source = browser.page_source
soup = bs4.BeautifulSoup(source,'html.parser')
hrefs = soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
for a in hrefs:
    print(a['href'])
Output:
/watch?v=Jor09n2IF44
/watch?v=ym14AyqJDTg
/watch?v=g-2V1XJL0kg
/watch?v=eeVYaDLC5ik
/watch?v=StI92Bic3UI
/watch?v=2W_4LIAhbdQ
/watch?v=PH1WZPT5IKw
/watch?v=Au2EH3GsM7k
/watch?v=q-j1HEnDn7w
/watch?v=Usjg7IuUhvU
/watch?v=YizmwHibomQ
/watch?v=i2q6Fm0E3VE
/watch?v=OXNAMyEvcH4
/watch?v=vdcBtAeZsCk
/watch?v=E4v2StDdYqs
/watch?v=x7kCuRB0f7E
/watch?v=KERtHNoZrF0
/watch?v=TenbA4wWIJA
/watch?v=Ey9HfjUyUvY
/watch?v=hqsuOT0URJU
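The hrefs above are relative to the site root. If full URLs are needed, the standard library's urljoin can resolve them against the page URL. A small sketch, where the `search_query=example` value is a placeholder and not the original query:

```python
from urllib.parse import urljoin

# Placeholder page URL; any URL on the same host resolves the same way
base_url = "https://www.youtube.com/results?search_query=example"

# Two of the relative links from the output above
relative_hrefs = ["/watch?v=Jor09n2IF44", "/watch?v=ym14AyqJDTg"]

# urljoin replaces the path and query of base_url with those of each href
absolute_urls = [urljoin(base_url, href) for href in relative_hrefs]

for url in absolute_urls:
    print(url)
```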
【Answer 4】: You can use Selenium to get the dynamic HTML, or use a Googlebot user agent to get the static HTML.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
source = requests.get("https://.......", headers=headers).text
soup = BeautifulSoup(source, 'lxml')
links = soup.findAll("a", class_="yt-uix-tile-link")
for link in links:
    print(link['href'])
【Answer 5】: Try iterating over the matches:
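from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
data = urlopen("some_url")
html_data = data.read()
soup = BeautifulSoup(html_data, 'html.parser')
# href=True keeps only anchors that actually have an href attribute
for a in soup.findAll('a', href=True):
    print(a['href'])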
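The href=True filter matters because a['href'] raises a KeyError on anchors that have no href attribute. A minimal sketch on made-up markup (the `sample_html` string is an assumption for illustration) shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first anchor is a named target without an href
sample_html = (
    '<a name="top">Top of page</a>'
    '<a href="/watch?v=abc">Video</a>'
    '<a href="/help">Help</a>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# Every <a> tag, including the one without an href
all_anchors = soup.find_all("a")

# Only the <a> tags that carry an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(len(all_anchors))
print(hrefs)
```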