How to extract all href from a class in Python Selenium?

Posted 2023-03-06


[Title]: How to extract all href from a class in Python Selenium? [Asked]: 2020-03-27 11:32:02 [Question]:

I am trying to extract the hrefs of the people listed at the URL https://www.dx3canada.com/agenda/speakers.

I tried:

elems = driver.find_elements_by_css_selector('.display-flex card vancouver')
href_output = []
for ele in elems:
    href_output.append(ele.get_attribute("href"))
print(href_output)

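But the output list comes back empty...

The expected hrefs were shown in a screenshot; I would like the output to be a list of those hrefs.

Thank you very much for your help!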


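[Answer 1]:

The desired elements on https://www.dx3canada.com/agenda/speakers are inside an <iframe>, so to extract the href attributes of the people you have to:

1. Induce WebDriverWait for the desired frame to be available and switch to it.
2. Induce WebDriverWait for the visibility of all the desired elements.

You can use the following Locator Strategies: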

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.dx3canada.com/agenda/speakers')
# Wait for the speaker iframe to be available, then switch the driver's context into it.
WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#whovaIframeSpeaker")))
# Wait until every matching link is visible, then collect the href attributes.
links = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.display-flex.card.vancouver")))
print([my_elem.get_attribute("href") for my_elem in links])

Console output:

['https://whova.com/embedded/speaker_detail/dcrma_202003/9942778/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907682/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907688/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907676/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907696/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907690/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907670/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907693/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9942779/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9908087/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907671/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907681/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907673/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907678/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907689/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907674/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907684/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907685/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907686/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9942780/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907695/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907687/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907683/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907692/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907672/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907697/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907680/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907679/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907675/', 
'https://whova.com/embedded/speaker_detail/dcrma_202003/9907677/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907694/']
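The two WebDriverWait calls in this answer are explicit waits: the condition is polled repeatedly until it returns something truthy or the timeout runs out. The sketch below illustrates that polling idea in plain Python. It is not Selenium's implementation, and `wait_until`, `timeout` and `poll_interval` are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper: raises TimeoutError once `timeout` seconds have
    passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulated "page" whose links only appear on the third poll.
attempts = {"count": 0}


def links_loaded():
    attempts["count"] += 1
    return ["link-1", "link-2"] if attempts["count"] >= 3 else []


print(wait_until(links_loaded, timeout=2, poll_interval=0.01))
```

An empty list is falsy, so a condition that returns the elements it found doubles as a readiness check, which is why the wait in the answer can hand back the list of links directly.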

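A related discussion can be found in "Ways to deal with #document under iframe".

[Discussion]:

Great answer. Could you explain what add_experimental_option is doing with enable-automation and useAutomationExtension? I have never seen these options before and would love to know what they are for!

@Christine In short, I use the two experimental options, enable-automation and useAutomationExtension, for a few different things; two examples are getting rid of the infobar and avoiding detection as a bot.

[Answer 2]: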

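Your elements are located in an iframe, so you need to switch to it with frame_to_be_available_and_switch_to_it before you can scrape the href attributes.

Then, to get a list of all the href attributes, you may need to run some JavaScript that scrolls each element into view, which handles the case where the hrefs are lazy-loaded: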

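# assumes driver, WebDriverWait, EC and By are already created/imported
# first, switch to iframe
WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[@id='whovaIframeSpeaker']")))

# then locate the speaker links inside the iframe
# (find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which Selenium 4 removed)
elements_list = driver.find_elements(By.XPATH, "//div[contains(@class, 'template-section-body')]/a[contains(@class, 'display-flex card vancouver')]")

for element in elements_list:
    # scroll each link into view in case its href is lazy-loaded
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    print(element.get_attribute("href"))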

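The result of this code was shown as a screenshot in the original answer.

[Discussion]:

Thank you, Christine. I tried the code above, but it does not work. dx3canada.com/agenda/speakers is the link I am scraping.

@Bangbangbang After looking at your page, I found that the problem is the iframe. I have updated my answer and tested it. The hrefs now all print successfully.

[Answer 3]: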
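A side note on the XPath used in this answer: contains(@class, 'display-flex card vancouver') is a substring test on the raw class attribute, so it only matches while the three classes appear in exactly that order, separated by single spaces. A common alternative is to test each class as a whole token. The helper below is a minimal sketch of that idea; `xpath_has_classes` is a hypothetical name, not part of Selenium, and it assumes class names contain no quotes.

```python
def xpath_has_classes(tag, *class_names):
    """Build an XPath matching `tag` elements that carry every given class.

    Hypothetical helper: each class is tested as a whole token, so the
    order of the classes in the HTML does not matter.
    """
    if not class_names:
        raise ValueError("at least one class name is required")
    predicates = [
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in class_names
    ]
    return f"//{tag}[" + " and ".join(predicates) + "]"


print(xpath_has_classes("a", "display-flex", "card", "vancouver"))
```

The resulting string can be passed to driver.find_elements(By.XPATH, ...) after switching into the iframe.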

For your CSS selector, use .display-flex.card.vancouver instead.

elems = driver.find_elements_by_css_selector('.display-flex.card.vancouver')

Each word is a separate class, so you need to put a dot in front of every word.

[Discussion]:

Hmm. I added the dots but still got nothing.
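The rule can be sketched as a small conversion from a class attribute, as copied from the HTML, to a compound CSS selector. This is only an illustration; `classes_to_css` is a hypothetical helper, not a Selenium function.

```python
def classes_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper: 'display-flex card vancouver' becomes
    '.display-flex.card.vancouver', optionally prefixed with a tag name.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    return tag + "".join("." + name for name in classes)


print(classes_to_css("display-flex card vancouver", tag="a"))
# prints: a.display-flex.card.vancouver
```

In the original selector, the spaces make `card` and `vancouver` read as tag names of descendant elements, which is why it matched nothing. A correct selector still finds nothing on this page until the driver has switched into the iframe, as the other answers point out.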
