How to extract the price for the security as text from the website through Python Selenium BeautifulSoup

Posted 2023-02-23


【Title】: How to extract the price for the security as text from the website through Python Selenium BeautifulSoup 【Posted】: 2019-07-10 05:55:42 【Question】:

I am simply trying to get the price of the security shown at https://investor.vanguard.com/529-plan/profile/4514. I run this code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox(executable_path=r'C:\Program_Files_EllieTheGoodDog\Geckodriver\geckodriver.exe')
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

When I use "Inspect Element" on the price in the Firefox window opened by Selenium, I clearly see this:

<span data-ng-if="!data.isLayer" data-ng-bind-html="data.value" data-ng-class="sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF" class="ng-scope ng-binding arrange">$42.91</span>

But that data is not in my soup. If I print the soup, the HTML is completely different from what the site displays. I tried the following, and it failed completely:

myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': 'sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF', 'class': 'ng-scope ng-binding arrange'})

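I am completely stumped. If anyone could point me in the right direction, I would really appreciate it. I feel like I am missing something entirely, probably several things...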

【Comments】:

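data-* values are accessible through dataset: developer.mozilla.org/en/docs/Web/API/HTMLElement/dataset

Sorry, but I don't understand what that means. I'm sure it's just another sign that I don't know what I'm doing! Thanks, though.

Not at all. It's just that attributes starting with data- can be accessed through dataset. For example, <input id="ease" data-value="5"> can be read with document.querySelector('input#ease').dataset.value

【Answer 1】: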

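There is nothing wrong with the way you select the span using the data-* attributes and their values. In fact, it is the correct approach as mentioned in the documentation. Four span tags match all of those attributes, and find_all returns all of them. The second one corresponds to the price.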
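To see the selection on its own, without a browser, here is a minimal sketch that runs the same kind of find_all call against a static snippet. The SAMPLE_HTML string is a hypothetical, trimmed stand-in modelled on the spans quoted in this thread, not the real page source, and it filters on only two of the data-* attributes to keep the example short.

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet modelled on the spans shown in this thread.
# The real page builds these elements with JavaScript after it loads.
SAMPLE_HTML = """
<table><tbody><tr>
  <td><span data-ng-if="!data.isLayer" data-ng-bind-html="data.value">Unit price as of 02/15/2019</span></td>
  <td><span data-ng-if="!data.isLayer" data-ng-bind-html="data.value">$42.91</span></td>
</tr><tr>
  <td><span data-ng-if="!data.isLayer" data-ng-bind-html="data.value">Change</span></td>
  <td><span data-ng-if="!data.isLayer" data-ng-bind-html="data.value"><span class="number-positive">$0.47</span> <span class="number-positive">1.11%</span></span></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# attrs is a dict: every listed attribute must be present with that exact value.
spans = soup.find_all(
    "span",
    attrs={"data-ng-if": "!data.isLayer", "data-ng-bind-html": "data.value"},
)

# The nested number-positive spans lack the data-* attributes, so four spans match.
price = spans[1].get_text(strip=True)
print(len(spans), price)
```

Relying on the position of the match (index 1) is fragile if the page layout changes, so treat the index as an assumption tied to the markup shown here.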

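What you missed is that the span takes some time to load, and the page source is returned before that happens. You can explicitly wait for that span and only then read the page source. Here I use an XPath to wait for the element. You can get the XPath from the inspector: right-click the element -> Copy -> Copy XPath.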

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
# Wait up to 10 seconds for the price span to be present in the DOM
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[3]/div[1]/div/div/table/tbody/tr[1]/td[2]/div/span[1]')))
# Read the page source only now, so that it includes the rendered span
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# attrs must be a dict mapping attribute names to the expected values
myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': 'sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF', 'class': 'ng-scope ng-binding arrange'})
print(myspan)
print(myspan[1].text)  # the second match holds the price

Output

[<span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF" data-ng-if="!data.isLayer">Unit price as of 02/15/2019</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF" data-ng-if="!data.isLayer">$42.91</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF" data-ng-if="!data.isLayer">Change</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF" data-ng-if="!data.isLayer"><span class="number-positive">$0.47</span> <span class="number-positive">1.11%</span></span>]
$42.91
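For intuition about why the explicit wait in this answer fixes the problem, the following is a conceptual sketch of what such a wait does: it calls a condition repeatedly until the condition returns something truthy or a timeout expires. This is not Selenium's implementation. The names wait_until and price_if_loaded are hypothetical, and the fake condition simply pretends the price appears on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical stand-in for a page whose price span shows up on the third check.
attempts = {"count": 0}


def price_if_loaded():
    attempts["count"] += 1
    return "$42.91" if attempts["count"] >= 3 else None


price_text = wait_until(price_if_loaded, timeout=2.0, poll=0.01)
print(price_text)
```

Reading driver.page_source immediately after driver.get() corresponds to checking the condition once and giving up, which is why the span was missing from the original soup.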

【Comments】:

【Answer 2】:

Selenium alone is enough to extract the required text. You need to induce WebDriverWait for visibility_of_element_located(), and you can use the following solution:

Code block:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//tr[@class='ng-scope']//td[@class='ng-scope right']//span[@class='ng-scope ng-binding arrange' and @data-ng-bind-html]"))).get_attribute("innerHTML"))

Console output:

$42.91

【Comments】:
