Using Selenium in Python to get data from a dynamic website: how to discover the way database queries are done?

Posted 2023-05-09


【Title】: Using Selenium in Python to get data from a dynamic website: how to discover the way database queries are done? 【Posted】: 2019-02-17 14:39:09 【Question】:

I have some previous coding experience, but not specifically with web applications. I have been tasked with getting data from this website: http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/

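The data are made available daily. I have used Selenium in Python, and so far it works well: I can get the whole table, store it in a pandas DataFrame, then in a MySQL database, and so on. The problem is that the results from the website are always the same!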

Here is my code:

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
def GetDataFromWeb(day, month, year):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    #had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    #the table is on an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    #getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.send_keys("/".join((str(day),str(month),str(year))))
    date = driver.find_element_by_tag_name("button").click()

    #I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(5)

    page = bs(driver.page_source,"html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252','360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',','.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    #only the first 2 columns are interesting for my purposes
    return df.iloc[:,0:2]

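No matter what input I send to it, the table produced by this function is always the same. The values seem to correspond to the date 06/09/2018 (month = 09, day = 06). I think the main problem is that I don't know how the queries to their database are made, so this always runs as if with a "default date". I have read about Ajax and JavaScript requests, but I don't know whether that is the case here. How can I find out?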


【Answer 1】:

This code will work (a few lines of your code have been updated):

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import pandas as pd
def GetDataFromWeb(day, month, year):

    # to avoid data error in date handler
    if month < 10:
        month="0"+str(month)
    if day < 10:
        day="0"+str(day)

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    #had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    #the table is on an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    #getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.clear()  # to clear auto populated data
    date.send_keys(((str(day),str(month),str(year))))  # removed the join part

    driver.find_element_by_tag_name("button").click()

    #I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(50)

    page = bs(driver.page_source,"html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252','360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',','.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    #only the first 2 columns are interesting for my purposes
    return df.iloc[:,0:2]

print(GetDataFromWeb(3,9,2018))

It will print the data matching the requested date.

I added the following to avoid a data error in the date handler:

if month < 10:
    month="0"+str(month)
if day < 10:
    day="0"+str(day)

date.clear()  # to clear auto populated data
date.send_keys(((str(day),str(month),str(year))))  # removed the join part

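Note that the problem in your code is that the day and month fields take two digits, and the line date.send_keys("/".join((str(day), str(month), str(year)))) produces invalid input, so the system date is selected instead and you always see the same data whatever you enter. In addition, clicking the date field selects a default date, so we have to clear the field first and then send the custom date. Hope this helps.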
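The zero-padding step can also be written with string formatting. The following is a minimal sketch, assuming (as the answer implies) that the field expects a two-digit day and month followed by a four-digit year. `format_date_digits` is a hypothetical helper name used only for illustration; it is not part of Selenium.

```python
def format_date_digits(day, month, year):
    """Return the date as zero-padded digit groups: (DD, MM, YYYY)."""
    # int() accepts both 3 and "03", so callers may pass either form.
    return (
        "{:02d}".format(int(day)),
        "{:02d}".format(int(month)),
        "{:04d}".format(int(year)),
    )


print(format_date_digits(3, 9, 2018))
```

The three strings can then be sent to the date field in the same way the answer's code sends its manually padded values.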

Update for the follow-up question: add these imports

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Add this line in place of the sleep:

WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#divContainerIframeBmf > form > div > div > div:nth-child(1) > div:nth-child(3) > div > div > p')))
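`WebDriverWait(...).until(...)` evaluates a condition repeatedly until it returns a truthy value or the timeout expires. The pure-Python sketch below illustrates only that polling idea; it is not Selenium's implementation, and `wait_until`, `timeout`, and `poll` are hypothetical names chosen for the example.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Example: a condition that becomes true on the third check.
attempts = []


def ready_on_third_try():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready_on_third_try, timeout=5, poll=0.01))
```

Unlike a fixed `time.sleep(50)`, a polling wait returns as soon as the condition holds and waits the full timeout only in the failure case.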

【Comments】:

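Thanks a lot! If it is not too much to ask, could you at least give me a hint on how to change this "time.sleep(50)" line? I would like to use webdriver.wait with the condition that the website's database query has finished, but I don't know how it does that. Is there something like "jquery done"?

Add this line in place of the sleep: WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#divContainerIframeBmf > form > div > div > div:nth-child(1) > div:nth-child(3) > div > div > p'))) It needs these imports: from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditions as EC; from selenium.webdriver.common.by import By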
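On the "jquery done" question: if a page happens to use jQuery, the counter `jQuery.active` reports how many of its Ajax requests are still in flight. Nothing in this thread confirms that the site uses jQuery, so the sketch below is an illustration under that assumption. `ajax_idle` and `FakeDriver` are hypothetical names. The only driver method used is `execute_script`, and the fake driver stands in for a real browser so that the example runs without one.

```python
def ajax_idle(driver):
    """Return True when the page reports no pending jQuery Ajax requests.

    Assumes the page loads jQuery; if window.jQuery is missing, the script
    returns true because there is nothing to wait for.
    """
    script = "return window.jQuery ? window.jQuery.active === 0 : true;"
    return bool(driver.execute_script(script))


class FakeDriver:
    """Stand-in for a Selenium driver, used only to exercise ajax_idle."""

    def __init__(self, pending_requests):
        self.pending_requests = pending_requests

    def execute_script(self, script):
        # A real driver would run the JavaScript in the browser.
        return self.pending_requests == 0


print(ajax_idle(FakeDriver(pending_requests=2)))
print(ajax_idle(FakeDriver(pending_requests=0)))
```

With a real driver, a callable like this can be passed as `WebDriverWait(driver, 30).until(ajax_idle)`, because `until` calls the supplied function with the driver as its argument.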
