python - scraping tables by navigating different options in drop down list

Posted 2023-02-23

【Title】python - scraping tables by navigating different options in drop down list 【Posted】2019-05-28 09:06:55 【Question】:

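I am trying to scrape data from this site: https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx

The site sets the default year to 2018 (the most recent year), and I want to scrape all available years.

A very similar question was asked 4 years ago, but its solution does not seem to work.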

scraping a response from a selected option in dropdown list

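When I run it, all it does is print the table for the default year, regardless of the parameter I assign.

I cannot reach the different years through the URL, because the URL does not change when I select an option in the drop-down box. So I tried using webdriver and XPath.

Here is the code I tried: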

from selenium import webdriver
from bs4 import BeautifulSoup as BSoup

url = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"

driver = webdriver.Chrome("/Applications/chromedriver")
driver.get(url)

year = 2017
driver.find_element_by_xpath("//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@value='"+str(year)+"']").click()
page = driver.page_source
bs_obj = BSoup(page, 'html.parser')

header_row = bs_obj.find_all('table')[0].find('thead').find('tr').find_all('th')
body_rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')
footer_row = bs_obj.find_all('table')[0].find('tfoot').find('tr').find_all('td')

headings = []
footings = []

for heading in header_row:
    headings.append(heading.get_text())

for footing in footer_row:
    footings.append(footing.get_text())

body = []

for row in body_rows:
    cells = row.find_all('td')
    row_temp = []
    for i in range(len(cells)):
        row_temp.append(cells[i].get_text())
    body.append(row_temp)

driver.quit()
print(headings)
print(body)
print(footings)

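I expected the output to be the table for 2017, the year I specified, but the actual output is the table for 2018 (the default year). Can anyone give me an idea of how to solve this?

Edit: I just found that what I see through "Inspect" differs from what I get from the page source. Specifically, the page source still has "2018" as the selected option (which is not what I want), while Inspect shows "2017" as selected. I am still stuck on how to use what Inspect shows instead of the page source.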

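【Comments】:

Your page is still on 2018. Once you find 2017, click it so that the table changes to the 2017 records.

I do click it, on line 6 of the code (not counting blank lines). Scroll the code to the right and you will see the rest of that line.

【Answer 1】: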

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup as BSoup
url = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"
driver = webdriver.Chrome("/Applications/chromedriver")
year = 2017
driver.get(url)
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@value='"+str(year)+"']"))
)
element.click()
# Wait until the option marked as selected shows the requested year,
# i.e. until the page has refreshed with the new season's table.
WebDriverWait(driver, 3).until(
    EC.text_to_be_present_in_element(
        (By.XPATH, "//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@selected='selected']"),
        str(year)
    )
)
# A fixed sleep (e.g. sleep(10)) would also give the table time to load,
# but the explicit wait above is the better option.
page = driver.page_source
bs_obj = BSoup(page, 'html.parser')

header_row = bs_obj.find_all('table')[0].find('thead').find('tr').find_all('th')
body_rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')
footer_row = bs_obj.find_all('table')[0].find('tfoot').find('tr').find_all('td')

headings = []
footings = []

for heading in header_row:
    headings.append(heading.get_text())

for footing in footer_row:
    footings.append(footing.get_text())

body = []

for row in body_rows:
    cells = row.find_all('td')
    row_temp = []
    for i in range(len(cells)):
        row_temp.append(cells[i].get_text())
    body.append(row_temp)

driver.quit()
print(headings)
print(body)

Output

['순위', '팀명', 'AVG', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'SAC', 'SF']
[['1', 'KIA', '0.302', '144', '5841', '5142', '906', '1554', '292', '29', '170', '2414', '868', '55', '56'], ['2', '두산', '0.294', '144', '5833', '5102', '849', '1499', '270', '20', '178', '2343', '812', '48', '47'], ['3', 'NC', '0.293', '144', '5790', '5079', '786', '1489', '277', '19', '149', '2251', '739', '62', '48'], ['4', '넥센', '0.290', '144', '5712', '5098', '789', '1479', '267', '30', '141', '2229', '748', '21', '42'], ['5', '한화', '0.287', '144', '5665', '5030', '737', '1445', '261', '16', '150', '2188', '684', '85', '38'], ['6', '롯데', '0.285', '144', '5671', '4994', '743', '1425', '250', '17', '151', '2162', '697', '76', '32'], ['7', 'LG', '0.281', '144', '5614', '4944', '699', '1390', '216', '20', '110', '1976', '663', '76', '55'], ['8', '삼성', '0.279', '144', '5707', '5095', '757', '1419', '255', '36', '145', '2181', '703', '58', '55'], ['9', 'KT', '0.275', '144', '5485', '4937', '655', '1360', '274', '17', '119', '2025', '625', '62', '45'], ['10', 'SK', '0.271', '144', '5564', '4925', '761', '1337', '222', '15', '234', '2291', '733', '57', '41']]
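The headings and rows printed above are parallel lists, which can be awkward to work with. As a small sketch, they can be paired into one dictionary per team. The first two rows are copied from the output above; the name `records` is only illustrative and not part of the original answer.

```python
# Headings and the first two body rows, copied from the output above.
headings = ['순위', '팀명', 'AVG', 'G', 'PA', 'AB', 'R', 'H',
            '2B', '3B', 'HR', 'TB', 'RBI', 'SAC', 'SF']
body = [
    ['1', 'KIA', '0.302', '144', '5841', '5142', '906', '1554',
     '292', '29', '170', '2414', '868', '55', '56'],
    ['2', '두산', '0.294', '144', '5833', '5102', '849', '1499',
     '270', '20', '178', '2343', '812', '48', '47'],
]

# Pair each cell with its column heading, one dict per team.
records = [dict(zip(headings, row)) for row in body]

print(records[0]['팀명'], records[0]['AVG'])  # KIA 0.302
```

All values stay strings, exactly as scraped; converting them to numbers is left to whatever consumes the records.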

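After clicking, you have to wait a while for the table to refresh. Also read my comments in the code. Sleeping is not the best option.

Edit:

I have edited the code to wait until the selected option's text is the year. The code no longer uses sleep.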
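Since the question asks for every available year, the same click-then-wait pattern can be repeated in a loop. The following is a rough sketch, not part of the original answer: `scrape_seasons` and `selected_option_xpath` are hypothetical helper names, the caller is assumed to supply an already started `driver`, and it is assumed that every requested year exists as an option value in the drop-down.

```python
URL = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"
SEASON_SELECT_NAME = (
    "ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason"
)


def select_xpath(select_name):
    """XPath of the <select> element with the given name attribute."""
    return f"//select[@name='{select_name}']"


def selected_option_xpath(select_name):
    """XPath of the option that the page source marks as selected."""
    return f"{select_xpath(select_name)}/option[@selected='selected']"


def scrape_seasons(driver, years, timeout=10):
    """Return {year: page_source} for each requested year (hypothetical helper)."""
    # Imported here so the XPath helpers above work without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    pages = {}
    driver.get(URL)
    for year in years:
        # Look the drop-down up again on every pass, because the page
        # may have been re-rendered after the previous selection.
        dropdown = Select(
            driver.find_element(By.XPATH, select_xpath(SEASON_SELECT_NAME))
        )
        dropdown.select_by_value(str(year))
        # Same condition as in the answer: wait until the selected
        # option shows the requested year before reading the page.
        WebDriverWait(driver, timeout).until(
            EC.text_to_be_present_in_element(
                (By.XPATH, selected_option_xpath(SEASON_SELECT_NAME)),
                str(year),
            )
        )
        pages[year] = driver.page_source
    return pages
```

A call could look like `pages = scrape_seasons(driver, [2016, 2017])`, with each stored page then passed through the same BeautifulSoup parsing as in the answer.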

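【Comments】:

Thank you so much! I was stuck for days and it finally works!!! By the way, how do you know when the text has changed? Is there a library I can use?

@JoshuaKim Read this obeythetestinggoat.com/…

Thank you very much for the reply.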
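On the question of how the code knows the text has changed: an explicit wait such as `WebDriverWait(...).until(...)` does not get notified by the page. It calls the condition again and again until it returns something truthy or the timeout runs out. The library-free sketch below illustrates that polling idea only; `wait_until` is a hypothetical name, not a Selenium function.

```python
import time


def wait_until(condition, timeout=3.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the selected option now shows 2017":
# the condition only becomes true on the third check.
attempts = []


def year_is_selected():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(year_is_selected, timeout=2.0, poll_interval=0.1)
print(len(attempts))  # 3
```

With Selenium, the condition would be something like the `EC.text_to_be_present_in_element(...)` call used in the answer, so no extra library is needed beyond Selenium itself.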
