Using BeautifulSoup to scrape tables within comment tags

Posted 2023-02-21

【Published】: 2018-02-28 12:38:29 【Question】:

I am trying to use BeautifulSoup to scrape tables from the following web page: https://www.pro-football-reference.com/boxscores/201702050atl.htm

import requests
from bs4 import BeautifulSoup

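url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)
html = page.text
# Assumed step: the post never shows how soup is built; the parser choice is a guess
soup = BeautifulSoup(html, 'html.parser')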

Most of the tables on the page are inside HTML comments, so they cannot be accessed directly.

print(soup.table.text)

1
2
3
4
OT
Final







via Sports Logos.net
About logos


New England Patriots
0
3
6
19 
6
34





via Sports Logos.net
About logos


Atlanta Falcons
0
21
7
0
0
28

That is, the main tables containing the player statistics are missing. I tried simply removing the comment markers with

html = html.replace('<!--',"")
html = html.replace('-->',"")

but to no avail. How can I access these commented-out tables?
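
A minimal sketch of the marker-stripping idea, assuming the replacement runs before the HTML is handed to BeautifulSoup (a soup object built earlier is not affected by later changes to the string). The sample markup and the ids in it are invented for illustration and only imitate a page whose second table is commented out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one visible table, one table wrapped in a comment.
sample_html = """
<div><table id="scoring"><tr><td>34</td></tr></table></div>
<div><!--
<table id="stats"><tr><td>466</td></tr></table>
--></div>
"""

# Parsed as-is, only the visible table is part of the tree.
before = BeautifulSoup(sample_html, "html.parser")
visible_ids = [t.get("id") for t in before.find_all("table")]

# Strip the comment markers first, then parse the modified string.
uncommented = sample_html.replace("<!--", "").replace("-->", "")
after = BeautifulSoup(uncommented, "html.parser")
all_ids = [t.get("id") for t in after.find_all("table")]

print(visible_ids)
print(all_ids)
```

A blanket replacement like this also exposes anything else that was commented out, so it is probably best kept to pages where the comments are known to hold only table markup.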

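【Comments】:

Take a completely different approach. Use selenium with the Chrome browser. There are plenty of questions and answers on SO that can guide you.

I don't see any tables on the page inside comment tags. Can you show that somehow?

@RomanPerekhrest For example, the table named "Passing, Rushing, and Receiving" about a quarter of the way down the page, which contains the player statistics. When I view the page source in Chrome, that table appears to be wrapped in a comment starting at line 864 of the HTML. Not sure what I'm missing, I really have no HTML experience...

【Answer 1】: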

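In case anyone else is interested in getting the tables out of the comments without using selenium.

You can grab all the comments, then check whether a table is present and pass that text back to BeautifulSoup to parse the table.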

import requests
from bs4 import BeautifulSoup, Comment

r = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')

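if r.status_code == 200:
    soup = BeautifulSoup(r.content, 'html.parser')

    # string= is the current name of the older text= argument
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Substring test: find() > 0 would skip a comment that starts with "<table "
        if "<table " in comment:
            # Re-parse the comment body so the table becomes a normal tag
            comment_soup = BeautifulSoup(comment, 'html.parser')
            table = comment_soup.find("table")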

It would probably be wise to make this a bit more robust, to ensure that the whole table sits within a single comment.
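
One way that robustness check might look, as a sketch only: the helper name `tables_from_comments` and the sample markup are made up here, and the test for a complete table is simply that both the opening and the closing tag appear in the same comment.

```python
from bs4 import BeautifulSoup, Comment


def tables_from_comments(html):
    """Return every <table> that sits completely inside one HTML comment."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        text = str(comment)
        # Skip comments that lack either the opening or the closing tag.
        if "<table" not in text or "</table>" not in text:
            continue
        comment_soup = BeautifulSoup(text, "html.parser")
        tables.extend(comment_soup.find_all("table"))
    return tables


# Hypothetical sample: a complete table, a truncated one, and a plain comment.
sample_html = """
<!-- <table id="complete"><tr><td>1</td></tr></table> -->
<!-- <table id="truncated"><tr><td>2</td></tr> -->
<!-- just a note -->
"""
found = tables_from_comments(sample_html)
print([t.get("id") for t in found])
```

Returning a list rather than the first match keeps the index-based selection used elsewhere in this thread available for the commented-out tables as well.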

【Comments】:

【Answer 2】:

Here you go. You can get any table from that page just by changing the index number.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text

soup = BeautifulSoup(page,'lxml')
table = soup.find_all('table')[1]  #This is the index of any table of that page. If you change it you can get different tables.
tab_data = [[celldata.text for celldata in rowdata.find_all(["th","td"])]
                        for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

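Since every table other than the first two is generated through JavaScript, you need selenium to get at them and parse them. You can now definitely access any table on that page. Here is the modified version.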

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
table = soup.find_all('table')[7]  #This is the index of any table of that page. If you change it you can get different tables.
tab_data = [[celldata.text for celldata in rowdata.find_all(["th","td"])]
                        for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

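【Comments】:

Thanks - I tried your code, but unfortunately, just like @user666's solution, it only retrieves 2 tables. When I change the index to a value greater than 1, I get a "list index out of range" error message.

OK, I have shown you how to handle this kind of thing. Can you specify which data you want to parse? Tell me the heading name or post a screenshot of the section. Please be specific.

There are tables further down the page that contain the player statistics. For example "Passing, Rushing, and Receiving", or another example is the "Defense" table. These are the ones I want :)

Now, check out the edited code. I don't think any table on that page can escape you now. Give it a try. By the way, make sure selenium is installed on your machine.

Selenium is the answer! Thanks

【Answer 3】: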

I was able to parse the tables using Beautiful Soup and Pandas; here is some code to help you.

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'lxml')
# Find the second table on the page
t = soup.find_all('table')[1]
# Read the table into a Pandas DataFrame; newer pandas versions expect a
# file-like object rather than a literal HTML string
df = pd.read_html(StringIO(str(t)))[0]

df now contains the following:

   Quarter   Time  Tm        Detail                                             NWE  ATL
0        2  12:15  Falcons   Devonta Freeman 5 yard rush (Matt Bryant kick)       0    7
1      NaN   8:48  Falcons   Austin Hooper 19 yard pass from Matt Ryan (Mat...    0   14
2      NaN   2:21  Falcons   Robert Alford 82 yard interception return (Mat...    0   21
3      NaN   0:02  Patriots  Stephen Gostkowski 41 yard field goal                3   21
4        3   8:31  Falcons   Tevin Coleman 6 yard pass from Matt Ryan (Matt...    3   28

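【Comments】:

Thanks for your answer. Unfortunately it only retrieves two tables, and still not the ones further down the page (for example the "Defense" table).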
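
For tables such as "Defense" that the index-based lookups above cannot reach, a lookup by id that also searches inside comments may help. The sketch below assumes the wanted table carries an id attribute; the id "defense", the helper names, and the sample figures are invented, so the real ids have to be read from the page source.

```python
from bs4 import BeautifulSoup, Comment


def find_table_by_id(html, table_id):
    """Look for a table in the parsed page first, then inside its comments."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        table = BeautifulSoup(str(comment), "html.parser").find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Flatten a table into a list of rows, each a list of cell strings."""
    return [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]


# Hypothetical sample; the id "defense" and the figures are invented.
sample_html = """
<table id="scoring"><tr><th>Tm</th><th>Final</th></tr></table>
<!--
<table id="defense">
  <tr><th>Player</th><th>Tackles</th></tr>
  <tr><td>Player A</td><td>7</td></tr>
</table>
-->
"""
defense = find_table_by_id(sample_html, "defense")
rows = table_rows(defense)
print(rows)
```

The row lists have the same shape as the tab_data structure in Answer 2, so they can be printed the same way or handed to a DataFrame constructor.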
