Trying to scrape a playoff bracket from Wikipedia using Beautiful Soup. How do I identify the correct columns?

Posted 2023-03-05

[Title]: Trying to scrape a playoff bracket from Wikipedia using Beautiful Soup. How do I identify the correct columns? [Asked]: 2019-09-23 15:03:10 [Question]:

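I am trying to scrape NHL playoff brackets from Wikipedia, from 1988 onward, using Beautiful Soup 4 in Python. The format is inconsistent (sometimes a row contains more than one team; see https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs), which makes this difficult. For each series in a given year, I want to identify the teams, the round, and the number of games won.

Initially, I converted the table to text and used regular expressions to identify the teams and their information, but the order changes depending on whether the bracket allows more than one team per row.

Now I am trying to work through the table row by row, counting things such as cells and column spans, but the results are inconsistent. What I'm missing is how to identify the round 4 teams.

So far, I have tried counting the number of cells before reaching a team...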

import requests
from bs4 import BeautifulSoup as soup

hockeyteams = ['Anaheim','Arizona','Atlanta','Boston','Buffalo','Calgary','Carolina','Chicago','Colorado','Columbus','Dallas','Detroit',
               'Edmonton','Florida','Hartford','Los Angeles','Minnesota','Montreal','Nashville','New Jersey',
               'Ottawa','Philadelphia','Pittsburgh','Quebec','San Jose','St. Louis','Tampa Bay','Toronto','Vancouver','Vegas','Washington',
               'Winnipeg','NY Rangers','NY Islanders']

# page to scrape (the 2004 bracket linked above)
full_link = 'https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs'

# fetch the content from the url
page_response = requests.get(full_link, timeout=5)
# use the html parser to parse the page
page_content = soup(page_response.content, "html.parser")

tables = page_content.find_all('table')

# identify the appropriate table
for table in tables:
    if 'Semi' in table.text and 'Stanley Cup Finals' in table.text:
        bracket = table
        break

for row_num, row in enumerate(bracket.find_all('tr'), start=1):
    print(row_num, '#')
    colcnt = 0
    for col in row.find_all('td'):
        # advance by the cell's colspan (1 when the attribute is absent)
        colcnt += int(col.attrs.get('colspan', 1))
        # exact match against the list; testing against str(hockeyteams)
        # would also match empty cells and partial names
        if col.text.strip() in hockeyteams:
            print(colcnt, col.text)
    print('col width:', colcnt)
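
The loop above adds up colspan only. In an HTML table, a cell with rowspan also occupies a slot in the rows below it, where it has no td of its own, so a per-row count can drift whenever an earlier row contains such a cell. Whether that is what happens on this particular page has not been checked here. The following is a minimal sketch of expanding a table into a rectangular grid so that a column index means the same thing in every row. The names expand_grid and cells_from_bs4_table are hypothetical helpers, not part of Beautiful Soup, and the sample rows are toy data.

```python
def expand_grid(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    Every slot covered by a spanning cell receives that cell's text.
    """
    slots = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # skip slots already claimed by a rowspan cell from an earlier row
            while (r, c) in slots:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    slots[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in slots) + 1 if slots else 0
    n_cols = max(c for _, c in slots) + 1 if slots else 0
    return [[slots.get((r, c), '') for c in range(n_cols)] for r in range(n_rows)]


def cells_from_bs4_table(table):
    """Read (text, rowspan, colspan) tuples from a BeautifulSoup <table> tag."""
    return [
        [(cell.get_text(strip=True),
          int(cell.get('rowspan', 1)),
          int(cell.get('colspan', 1)))
         for cell in tr.find_all(['td', 'th'])]
        for tr in table.find_all('tr')
    ]


# Toy rows (not real page markup): the round-2 cell in the second row is
# that row's first <td>, yet it belongs in the third column.
rows = [
    [('Tampa Bay', 2, 1), ('4', 2, 1), ('', 1, 1)],
    [('Tampa Bay', 2, 1)],
    [('NY Islanders', 2, 1), ('1', 2, 1)],
    [('', 1, 1)],
]
grid = expand_grid(rows)
for line in grid:
    print(line)
```

With the real page, expand_grid(cells_from_bs4_table(bracket)) would be the equivalent call, assuming bracket is the table found above.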

Ultimately I want something like a data frame:

Round, Team A, Team A Wins, Team B, Team B Wins
1, Tampa Bay, 4, NY Islanders, 1
2, Tampa Bay, 4, Montreal, 0

etc.

[Question comments]:

[Answer 1]:

That table can be scraped with pandas:

import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')

bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')
print(bracket)

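The output is full of NaNs, but it contains what I think you are looking for, and you can reshape it with standard pandas methods.

[Comments]:

Thanks, but I think it has the same problem: the extracted column positions are not specific to the round.

I don't quite follow. Can you post the table showing the problem area and how you expect it to look?

Running the code, the Stanley Cup Finals header is in column 7, Tampa Bay is in column 4, and Calgary is in column 3. The second-round series Tampa/Montreal shows up in column 7. I'm not sure what rules I could use to interpret this accurately.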
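
One possible set of rules, offered only as a sketch: if the table is first expanded into a rectangular grid in which rowspan and colspan are resolved (every covered slot repeats the cell's text), then each round may end up in its own column. The code below rests on three assumptions that are not verified against the Wikipedia page: team columns from left to right correspond to rounds 1, 2, 3, ...; consecutive teams within a column play each other; and the games-won figure sits in the cell immediately to the right of the team. The function name series_from_grid and the grid are hypothetical; the toy values are the two series from the desired output in the question.

```python
def series_from_grid(grid, teams):
    """Pair up team cells column by column into playoff series records."""
    teams = set(teams)
    rows_by_col = {}  # column index -> row indices where a team cell starts
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if text not in teams:
                continue
            # a rowspan cell repeats in the row below; keep only its first row
            if r > 0 and c < len(grid[r - 1]) and grid[r - 1][c] == text:
                continue
            rows_by_col.setdefault(c, []).append(r)

    def wins(r, c):
        # ASSUMPTION: games won are in the cell right of the team name
        cell = grid[r][c + 1] if c + 1 < len(grid[r]) else ''
        return int(cell) if cell.isdigit() else None

    records = []
    # ASSUMPTION: team columns, left to right, are rounds 1, 2, 3, ...
    for round_num, c in enumerate(sorted(rows_by_col), start=1):
        team_rows = rows_by_col[c]
        # ASSUMPTION: consecutive teams in one column face each other
        for top, bottom in zip(team_rows[0::2], team_rows[1::2]):
            records.append({
                'round': round_num,
                'team_a': grid[top][c], 'wins_a': wins(top, c),
                'team_b': grid[bottom][c], 'wins_b': wins(bottom, c),
            })
    return records


# Toy grid, already expanded (hypothetical layout, not the real page)
toy_grid = [
    ['Tampa Bay',    '4', '',          ''],
    ['Tampa Bay',    '4', 'Tampa Bay', '4'],
    ['NY Islanders', '1', 'Tampa Bay', '4'],
    ['NY Islanders', '1', '',          ''],
    ['',             '',  'Montreal',  '0'],
    ['',             '',  'Montreal',  '0'],
]
records = series_from_grid(toy_grid, ['Tampa Bay', 'NY Islanders', 'Montreal'])
for rec in records:
    print(rec)
```

Passing the resulting list of dicts to pandas.DataFrame would give one row per series. If the real bracket places rounds differently (for example, a two-sided layout), the column-to-round rule would need to change.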
