BeautifulSoup: Get the contents of a specific table

Posted 2023-02-23

[Title]: BeautifulSoup: Get the contents of a specific table [Posted]: 2011-02-25 12:17:44 [Question]:

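My local airport shamefully blocks users who don't have IE, and its site looks awful. I want to write a Python script that fetches the contents of the arrivals and departures pages every few minutes and shows them in a more readable way.

My tools of choice are mechanize, to fool the site into believing I'm using IE, and BeautifulSoup, to parse the page and get the flight data table.

Quite honestly, I got lost in the BeautifulSoup documentation and can't work out how to get the table (whose title I know) out of the whole document, or how to get a list of rows from that table.

Any ideas?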

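[Comments on the question]:

[Answer 1]:

Here is a working example for a generic <table>. (It does not use your page, because JavaScript has to be executed to load the table data there.)

It extracts the table data for GDP (Gross Domestic Product) by country from here.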

from bs4 import BeautifulSoup
html = ... # read your html with urllib/requests etc.
soup = BeautifulSoup(html, 'lxml')  # the parser name is the second argument

htmltable = soup.find('table', {'class': 'table table-striped'})
# where the dictionary specifies unique attributes for the 'table' tag

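The function below parses an HTML segment that starts with a <table> tag, followed by multiple <tr> (table row) tags with inner <td> (table data) tags. It returns a list of rows, each holding its column values. <th> (table header) cells are accepted only in the first row.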

def tableDataText(table):    
    """Parses a html segment started with tag <table> followed 
    by multiple <tr> (table rows) and inner <td> (table data) tags. 
    It returns a list of rows with inner columns. 
    Accepts only one <th> (table header/data) in the first row.
    """
    def rowgetDataText(tr, coltag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]  
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td') ) # data row       
    return rows

Using it we get (first two rows):

list_table = tableDataText(htmltable)
list_table[:2]

[['Rank',
  'Name',
  "GDP (IMF '19)",
  "GDP (UN '16)",
  'GDP Per Capita',
  '2019 Population'],
 ['1',
  'United States',
  '21.41 trillion',
  '18.62 trillion',
  '$65,064',
  '329,064,917']]
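
For experimenting without a live page, the same row-by-row extraction can be sketched on a small inline document. The HTML snippet, the helper name table_rows and the use of the built-in html.parser are assumptions made for this illustration; they are not part of the answer's own code.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table class="table table-striped">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td> United States </td></tr>
  <tr><td>2</td><td>China</td></tr>
</table>
"""

def table_rows(table):
    """Return one list of cell texts per <tr>, reading <th> and <td> cells alike."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table', {'class': 'table table-striped'})
rows = table_rows(table)
print(rows)
```

Unlike tableDataText, this hypothetical helper does not treat the first row specially, so a header cell appearing in a later row would simply be read as data.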

This list of rows can easily be converted into a pandas.DataFrame for more advanced manipulation.

import pandas as pd

dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(4)
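
The scraped cells are all strings, so numeric work usually needs a cleanup step first. The following is a minimal sketch, assuming the columns look like the sample output shown earlier; the header and the single data row are hard-coded from that output rather than scraped, and the choice of columns to convert is only an example.

```python
import pandas as pd

# Header and first data row copied from the sample output; not scraped here.
list_table = [
    ['Rank', 'Name', "GDP (IMF '19)", "GDP (UN '16)", 'GDP Per Capita', '2019 Population'],
    ['1', 'United States', '21.41 trillion', '18.62 trillion', '$65,064', '329,064,917'],
]
dftable = pd.DataFrame(list_table[1:], columns=list_table[0])

# Strip "$" and thousands separators, then convert the text to integers.
for col in ['GDP Per Capita', '2019 Population']:
    dftable[col] = dftable[col].str.replace(r'[$,]', '', regex=True).astype(int)

print(dftable.dtypes)
```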

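[Comments]:

[Answer 2]:

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.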

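from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

html = urlopen(url).read()  # url is the address of the page to parse
bs = BeautifulSoup(html, 'html.parser')
table = bs.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "Table1")
rows = table.find_all(lambda tag: tag.name == 'tr')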
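
The lookup by id can be tried offline on a small inline document. This is a sketch with made-up HTML (the cell values are placeholders, not real flight data); it also shows the keyword form find('table', id='Table1'), which should select the same element as the lambda.

```python
from bs4 import BeautifulSoup

# Made-up markup with two tables; only the second one carries the id we want.
html = """
<table><tr><td>ignore me</td></tr></table>
<table id="Table1">
  <tr><td>flight A</td><td>10:00</td></tr>
  <tr><td>flight B</td><td>10:30</td></tr>
</table>
"""

bs = BeautifulSoup(html, 'html.parser')

# Function-based filter: tag.get('id') returns None when the attribute is missing.
by_lambda = bs.find(lambda tag: tag.name == 'table' and tag.get('id') == 'Table1')

# Keyword-based filter on the same attribute.
by_keyword = bs.find('table', id='Table1')

rows = by_keyword.find_all('tr')
print(by_lambda is by_keyword, len(rows))
```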

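[Comments]:

Really awesome, I didn't know you could search with lambdas. Great stuff! Check your Facebook inbox, I sent you a message.

Any idea how to get to a specific table when there is no id or title to tell it apart? For example, I want the third table in the HTML file (there is no other indicator).

You can even chain find calls inside the lambda (for better filtering, since I have multiple tables and none of them has an id)! table = soup.find(lambda tag: tag.name=='table' and tag.find(lambda ttag: ttag.name=='th' and ttag.text=='Common Name'))

FYI, "has_key" is now deprecated. Use has_attr("id") instead. I'll edit the original answer as well.

[Answer 3]: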

from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML, "html.parser")  # HTML holds the page source as a string

# the first argument to find tells it what tag to search for
# the second is a dict of attr->value pairs used to filter
# the tags that match the first argument
table = soup.find("table", {"title": "TheTitle"})

rows = list()
for row in table.find_all("tr"):
    rows.append(row)

# now rows contains each tr in the table (as a BeautifulSoup Tag object)
# and you can search them to pull out the times

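[Comments]:

Any idea how to get to a specific table when there is no id or title to tell it apart? For example, I want the third table in the HTML file (there is no other indicator).

@ihightower: soup.find_all('table')[2] will give you the third table. (To be safe, though, you should check the length before doing this.)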
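
The length check mentioned in the reply can be sketched as a small helper. nth_table is a hypothetical name and the sample HTML is made up; returning None for a missing table is a design choice of this sketch, not something the thread prescribes.

```python
from bs4 import BeautifulSoup

def nth_table(soup, index):
    """Return the table at the given zero-based position, or None if there are too few."""
    tables = soup.find_all('table')
    return tables[index] if 0 <= index < len(tables) else None

# Three made-up tables with no id or title to tell them apart.
html = "".join(f"<table><tr><td>table {i}</td></tr></table>" for i in range(1, 4))
soup = BeautifulSoup(html, 'html.parser')

third = nth_table(soup, 2)
print(third.get_text(strip=True))
print(nth_table(soup, 5))
```

Note that find_all('table') also counts tables nested inside other tables, so the position is relative to document order rather than to top-level tables only.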
