Fetching data from html table, when no class name in html

Posted 2023-02-18

【Title】: Fetching data from html table, when no class name in html 【Posted】: 2015-11-16 03:28:44 【Question】:

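I am receiving information in HTML format and have to store it. Using BeautifulSoup in Python I can extract specific information, but I have to give a class name in the filter. The table I am dealing with has no class name at all. I want a dictionary like this: {"Product": "Choclate, Honey, Shampoo", "Quantity": "3, 1, 1", "Price": "45, 32, 16"}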

The sample HTML renders as the following table:

Product	Quantity	Price
Choclate	3	Rs 45.00
Honey	2	Rs 32.00
Shampoo	1	Rs 16.00

【Answer 1】:

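You don't have to give a class name. If it is the only table on the page, just search for the table tag. Otherwise, look at the surrounding HTML elements and at the whole path from the <body> element down to that table, and check whether any class, identifier, or other attribute singles out this particular table. If all of that fails, you may have to search for a header cell containing the word Product and then find the <table> element from there.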
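The three lookup strategies described above can be sketched as follows. This is only an illustration, not the answerer's code: the inline markup and the wrapper id `orders` are assumptions made up for the example, since the real surrounding HTML is unknown.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the wrapper id "orders" is an assumption for illustration.
HTML = """
<body>
  <div id="orders">
    <table>
      <thead><tr><th>Product</th><th>Quantity</th><th>Price</th></tr></thead>
      <tbody>
        <tr><td>Choclate</td><td>3</td><td>Rs 45.00</td></tr>
      </tbody>
    </table>
  </div>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")

# 1. The page has a single table: the tag name alone is enough.
only_table = soup.find("table")

# 2. An ancestor carries an id or class: follow the path with a CSS selector.
by_path = soup.select_one("div#orders table")

# 3. Fallback: find a header cell by its text, then climb up to the table.
by_header = soup.find("th", string="Product").find_parent("table")

# All three lookups should reach the very same element in the parse tree.
same_table = only_table is by_path and by_path is by_header
print(same_table)
```

Which strategy is appropriate depends on the page: the first breaks as soon as a second table appears, the second depends on the wrapper keeping its id, and the third depends on the header text staying the same.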

Since I don't know the surrounding HTML, I'll show the fallback solution of searching for a header cell with a specific text value:

#!/usr/bin/env python
from __future__ import absolute_import, division, print_function
from pprint import pprint
from bs4 import BeautifulSoup


def main():
    with open('test.html') as html_file:
        # Name the parser explicitly so the result does not depend on
        # which parsers happen to be installed.
        soup = BeautifulSoup(html_file, 'html.parser')

    # Locate the header row through the cell that contains "Product".
    header_row_node = soup.find('th', string='Product').parent
    headers = list(header_row_node.stripped_strings)
    header2values = dict((h, list()) for h in headers)
    # Assumes the data rows are wrapped in a <tbody> element.
    for row_node in header_row_node.find_parent('table').tbody('tr'):
        product, quantity, price = row_node.stripped_strings
        price = price.split()[-1]  # Just take the number part.
        for header, value in zip(headers, [product, quantity, price]):
            header2values[header].append(value)

    # items() works on Python 2 and 3; iteritems() exists only on Python 2.
    result = dict((h, ', '.join(vs)) for h, vs in header2values.items())
    pprint(result)


if __name__ == '__main__':
    main()

For the given test data (which I slightly corrected/completed before saving it as test.html), this prints:

{u'Price': u'45.00, 32.00, 16.00',
 u'Product': u'Choclate, Honey, Shampoo',
 u'Quantity': u'3, 2, 1'}
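If the table has no <thead>/<tbody> wrappers, a variant that walks the <tr> rows directly may be easier to adapt. The following is a minimal sketch under assumptions, not the answerer's code: the inline markup is invented to match the sample table, the helper name `table_to_columns` is hypothetical, and the first row is assumed to be the header row.

```python
from bs4 import BeautifulSoup

# Assumed markup: no class names and no <thead>/<tbody> wrappers.
HTML = """
<table>
  <tr><th>Product</th><th>Quantity</th><th>Price</th></tr>
  <tr><td>Choclate</td><td>3</td><td>Rs 45.00</td></tr>
  <tr><td>Honey</td><td>2</td><td>Rs 32.00</td></tr>
  <tr><td>Shampoo</td><td>1</td><td>Rs 16.00</td></tr>
</table>
"""


def table_to_columns(html):
    """Return {header: 'v1, v2, ...'} for the table with a 'Product' header."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("th", string="Product").find_parent("table")
    rows = table.find_all("tr")
    # Assumption: the first row of the table holds the header cells.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    columns = {header: [] for header in headers}
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        for header, value in zip(headers, cells):
            if header == "Price":
                value = value.split()[-1]  # keep only the numeric part
            columns[header].append(value)
    return {header: ", ".join(vals) for header, vals in columns.items()}


result = table_to_columns(HTML)
print(result)
```

Iterating over `find_all("tr")` avoids relying on a <tbody> element, which `html.parser` does not add when the markup omits it.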
