BeautifulSoup: get CSS classes from HTML

Posted 2023-02-23


【Title】: BeautifulSoup: get css classes from html 【Posted】: 2012-07-15 02:54:35 【Question】:

Is there a way to get the CSS classes from an HTML file using BeautifulSoup? Example snippet:

<style type="text/css">

 p.c3 {text-align: justify}

 p.c2 {text-align: left}

 p.c1 {text-align: center}

</style>

The ideal output would be:

cssdict = {
    'p.c3': {'text-align': 'justify'},
    'p.c2': {'text-align': 'left'},
    'p.c1': {'text-align': 'center'}
}

although this would also do:

L = [
    ('p.c3', {'text-align': 'justify'}),
    ('p.c2', {'text-align': 'left'}),
    ('p.c1', {'text-align': 'center'})
]

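【Comments】:

What are you expecting to get? The text "\n\n p.c3 {text-align: justify}\n\n..."? Please be specific!

By "get CSS classes", do you mean "get a list of the HTML classes used in the selectors in the stylesheet"? That is, is your desired result ['c3', 'c2', 'c1']?

@Martin Pieters, @Quentin -- updated the question.

So you want the rule sets, not the classes? You need to find a CSS parser. I don't think BeautifulSoup has any features along those lines (it can get you the stylesheet, but it cannot parse it).

@Quentin -- rule sets, yes; my question was worded incorrectly. Sorry about that. I am not sure whether this (the comments) is the right place to ask, but is there a recommended CSS parser?

【Answer 1】: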

Combining BeautifulSoup with cssutils does the job nicely:

    from bs4 import BeautifulSoup as BSoup
    import cssutils

    # htmlfile is assumed to hold the path to the HTML file
    selectors = {}
    with open(htmlfile) as webpage:
        html = webpage.read()
        soup = BSoup(html, 'html.parser')
    # every <style> tag, whether in head or body
    for styles in soup.select('style'):
        css = cssutils.parseString(styles.encode_contents())
        for rule in css:
            # skip comments, @media, @import and other non-style rules
            if rule.type == rule.STYLE_RULE:
                style = rule.selectorText
                selectors[style] = {}
                for item in rule.style:
                    propertyname = item.name
                    value = item.value
                    selectors[style][propertyname] = value

BeautifulSoup finds every "style" tag in the HTML (head and body), .encode_contents() converts the BeautifulSoup object into bytes that cssutils can read, and cssutils then parses each CSS rule down to the property/value level through rule.selectorText and rule.style.

Note: the check against "rule.STYLE_RULE" keeps only style rules. The cssutils documentation details the options for filtering media rules, comments and imports.

It would be cleaner broken up into functions, but you get the gist...

【Discussion】:

【Answer 2】:

BeautifulSoup itself does not parse CSS style declarations at all, but you can extract those sections and then parse them with a dedicated CSS parser.

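Depending on your needs, there are several CSS parsers available for Python; I would pick cssutils (requires Python 2.5 or later, including Python 3), which has the most complete support and also handles inline styles.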

Other options are css-py and tinycss.

To grab and parse all the style sections (example with cssutils):

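import cssutils

# tree is assumed to be an already parsed BeautifulSoup document
sheets = []
for styletag in tree.find_all('style', type='text/css'):
    if not styletag.string:  # probably an external sheet
        continue
    # parseString parses a whole stylesheet; parseStyle only handles
    # a bare declaration block such as the value of a style attribute
    sheets.append(cssutils.parseString(styletag.string))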

With cssutils you can then combine these sheets, resolve imports, and even have it fetch external stylesheets.

【Discussion】:

【Answer 3】:

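There is the tinycss parser, written specifically for parsing CSS in Python. BeautifulSoup handles HTML tags and cannot search for specific CSS classes unless you resort to a regex. tinycss even supports a certain amount of CSS3.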

http://packages.python.org/tinycss/

PS: It only works with Python 2.6 and later, though.
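For comparison, the regex route mentioned above can be sketched with the standard library alone. This is not a CSS parser and is not taken from any of the answers: it assumes flat rules like those in the question and will break on comments, @media blocks, nested braces, or semicolons inside quoted values. The name `naive_css_dict` is hypothetical.

```python
import re

# Contents of each <style>...</style> block
STYLE_BLOCK = re.compile(r'<style\b[^>]*>(.*?)</style>', re.IGNORECASE | re.DOTALL)
# One flat rule: selector text followed by a brace-delimited body
RULE = re.compile(r'([^{}]+)\{([^{}]*)\}')


def naive_css_dict(html):
    """Map selectors to {property: value} dicts for flat rule sets only."""
    result = {}
    for css_text in STYLE_BLOCK.findall(html):
        for selector, body in RULE.findall(css_text):
            declarations = result.setdefault(selector.strip(), {})
            for declaration in body.split(';'):
                name, sep, value = declaration.partition(':')
                if sep:  # ignore empty fragments such as a trailing ';'
                    declarations[name.strip()] = value.strip()
    return result


# Sample input taken from the question
sample_html = """
<style type="text/css">
 p.c3 {text-align: justify}
 p.c2 {text-align: left}
 p.c1 {text-align: center}
</style>
"""
naive = naive_css_dict(sample_html)
```

For anything beyond such simple input, a real parser such as cssutils or tinycss remains the safer choice.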

【Discussion】:
