Skipping the content of a specific element when calling findAll() in Beautiful Soup

Posted 2023-02-24

【Title】Skip content of specific element when doing findAll() in Beautiful Soup 【Posted】2016-09-25 22:22:46 【Question】:

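For example: I want to find the content of elements with class "author" (soup.findAll(class_='author')), but skip searching inside elements with class "comments" (soup.findAll(class_='comments')).

In other words: class "author", but not inside any element with class "comments".

Is something like this possible in bs?

Example HTML: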

<div class ='article'>
    <span class='author'> John doe</span> <h3>title</h3>
    (...)
    <div class='comments'>
        <div class='row'>
             <span class='author'>Whining anon</span>
             <div class='content'>
             (...)
             </div>
        </div>
    </div>
</div>

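【Comments】:

BS allows find_all to take in a function arg as a filter. I don't have much experience with BS, but you might be able to play with that.

I'm on it. I need soup.findall(class_='author').findParents() and to check whether they have a "comments" tag. But right now I'm not up to thinking about it. I'll figure it out tomorrow if nobody else has by then.

Added an HTML example.

【Answer 1】:

I think one way is to use a for loop and an if statement to filter on .parent. This can be cleaned up to suit your needs, but it works by using item.parent['class'] to get the class of the containing div for comparison. Note that this only inspects the direct parent of each match.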

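from bs4 import BeautifulSoup

# someHTML is the HTML string to parse (e.g. the example from the question)
soup = BeautifulSoup(someHTML, 'html.parser')

results = soup.findAll(class_="author")

for item in results:
    # .get() avoids a KeyError when the parent has no class attribute
    if 'comments' in item.parent.get('class', []):
        pass
    else:
        print(item)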

Or as a list comprehension:

clean_results = [item for item in results if 'comments' not in item.parent.get('class', [])]
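As a hedged sketch (not the answerer's code), the same comprehension can be extended from the direct parent to any ancestor: find_parent(class_='comments') returns None when no enclosing element has that class. The sample_html string is an abbreviated copy of the question's example with the elided "(...)" parts left out, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's example HTML (assumption: the elided
# "(...)" parts do not matter for the filtering shown here).
sample_html = """
<div class='article'>
    <span class='author'> John doe</span> <h3>title</h3>
    <div class='comments'>
        <div class='row'>
            <span class='author'>Whining anon</span>
            <div class='content'></div>
        </div>
    </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
results = soup.find_all(class_='author')

# Direct-parent check: the comment author's parent is div.row, so it is kept.
parent_only = [item for item in results
               if 'comments' not in item.parent.get('class', [])]

# Ancestor check: find_parent() walks up the tree and returns None when no
# enclosing element has class "comments".
any_ancestor = [item for item in results
                if item.find_parent(class_='comments') is None]

print([item.get_text(strip=True) for item in parent_only])
print([item.get_text(strip=True) for item in any_ancestor])
```

On this sample the direct-parent version still returns both authors, while the ancestor version keeps only the article author.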

【Discussion】:

I'm not sure it would work if the element I want to avoid isn't the direct parent, as in

<div class='comments'><div class='row'><span class='author'>Mark</span>(...)</div></div>

The function arg as filter from R Nar's comment, combined with .findParents(), seems like a better approach.

【Answer 2】:

def AuthorNotInComments(tag):
    # Filter for findAll(): True for "author" tags with no "comments" ancestor
    c = tag.get('class')
    if not c:
        return False
    if 'author' in c:
        if tag.findParents(class_='comments'):
            return False
        return True
    return False

soup.findAll(AuthorNotInComments)

Or a "case-insensitive contains" version:

import re

def AuthorNotInComments(tag):
    c = tag.get('class')
    if not c:
        return False
    p = re.compile('author', re.IGNORECASE)
    class_str = " ".join(c)
    # search() rather than match(), so 'author' may appear anywhere in the class string
    if p.search(class_str) and not tag.findParents(
            class_=re.compile('comments', re.IGNORECASE)):
        return True
    return False

soup.findAll(AuthorNotInComments)

I welcome any suggestions or cleanup of the code. It would be great if someone could figure out how to make it reusable, something like findAll(class_="test", not_under="junk").
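One possible shape for such a reusable version, offered only as a sketch: findAll() itself has no not_under argument, so find_all_not_under below is a hypothetical wrapper that forwards its keyword arguments to find_all() and then drops every match that has an ancestor with the given class. The helper name, its parameters, and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def find_all_not_under(soup, not_under, **kwargs):
    """Hypothetical helper: like soup.find_all(**kwargs), but skips any match
    that has an ancestor carrying the CSS class `not_under`."""
    return [tag for tag in soup.find_all(**kwargs)
            if tag.find_parent(class_=not_under) is None]


# Small sample modelled on the question's example HTML.
html = """
<div class='article'>
    <span class='author'> John doe</span>
    <div class='comments'>
        <div class='row'><span class='author'>Whining anon</span></div>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
authors = find_all_not_under(soup, 'comments', class_='author')
print([a.get_text(strip=True) for a in authors])
```

Because find_parent() accepts the same kinds of class_ values as find_all(), a compiled regular expression could be passed as not_under for case-insensitive matching.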
