How to find the comment tag <!--...--> with BeautifulSoup?

Posted 2023-02-21

【Title】: How to find the comment tag <!--...--> with BeautifulSoup? 【Posted】: 2011-08-29 01:34:39 【Question】:

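I tried soup.find('!--'), but it doesn't seem to work. Thanks in advance.

Edit: Thanks for the tips on how to find all comments. I have a follow-up question: how do I search for one specific comment?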

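For example, I have the following comment tag:

<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

All I really want is the <i>Wednesday 110518</i> part. "110518" is the date in YYMMDD form, which I intend to use as my search target. However, I don't know how to find something inside a specific comment tag.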

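【Comments】:

【Answer 1】:

You can find all the comments in a document with the findAll method. The "Removing elements" example in the documentation shows how to do exactly what you are trying to do.

In short, you want this: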

from bs4 import Comment  # BeautifulSoup 3: from BeautifulSoup import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))

Edit: if you want to search inside each comment, you can try:

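import re
from bs4 import Comment

comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    # re.search scans the whole comment; re.match would only match at its start
    m = re.search(r'<i>([^<]*)</i>', comment)
    if m:
        print(m.group(1))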
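
As a self-contained illustration of the two ideas in this answer and its comments (filtering comments by their content, then parsing the comment body as HTML), here is a minimal sketch. It assumes the bs4 package with the built-in html.parser backend; the sample markup and the helper names find_comments_containing and italic_text are made up for illustration and do not come from the original post.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup assumed for illustration, modeled on the comment in the question.
html = """
<p>visible text</p>
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
<!-- an unrelated comment -->
"""

soup = BeautifulSoup(html, "html.parser")


def find_comments_containing(soup, needle):
    """Return every comment node whose text contains `needle` (hypothetical helper)."""
    return soup.find_all(
        string=lambda text: isinstance(text, Comment) and needle in text
    )


def italic_text(comment):
    """Re-parse the comment body as HTML and return the text of its first <i>, or None."""
    inner = BeautifulSoup(str(comment), "html.parser")
    tag = inner.find("i")
    return tag.get_text() if tag else None


matches = find_comments_containing(soup, "110518")
results = [italic_text(c) for c in matches]
print(results)
```

Re-parsing only makes sense when the comment body is itself reasonably well-formed HTML; for free-form comment text, the regex approach above is probably the simpler choice.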

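【Comments】:

What about searching for a specific comment? I'm trying to search for this in an HTML file: <!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> --> Note that 110518 is just the date in yymmdd form. How do I search only the information in that comment tag, specifically only the information inside the <i></i>?

@1stsage Maybe you want to add that requirement to your question.

1stsage, I've updated my post for your specific case. Next time, please make sure your question includes what you want to do.

@1stsage As for searching the content of a comment: if it is valid HTML, you can parse it as well. Or you can use string methods or even a regular expression. With such a small block of text and such simple requirements, I would go with a regex (something like r'\<i\>(.*?)\</i\>').

【Answer 2】: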

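Pyparsing lets you search for HTML comments with its built-in htmlComment expression, and attach parse-time callbacks to validate and extract the various data fields inside the comment: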

from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
import calendar

# have pyparsing define tag start/end expressions for the 
# tags we want to look for inside the comments
span,spanEnd = makeHTMLTags("span")
i,iEnd = makeHTMLTags("i")

# only want spans with class=titlefont
span.addParseAction(withAttribute(**{'class': 'titlefont'}))

# define what specifically we are looking for in this comment
weekdayname = oneOf(list(calendar.day_name))
integer = Word(nums)
dateExpr = Group(weekdayname("day") + integer("daynum"))
commentBody = '<!--' + span + i + dateExpr("date") + iEnd

# define a parse action to attach to the standard htmlComment expression,
# to extract only what we want (or raise a ParseException in case 
# this is not one of the comments we're looking for)
def grabCommentContents(tokens):
    return commentBody.parseString(tokens[0])
htmlComment.addParseAction(grabCommentContents)


# let's try it
htmlsource = """
want to match this one
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

don't want the next one, wrong span class
<!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->

not even a span tag!
<!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->

another matching comment, on a different day
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

for comment in htmlComment.searchString(htmlsource):
    parsedDate = comment.date
    # date info can be accessed like elements in a list
    print(parsedDate[0], parsedDate[1])
    # because we named the expressions within the dateExpr Group
    # we can also get at them by name (this is much more robust, and 
    # easier to maintain/update later)
    print(parsedDate.day)
    print(parsedDate.daynum)
    print()

Prints:

Wednesday 110518
Wednesday
110518

Thursday 110521
Thursday
110521

【Comments】:

The latest version of pyparsing now includes withClass, which simplifies the withAttribute ugliness.
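
To show what that simplification looks like, here is a minimal sketch using withClass in place of withAttribute(**{'class': ...}). It assumes a pyparsing release that still exports the camelCase names used in this answer; the sample markup and the variable names are illustrative only.

```python
from pyparsing import makeHTMLTags, withClass

# Start/end tag expressions for <span>, as in the answer above.
span, span_end = makeHTMLTags("span")

# Only accept spans whose class attribute is exactly "titlefont".
span.addParseAction(withClass("titlefont"))

# Sample markup assumed for illustration.
html = '<span class="bodyfont">skip</span> <span class="titlefont">keep</span>'

# Each match carries the tag's attributes as named results.
matched_classes = [tokens["class"] for tokens in span.searchString(html)]
print(matched_classes)
```

Because "class" is a Python keyword, the attribute has to be read with tokens["class"] rather than attribute access, which is the same reason withAttribute needed the ** dictionary form.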
