BeautifulSoup nested tags

Posted 2023-02-23

[Title]: BeautifulSoup nested tags [Posted]: 2011-06-03 15:36:41 [Question]:

I am trying to parse XML with BeautifulSoup, but I hit a wall when using the "recursive" argument with findAll().

I have a rather odd XML format, shown below:

<?xml version="1.0"?>
<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
      <book>true</book>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
      <book>false</book>
   </book>
 </catalog>

As you can see, the book tag is repeated inside the book tag, which causes an error when I try to do the following:

from BeautifulSoup import BeautifulStoneSoup as BSS

catalog = "catalog.xml"


def open_rss():
    f = open(catalog, 'r')
    return f.read()

def rss_parser():
    rss_contents = open_rss()
    soup = BSS(rss_contents)
    items = soup.findAll('book', recursive=False)

    for item in items:
        print item.title.string

rss_parser()

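As you can see, in my soup.findAll call I added recursive=False, which in theory should stop it from recursing into the item it has found and make it skip to the next one instead.

That does not seem to work, because I always get the following error: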

  File "catalog.py", line 17, in rss_parser
    print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'

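I'm sure I'm doing something silly here, and I would appreciate it if someone could help me solve this.

Changing the XML structure is not an option, and this code needs to perform well, since it may end up parsing large XML files.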

[Answer 1]:

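It looks like the problem is the nested book tags. BeautifulSoup has a predefined set of tags that are allowed to nest (BeautifulSoup.NESTABLE_TAGS), but it does not know that book can be nested, so it gets confused.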
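To make the effect of a "nestable tags" table concrete, here is a toy model of the idea. It is only a sketch built on the standard library's html.parser, not BeautifulSoup's real code: the class name ToyTreeBuilder, the dict-based tree, and the rule "a start tag that is not nestable implicitly closes an open tag of the same name" are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Toy tree builder (hypothetical, for illustration only).

    Assumption: a start tag that is not listed as nestable implicitly
    closes any still-open tag of the same name.
    """

    def __init__(self, nestable=()):
        super().__init__()
        self.nestable = set(nestable)
        self.root = {"name": "[document]", "children": []}
        self.stack = [self.root]

    def _close(self, name):
        # Pop open tags up to and including the nearest one called `name`.
        names = [node["name"] for node in self.stack]
        if name in names[1:]:
            idx = len(names) - 1 - names[::-1].index(name)
            del self.stack[idx:]

    def handle_starttag(self, tag, attrs):
        if tag not in self.nestable:
            self._close(tag)
        node = {"name": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        self._close(tag)


def child_names(node):
    """Return the tag names of a node's direct children."""
    return [child["name"] for child in node["children"]]


SAMPLE = "<catalog><book><title>A</title><book>true</book></book></catalog>"

plain = ToyTreeBuilder()
plain.feed(SAMPLE)
plain.close()

nested = ToyTreeBuilder(nestable={"book"})
nested.feed(SAMPLE)
nested.close()

plain_catalog = plain.root["children"][0]
nested_catalog = nested.root["children"][0]

# Without the nestable entry, the inner <book> becomes a sibling with no <title>.
print(child_names(plain_catalog))
# With it, the inner <book> stays inside the outer one.
print(child_names(nested_catalog), child_names(nested_catalog["children"][0]))
```

In this toy model the un-nested run leaves a second book element without any title child, which is the kind of element on which item.title would be None.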

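The "Customizing the parser" section of the documentation explains what is going on and how to subclass BeautifulStoneSoup to customize the nestable tags. Here is how we can use it to solve your problem: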

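from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3 (Python 2)

class BookSoup(BeautifulStoneSoup):
  # Declare that a <book> tag may be nested inside another <book> tag.
  NESTABLE_TAGS = {
      'book': ['book']
  }

soup = BookSoup(xml) # xml string omitted to keep this short
# Only look at the direct <book> children of <catalog>.
for book in soup.find('catalog').findAll('book', recursive=False):
  print book.title.string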

If we run it, we get the following output:

XML Developer's Guide
Midnight Rain
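For comparison, the same "direct children only" lookup can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is a swapped-in parser, not the approach used above, and the inline SAMPLE string is a shortened version of the catalog made up for the example. Element.findall with a plain tag name matches direct children only, while Element.iter walks the whole subtree.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for catalog.xml (assumption: same nesting as the question).
SAMPLE = """<catalog>
  <book><title>XML Developer's Guide</title><book>true</book></book>
  <book><title>Midnight Rain</title><book>false</book></book>
</catalog>"""

root = ET.fromstring(SAMPLE)

top_level = root.findall("book")     # direct children of <catalog> only
all_books = list(root.iter("book"))  # every <book>, at any depth

titles = [book.findtext("title") for book in top_level]
print(titles)                          # ["XML Developer's Guide", 'Midnight Rain']
print(len(top_level), len(all_books))  # 2 4
```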

[Comments]:

Can you give me an example? I can't seem to get it to work!

[Answer 2]:

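soup.findAll('catalog', recursive=False) will return a list containing only your top-level "catalog" tag. Since that tag has no "title" child, item.title is None.

Use soup.findAll("book") or soup.find("catalog").findChildren() instead.

Edit: OK, the problem was not what I thought it was. Try this: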

BSS.NESTABLE_TAGS["book"] = []
soup = BSS(open("catalog.xml"))
soup.catalog.findChildren(recursive=False)
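One thing to keep in mind with the snippet above is that it edits a dict stored on the class, so the change is visible to everything that shares that class. The sketch below shows the difference between patching in place and overriding in a subclass. BaseParser, OtherUser and BookParser are hypothetical stand-in classes, not part of BeautifulSoup.

```python
class BaseParser:
    """Hypothetical stand-in for a parser class with a class-level table."""
    NESTABLE_TAGS = {}


class OtherUser(BaseParser):
    """Some unrelated code that also relies on BaseParser."""


class BookParser(BaseParser):
    # Overriding in a subclass leaves BaseParser's table untouched.
    NESTABLE_TAGS = {"book": ["book"]}


# The subclass override is invisible to the base class and its other users.
print("book" in BaseParser.NESTABLE_TAGS)  # False
print("book" in OtherUser.NESTABLE_TAGS)   # False

# Patching the base class dict in place changes it for every inheritor.
BaseParser.NESTABLE_TAGS["book"] = []
print("book" in OtherUser.NESTABLE_TAGS)   # True
print(BookParser.NESTABLE_TAGS)            # {'book': ['book']}
```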

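[Comments]:

.children is wrong, and going through "book" only returns the first item.

My mistake, I have corrected it to findChildren(). But .findAll("book") works for me. Are you sure you are using findAll and not find?

My bad, I ended up posting the wrong XML here. I have updated its structure slightly so that it represents exactly what I have, and why soup.findAll("book") does not work for me.

@Marcos: OK, I see what you did. I have added more code to the solution.

@Thomas I would be careful about patching an attribute like that. Subclassing is safer.

[Answer 3]: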

BeautifulSoup is slow and dead; use lxml instead :)

>>> from lxml import etree
>>> rss = open('/tmp/catalog.xml')
>>> items = etree.parse(rss).xpath('//book/title/text()')
>>> items
["XML Developer's Guide", 'Midnight Rain']
>>>
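Since the question mentions potentially large files, a streaming variant may also be worth considering. The sketch below uses the standard library's xml.etree.ElementTree.iterparse rather than lxml. The function name iter_top_level_titles and the inline SAMPLE bytes are made up for illustration, and it assumes the outer book elements always sit directly under the root.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for catalog.xml (assumption: same nesting as the question).
SAMPLE = b"""<catalog>
  <book><title>XML Developer's Guide</title><book>true</book></book>
  <book><title>Midnight Rain</title><book>false</book></book>
</catalog>"""


def iter_top_level_titles(source):
    """Yield the title of each <book> that is a direct child of the root."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On "end" the element and all of its children are fully parsed.
        if elem.tag == "book" and depth == 2:
            yield elem.findtext("title")
            elem.clear()  # drop the subtree we no longer need
        depth -= 1


streamed_titles = list(iter_top_level_titles(io.BytesIO(SAMPLE)))
print(streamed_titles)  # ["XML Developer's Guide", 'Midnight Rain']
```

Tracking the depth is what separates the outer book elements from the inner flag-like ones; a file path or open file object can be passed in place of the BytesIO wrapper.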

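[Comments]:

3.2.0 was released 6 weeks ago. How is it dead?

@marcog: More or less. There is a 3.1 series and a 3.0 series. Development of 3.1 stalled. 3.1 had serious bugs, but people kept using it because it looked like the latest version. So the author basically re-released the latest version from the 3.0 branch as 3.2. You can read more here: crummy.com/2010/11/21/0

In any case, these are different tools. BeautifulSoup is great for reading the broken XML found in HTML "in the wild". In this case, it looks like etree can be used without any problems.

lxml can also handle broken XML.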
