Parsing broken XML with lxml.etree.iterparse

Posted 2023-02-16


【Title】: Parsing broken XML with lxml.etree.iterparse 【Posted】: 2011-01-22 02:38:12 【Question】:

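I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?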

#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

#how do I do the equivalent with iterparse?  (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from

Thanks for your help!

EDIT -- here is an example of the kind of encoding error I'm running into:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.from
lxml.etree.fromstring      lxml.etree.fromstringlist  

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

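As you can see, chardet thinks it is an ASCII file, but there is a "\x1e" right in the middle of this example, which makes lxml raise an exception.

【Comments】:

The simplest change might be to set the character encoding in the XML declaration. Have you tried that?
What do you mean by "bad unicode"? Are you using the correct encoding?
The data comes from a MySQL dump. I don't know what the encoding is. How can I find out?
See also: How to parse invalid (bad / not well-formed) XML?
In my version for Python 3.6, the function lxml.etree.iterparse has the parameter recover. This solved my problem: lxml.etree.iterparse(xml_filename, events=("end", "start"), recover=True)

【Answer 1】: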

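Edit:

This is an older answer and I would do it differently today. And I'm not just referring to the dumb snark... BeautifulSoup4 has since become available and it is really quite nice. I recommend it to anyone who stumbles over here.

The currently accepted answer is, well, not what one should do. The question itself also makes a bad assumption: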

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually, recover=True is for recovering from malformed XML. There is, however, an "encoding" option which would have fixed your issue.

parser = lxml.etree.XMLParser(encoding='utf-8',  # Your encoding issue.
                              recover=True,  # I assume you probably still want to recover from bad xml, it's quite nice. If not, remove.
                              )

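That's it, that's the solution.

By the way, this goes for anyone struggling with parsing XML in Python, especially XML from third-party sources. I know, I know, the documentation is bad and there are a lot of red herrings; a lot of bad advice.

lxml.etree.fromstring()? - That's for perfectly formed XML, silly.
BeautifulStoneSoup? - Slow, and it has a way-stupid policy for self-closing tags.
lxml.etree.HTMLParser()? - (because the XML is broken) Here's a secret: HTMLParser() is... a parser with recover=True.
lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit.
UnicodeDammit and other cockamamie stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name, and it's useful for stuff beyond XML, but things are usually fixed if you do the right thing with XMLParser().

You could try all sorts of things from what's available online. The lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I'll restate it: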

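from io import BytesIO  # the original Python 2 answer wrapped the string in StringIO
from lxml import etree
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(BytesIO(your_xml_string.encode('utf-8')), magical_parser)  # or pass in an open file object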

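You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.

【Discussion】:

-1 What makes you think that encoding='utf-8' will fix his problem? He reports that his data contains "... some sang along \x1e they enjoyed it so much ..."; \x1e is a valid ASCII character and therefore a valid UTF-8 character. His problem is that \x1e is not a valid character in XML.
lxml.etree.fromstring() seemed happy to accept broken XML in v3.3.0 and became strict as of v3.5.0.
What about lxml.html.fromstring()? You didn't mention it.
I come from 2018. Great answer. Damn XML.
What about a standard-library option such as ElementTree?

【Answer 2】: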

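I solved the problem by creating a class with a file-like object interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.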

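# sketch: file-like wrapper that cleans each line before lxml sees it
import lxml.etree

class myFile(object):
    def __init__(self, filename):
        self.f = open(filename, 'rb')  # read bytes so no decoding is needed

    def read(self, size=None):
        # one cleaned line per call; readline() returns b'' at EOF, which ends parsing
        return self.f.readline().replace(b'\x1e', b'').replace(b'some other bad character...', b'')


# iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')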

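I had to edit the myFile class a couple of times, adding more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked too (it seems to support the recover option), but this solution worked like a charm!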
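Rather than adding one replace() call per offending character, the same wrapper idea can delete every disallowed control byte in one pass. The following is only a sketch: the class name CleanedXMLFile and the constant INVALID_XML_BYTES are made up for illustration, and it assumes the file is ASCII, UTF-8, or another encoding in which bytes below 0x20 never occur inside a multi-byte character (so not UTF-16).

```python
import io

# Bytes 0x00-0x1F minus tab (0x09), newline (0x0A) and carriage return (0x0D):
# the C0 control characters that XML 1.0 does not allow.
INVALID_XML_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


class CleanedXMLFile(object):
    """File-like wrapper (hypothetical name) that drops invalid control bytes."""

    def __init__(self, fileobj):
        self.f = fileobj  # any binary file object, e.g. open(path, 'rb')

    def read(self, size=-1):
        # Deleting bytes never returns more than `size` bytes. Loop so that a
        # chunk made up only of bad bytes is not mistaken for end of file.
        while True:
            chunk = self.f.read(size)
            cleaned = chunk.translate(None, INVALID_XML_BYTES)
            if cleaned or not chunk:
                return cleaned


demo = CleanedXMLFile(io.BytesIO(b'<RECORD>t2\x1et3\r\n</RECORD>'))
print(demo.read())  # b'<RECORD>t2t3\r\n</RECORD>'
```

An instance wrapping open('bigfile.xml', 'rb') could presumably be handed to lxml.etree.iterparse in place of the myFile object above; that combination is not tested here.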

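【Discussion】:

【Answer 3】:

Edit your question to state what happens (the exact error message and traceback; copy/paste, don't retype from memory) that makes you think "bad unicode" is the problem.

Get chardet and feed it your MySQL dump. Tell us what it says.

Show us the first 200 to 300 bytes of your dump, e.g. using print(repr(dump[:300]))

Update: You wrote """As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception."""

I see no "bad unicode" here.

chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".

The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be grumbling about that and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)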
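To make the rule concrete, here is a small sketch that locates characters falling outside the XML 1.0 "Char" production in a decoded string. The helper name find_invalid_xml_chars is hypothetical, and the sample string is a cut-down stand-in for the data in the question.

```python
import re

# Characters outside the XML 1.0 "Char" production: C0 controls other than
# tab, newline and carriage return, plus surrogates, U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for each invalid character (hypothetical helper)."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML_CHARS.finditer(text)]


sample = '<article>t2\x1et3\t\r\n</article>'
print(find_invalid_xml_chars(sample))  # [(11, '0x1e')]
```

The tab, carriage return and newline in the sample are left alone; only the record separator is reported.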

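Given the context, it looks like that character could be deleted without loss. You may wish to fix your database, or you may wish to delete such characters from your dump (after checking that they are all dispensable), or you may wish to choose an output format that is less picky and less voluminous than XML.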
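One way to do the checking step before deleting anything is to tally which disallowed control bytes actually occur in the dump. This is a rough sketch with a hypothetical function name, reading the file in chunks so memory use stays flat; the per-byte loop is simple rather than fast.

```python
import io
from collections import Counter

ALLOWED_CONTROLS = (0x09, 0x0A, 0x0D)  # tab, newline, carriage return


def count_control_bytes(fileobj, chunk_size=1 << 20):
    """Tally disallowed C0 control bytes in a binary file object, chunk by chunk."""
    counts = Counter()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return counts
        # Iterating over a bytes object yields ints in Python 3.
        counts.update(b for b in chunk if b < 0x20 and b not in ALLOWED_CONTROLS)


demo = io.BytesIO(b'one\x1etwo\x1e\x0bthree\r\n')
print(count_control_bytes(demo))  # Counter({30: 2, 11: 1})
```

On a real dump the argument would be something like open('dump.xml', 'rb'), where the file name is a placeholder.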

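Update 2: You presumably want to use iterparse() not because it is your ultimate goal, but because you want to save memory. If you used a format like CSV, you wouldn't have a memory problem.

Update 3: In response to a comment by @Purrell:

Try it yourself, dude. paste.org/3280965

Here are the contents of the paste; it deserves preservation: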

from lxml.etree import etree

data = '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(data), magical_parser)

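To get it to run, one import needs to be fixed and another supplied. The data is monstrous. There is no output to show the result. Here is a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text that consist entirely of valid XML characters (excluding &lt; and &gt;) are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.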

[output wraps at column 80]
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> data = '<article>&lt;p&gt;t1&lt;/p&gt;&lt;p&gt;t2\x1et3&lt;/p&gt;&lt;p&gt;t4
&lt;/p&gt;&lt;p&gt;t5&lt;/p&gt;</article>'
>>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
>>> tree = etree.parse(StringIO(data), magical_parser)
>>> print(repr(tree.getroot().text))
'<p>t1</p><p>t2t3/ppt4/ppt5/p'

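Not what I'd call "recovery"; after the bad character, the < and > characters disappear.

The paste was in response to my question "What makes you think that encoding='utf-8' will fix his problem?". That question was triggered by the statement 'There is, however, an "encoding" option which would have fixed your issue.' But encoding='ascii' produces the same output. So does omitting the encoding argument. It is NOT an encoding problem. Case closed.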
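The same point can be checked without lxml. The sketch below swaps in the standard library's xml.etree.ElementTree (an expat-based parser, not the lxml parser discussed above) and uses a cut-down stand-in for the data: the bytes decode cleanly as both ASCII and UTF-8, yet a strict XML parser still refuses them.

```python
import xml.etree.ElementTree as ET

data = b'<article>t2\x1et3</article>'

# Both decodes succeed and agree: 0x1E is ordinary ASCII, hence valid UTF-8.
decoded_ascii = data.decode('ascii')
decoded_utf8 = data.decode('utf-8')
print(decoded_ascii == decoded_utf8)  # True

# A strict parser rejects the document anyway, because the character itself
# is not allowed in XML, whatever the encoding.
try:
    ET.fromstring(data)
    outcome = 'parsed'
except ET.ParseError:
    outcome = 'rejected'
print(outcome)  # rejected
```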

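【Discussion】:

I've added that information to my question... thanks for your help!
Thanks for the info! I'd still like to figure out a way to get lxml.etree.iterparse to fix broken XML the same way lxml.etree.parse(filename, lxml.etree.XMLParser(recover=True)) does.
@John Machin: RE: "Update 2" -- Yes, I'm using iterparse() to save memory. It's true that in this case I could dump the MySQL database to CSV. That is a very practical solution. However, I'd still like to know how to handle this with lxml, because processing large XML documents is something I will probably have to do in the future. Don't you think this is a pretty common problem?
@ericw: How to handle this with lxml: (1) you change the lxml code to do what you want, (2) you pay someone (e.g. the lxml author) to do what you want, or (3) you sit on the beach until someone else changes lxml to do what you want -- all subject to it being possible/sensible for lxml to be changed. The principled solution to processing large XML documents is to make them valid XML before you attempt to parse them; that way the recovery is under your control.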
