Using XPath in Python with LXML

Posted 2023-02-23


Title: Using XPath in Python with LXML
Asked: 2017-03-29 20:29:55
Question:

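I have a Python script that parses XML and exports certain elements of interest to a CSV file. I am now trying to change the script so that it filters the XML file by a condition. The equivalent XPath query would be: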

/DC/Events/Confirmation[contains(TransactionId,"GTEREVIEW")]

When I try to do this with lxml, my code is:

xml_file = lxml.etree.parse(xml_file_path)
namespace = "{" + xml_file.getroot().nsmap[None] + "}"
node_list = xml_file.findall(namespace + "Events/" + namespace + "Confirmation[TransactionId='*GTEREVIEW*']")

But this does not seem to work. Can anyone help? Sample XML file:

<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>    
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>    
</Events>

So I want all "Confirmation" nodes whose transaction ID contains the string "GTEREVIEW". Thanks.

Comments:

Where is your XML file?

I have updated the question.

Answer 1:

findall() does not support XPath expressions, only ElementPath (see https://web.archive.org/web/20200504162744/http://effbot.org/zone/element-xpath.htm). ElementPath does not support searching for elements that contain a given string.
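To illustrate the limitation, here is a minimal sketch that assumes the sample XML from the question is parsed from an inline string (the `SAMPLE` constant and variable names are illustrative, not from the original post). ElementPath treats the asterisks in the predicate as literal characters, so only an exact text comparison can match:

```python
from lxml import etree

# Sample document from the question, inlined for the sake of the example.
SAMPLE = b"""<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(SAMPLE)

# The asterisks are not wildcards here: ElementPath compares the child text
# against the literal string '*GTEREVIEW*', so nothing matches.
wildcard_hits = root.findall("Confirmation[TransactionId='*GTEREVIEW*']")

# An exact text comparison is supported and does match.
exact_hits = root.findall("Confirmation[TransactionId='GTEREVIEW2012']")
exact_ids = [e.findtext("TransactionId") for e in exact_hits]

print(wildcard_hits)
print(exact_ids)
```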

Why not use XPath? Assuming the file test.xml contains your sample XML, the following works:

> python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from lxml import etree
>>> tree=etree.parse("test.xml")
>>> tree.xpath("Confirmation[starts-with(TransactionId, 'GTEREVIEW')]")
[<Element Confirmation at 0x7f68b16c3c20>]
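The session above uses starts-with(), while the question asked for a substring match. As a hedged sketch of the same idea with contains(), assuming the sample XML is parsed from an inline string rather than from test.xml (the `SAMPLE` and `transaction_ids` names are illustrative):

```python
from lxml import etree

# Sample document from the question, inlined instead of read from test.xml.
SAMPLE = b"""<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(SAMPLE)

# contains() matches the substring anywhere in the TransactionId text,
# whereas starts-with() only matches at the beginning of the text.
matches = root.xpath("Confirmation[contains(TransactionId, 'GTEREVIEW')]")
transaction_ids = [m.findtext("TransactionId") for m in matches]

print(transaction_ids)
```

For this sample both functions select the same element, because the only matching ID happens to begin with GTEREVIEW.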

If you insist on using findall(), the best you can do is get a list of all Confirmation elements that have a TransactionId child node:

>>> tree.findall("Confirmation[TransactionId]")
[<Element Confirmation at 0x7f68b16c3c20>, <Element Confirmation at 0x7f68b16c3ea8>]

You then need to filter this list manually, for example:

>>> [e for e in tree.findall("Confirmation[TransactionId]")
     if e[0].text.startswith('GTEREVIEW')]
[<Element Confirmation at 0x7f68b16c3c20>]
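The list comprehension above relies on TransactionId being the first child (e[0]) and on its text not being empty. A slightly more defensive sketch, under the assumption that some records may have an empty TransactionId (the helper name `confirmations_containing` and the extra empty record are hypothetical, added for illustration):

```python
from lxml import etree

# Sample from the question plus one hypothetical record with an empty ID.
SAMPLE = b"""<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId/>
  </Confirmation>
</Events>"""


def confirmations_containing(parent, needle):
    """Return Confirmation children whose TransactionId text contains needle."""
    result = []
    for conf in parent.findall("Confirmation[TransactionId]"):
        # Look the child up by name instead of by position, and fall back
        # to an empty string so an empty element cannot cause an error.
        text = conf.findtext("TransactionId") or ""
        if needle in text:
            result.append(conf)
    return result


root = etree.fromstring(SAMPLE)
hits = confirmations_containing(root, "GTEREVIEW")
hit_ids = [h.findtext("TransactionId") for h in hits]
print(hit_ids)
```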

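If your document contains namespaces, the following gives you all Confirmation elements that have a TransactionId child node, provided these elements use the default namespace (I used xmlns="file:xyz" as the default namespace):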

>>> tree.findall("//{{{0}}}Confirmation[{{{0}}}TransactionId]".format(tree.getroot().nsmap[None]))
[<Element {file:xyz}Confirmation at 0x7f534a85d1b8>, <Element {file:xyz}Confirmation at 0x7f534a85d128>]

And of course there is etree.ETXPath:

>>> find=etree.ETXPath("//{{{0}}}Confirmation[starts-with({{{0}}}TransactionId, 'GTEREVIEW')]".format(tree.getroot().nsmap[None]))
>>> find(tree)
[<Element {file:xyz}Confirmation at 0x7f534a85d1b8>]

This allows you to combine XPath and namespaces.
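Another option, not mentioned in the answer, is to keep using xpath() and bind the default namespace to a prefix of your own choosing, since XPath 1.0 has no notion of a default namespace. A minimal sketch, assuming the same xmlns="file:xyz" default namespace as above (the prefix `d` and the `NS_SAMPLE` constant are arbitrary, illustrative choices):

```python
from lxml import etree

# Sample document with a default namespace, as in the answer's example.
NS_SAMPLE = b"""<Events xmlns="file:xyz">
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(NS_SAMPLE)

# nsmap[None] holds the default namespace URI; XPath needs it under a prefix.
ns = {"d": root.nsmap[None]}

# Unprefixed names refer to elements in no namespace, so this finds nothing.
unprefixed = root.xpath("//Confirmation")

# With the prefix bound, the substring filter works as expected.
matches = root.xpath(
    "//d:Confirmation[contains(d:TransactionId, 'GTEREVIEW')]",
    namespaces=ns,
)
ids = [m.findtext("d:TransactionId", namespaces=ns) for m in matches]

print(unprefixed)
print(ids)
```

The empty result for the unprefixed query is the same symptom as an XPath call returning an empty list on a namespaced document.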

Comments:

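Thank you very much for your answer! Sadly, my document involves a namespace, which causes the XPath query to return an empty list. After removing the namespace from the file, the code seems to work. Is there a way around this? The file basically starts with tradefinder.db.com/Schemas/MEL/CapitaHorizon_0_9_2.xsd" xmlns:xs="w3.org/2001/XMLSchema"> and ends with the matching closing tag.

I thought so. You can still use the second approach with findall(); you just need to filter the returned list of nodes.

Answer 2: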

//Confirmation[TransactionId[contains(.,'GTEREVIEW')]]


father_tag[child_tag]  # select father_tag elements that have a child_tag child
[child_tag[filter]]    # restrict to child_tag elements that match filter
[filter]               # the condition itself, e.g. contains(., 'GTEREVIEW')
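The nested predicate is not just a stylistic variant. In XPath 1.0, contains(TransactionId, ...) converts the node set to a string using only its first node, while TransactionId[contains(., ...)] tests every child. A small sketch of the difference, using a hypothetical record with two TransactionId children (this input shape is an assumption, not part of the question's data):

```python
from lxml import etree

# Hypothetical record with two TransactionId children; only the second
# one contains the substring we are looking for.
doc = etree.fromstring(b"""<Events>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
</Events>""")

# Flat form: only the first TransactionId is considered, so no match.
flat = doc.xpath("//Confirmation[contains(TransactionId, 'GTEREVIEW')]")

# Nested form: each TransactionId is tested on its own, so the record matches.
nested = doc.xpath("//Confirmation[TransactionId[contains(., 'GTEREVIEW')]]")

print(len(flat), len(nested))
```

When every Confirmation has exactly one TransactionId, as in the question's sample, both forms select the same elements.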
