Parsing XML and extracting attribute values with XPath

Posted: 2021-07-31 13:45:21

Question:

I want to use XPath to extract N.1.2, N.1.1, N.2.r.1, ..., N.1.3, N.1.4.

The XPath expressions are stored in a dictionary:

# Value - Types of Message in batch
"N.1.1": R3Item(
    elemId="N.1.1",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/name[@codeSystem='2.16.840.1.113883.3.989.2.1.1.1']/@code",
    required=True,
    comment="N.1.1 - Types of Message in batch",
),
# Types of Message in batch
"N.1.1_csv": R3Item(
    elemId="N.1.1_csv",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/name[@codeSystem='2.16.840.1.113883.3.989.2.1.1.1']/@codeSystemVersion",
    required=True,
),
# Value - Batch Number
"N.1.2": R3Item(
    elemId="N.1.2",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/id[@root='2.16.840.1.113883.3.989.2.1.3.22']/@extension",
    required=True,
    comment="N.1.2 - Batch Number",
),
# Value - Batch Sender Identifier
"N.1.3": R3Item(
    elemId="N.1.3",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/sender[@typeCode='SND']/device[@classCode='DEV'][@determinerCode='INSTANCE']/id[@root='2.16.840.1.113883.3.989.2.1.3.13'][1]/@extension",
    required=True,
    comment="N.1.3 - Batch Sender Identifier",
),
# Value - Batch Receiver Identifier
"N.1.4": R3Item(
    elemId="N.1.4",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/receiver[@typeCode='RCV']/device[@classCode='DEV'][@determinerCode='INSTANCE']/id[@root='2.16.840.1.113883.3.989.2.1.3.14'][1]/@extension",
    required=True,
    comment="N.1.4 - Batch Receiver Identifier",
),
# Value - Date of Batch Transmission
"N.1.5": R3Item(
    elemId="N.1.5",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/creationTime/@value",
    required=True,
    comment="N.1.5 - Date of Batch Transmission",
),
# Value - Message Identifier
"N.2.r.1": R3Item(
    elemId="N.2.r.1",
    xPath="//PORR_IN049016UV[r]/id[@root='2.16.840.1.113883.3.989.2.1.3.1'][1]/@extension",
    required=True,
    comment="N.2.r.1 - Message Identifier",
),
# Value - Message Sender Identifier
"N.2.r.2": R3Item(
    elemId="N.2.r.2",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/PORR_IN049016UV[r]/sender[@typeCode='SND']/device[@classCode='DEV'][@determinerCode='INSTANCE']/id[@root='2.16.840.1.113883.3.989.2.1.3.11'][1]/@extension",
    required=True,
    comment="N.2.r.2 - Message Sender Identifier",
),
# Value - Message Receiver Identifier
"N.2.r.3": R3Item(
    elemId="N.2.r.3",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/PORR_IN049016UV[r]/receiver[@typeCode='RCV']/device[@classCode='DEV'][@determinerCode='INSTANCE']/id[@root='2.16.840.1.113883.3.989.2.1.3.12'][1]/@extension",
    required=True,
    comment="N.2.r.3 - Message Receiver Identifier",
),
# Value - Date of Message Creation
"N.2.r.4": R3Item(
    elemId="N.2.r.4",
    xPath="/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/PORR_IN049016UV[r]/creationTime/@value",
    required=True,
    comment="N.2.r.4 - Date of Message Creation",
),

Below is part of a sample XML file:

<?xml version="1.0" encoding="UTF-8"?>
<MCCI_IN200100UV01 ITSVersion="XML_1.0" xsi:schemaLocation="urn:hl7-org:v3 MCCI_IN200100UV01.xsd" xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <id extension="N.1.2" root="2.16.840.1.113883.3.989.2.1.3.22"/>
    <creationTime value="N.1.5"/>
    <responseModeCode code="D"/>
    <interactionId extension="MCCI_IN200100UV01" root="2.16.840.1.113883.1.6"/>
    <name code="N.1.1" codeSystem="2.16.840.1.113883.3.989.2.1.1.1" codeSystemVersion="1.01"/>
    <PORR_IN049016UV>
        <id extension="N.2.r.1" root="2.16.840.1.113883.3.989.2.1.3.1"/>
        <creationTime value="N.2.r.4"/>
        <interactionId extension="PORR_IN049016UV" root="2.16.840.1.113883.1.6"/>
        <processingCode code="P"/>
        <processingModeCode code="T"/>
        <acceptAckCode code="AL"/>
        <receiver typeCode="RCV">
            <device classCode="DEV" determinerCode="INSTANCE">
                <id extension="N.2.r.3" root="2.16.840.1.113883.3.989.2.1.3.12"/>
            </device>
        </receiver>
    </PORR_IN049016UV>
    <receiver typeCode="RCV">
        <device classCode="DEV" determinerCode="INSTANCE">
            <id extension="N.1.4" root="2.16.840.1.113883.3.989.2.1.3.14"/>
        </device>
    </receiver>
    <sender typeCode="SND">
        <device classCode="DEV" determinerCode="INSTANCE">
            <id extension="N.1.3" root="2.16.840.1.113883.3.989.2.1.3.13"/>
        </device>
    </sender>
</MCCI_IN200100UV01>

Below is my code, but the result is an empty list. I want to extract values such as "N.1.1".

def extractData(tree):
    """r3 data extracted by xpath"""
    root = tree.getroot()
    keys = getList(R3_DATA)
    for key in keys:
        xPath = getxPath(key)
        print(root.xpath(xPath))

How can I fix this, or what should I do instead? If another library or some sample code can do this, could you point me to it?

Comments:

getvalue returns the path stored in the dictionary for a key.

The elements are in the namespace xmlns="urn:hl7-org:v3", so your XPath evaluation code needs to take that namespace into account. In some older lxml versions, but not the latest, I think you could use root.xpath(xPath, namespaces={None: 'urn:hl7-org:v3'}).

That raises an error (TypeError: empty namespace prefix is not supported in XPath): root.xpath("/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/PORR_IN049016UV[1]/sender[@typeCode='SND' ]/device[@classCode='DEV'][@determinerCode='INSTANCE']/id[@root='2.16.840.1.113883.3.989.2.1.3.11'][1]/@extension", namespaces={None: 'urn:hl7-org:v3', "xsi": "w3.org/2001/XMLSchema-instance"})

Every element in each path needs a namespace prefix.

Answer 1:

As mentioned in the comments, your XPath needs namespaces. Here is an example of how to use namespaces in lxml. Note the u: and x: prefixes in the XPath.

In [1]: from lxml import etree

In [2]: root = etree.parse('mcci.xml')

In [3]: NS = {'u': 'urn:hl7-org:v3', 'x': 'http://www.w3.org/2001/XMLSchema-instance'}

In [4]: xpath = "/u:MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@x:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']/u:creationTime/@value"

In [5]: root.xpath(xpath, namespaces=NS)
Out[5]: ['N.1.5']
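
Editing every dictionary entry by hand to add the prefix is tedious, so the prefixes can also be added mechanically. The following is a minimal sketch, not part of the original answer: the helper name add_prefix and the regular expression are assumptions for illustration. It only handles absolute paths like the ones in the question (element names that directly follow "/"), leaves attributes and quoted strings untouched, and does not look inside predicates.

```python
import re

# Matches either a quoted string literal (group 1, left untouched) or an
# element name that directly follows "/" (group 2). Names followed by ":"
# (already prefixed) or "(" (node tests such as text()) are skipped.
_STEP = re.compile(
    r"('[^']*'|\"[^\"]*\")"
    r"|(?<=/)([A-Za-z_][\w.\-]*)(?![\w.\-])(?!\s*[:(])"
)


def add_prefix(xpath, prefix="u"):
    """Prefix every unprefixed element step of an absolute XPath.

    Hypothetical helper: `prefix` must be a key of the namespaces
    dictionary that is later passed to lxml's xpath().
    """
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # string literal: keep as is
        return f"{prefix}:{match.group(2)}"

    return _STEP.sub(repl, xpath)


path = "/MCCI_IN200100UV01[@ITSVersion='XML_1.0']/creationTime/@value"
print(add_prefix(path))
# /u:MCCI_IN200100UV01[@ITSVersion='XML_1.0']/u:creationTime/@value
```

The rewritten string can then be passed to root.xpath(..., namespaces=NS). If the dictionary paths keep their xsi: attribute prefix, NS also needs an 'xsi' key mapped to 'http://www.w3.org/2001/XMLSchema-instance'.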

I would suggest dropping the predicate on the schema location to simplify things.

In [6]: NS = {'u': 'urn:hl7-org:v3'}

In [7]: xpath = "/u:MCCI_IN200100UV01[@ITSVersion='XML_1.0']/u:creationTime/@value"

In [8]: root.xpath(xpath, namespaces=NS)
Out[8]: ['N.1.5']
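
The same simplification can be applied to the dictionary entries in code rather than by hand. This is a rough sketch under two assumptions that the question does not confirm: the helper name simplify is made up, and "[r]" in the N.2.r.* paths is treated as a placeholder for the 1-based position of the repeated message (one of the comments substitutes [1] in the same place). Left as is, [r] would be read by XPath as "has a child element named r" and match nothing.

```python
import re

# Predicate on xsi:schemaLocation, written as in the question's dictionary.
_SCHEMA_PRED = re.compile(r"\[@xsi:schemaLocation='[^']*'\]")


def simplify(xpath, r=1):
    """Drop the schemaLocation predicate and fill in the [r] placeholder."""
    xpath = _SCHEMA_PRED.sub("", xpath)
    # ASSUMPTION: "[r]" stands for the 1-based index of the repeated message.
    return xpath.replace("[r]", f"[{r}]")


raw = (
    "/MCCI_IN200100UV01[@ITSVersion='XML_1.0']"
    "[@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']"
    "/PORR_IN049016UV[r]/creationTime/@value"
)
print(simplify(raw, r=1))
# /MCCI_IN200100UV01[@ITSVersion='XML_1.0']/PORR_IN049016UV[1]/creationTime/@value
```

Once the schemaLocation predicate is gone, the xsi prefix no longer has to be declared. The element names in the result still need a namespace prefix before lxml will match them.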
