How to parse XML and count instances of a particular node attribute?

Posted 2023-02-16

【Posted】: 2010-12-27 02:43:41

【Question】:

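I have many rows in a database that contain XML, and I'm trying to write a Python script to count instances of a particular node attribute.

My tree looks like: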

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I access the attributes "1" and "2" in the XML using Python?

【Discussion】:

Related: Python xml ElementTree from a string source?

【Answer 1】:

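I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; in this context, however, what they chiefly add is more speed. The ease-of-programming part depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like: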

import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown in the ElementTree documentation. Then do something like:

for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)

And similar, usually pretty simple, code patterns.
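
Since the question asks for a count rather than a printout, the loop above can be reduced to a couple of lines. The following is a minimal sketch, assuming the XML is already available as a string (the name xml_text is made up for this example). It uses ET.fromstring and an attribute predicate in the path, which is part of the limited XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the sample tree from the question, held in a string.
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

root = ET.fromstring(xml_text)

# All <type> elements directly under <bar>.
all_types = root.findall("bar/type")

# Only the <type> elements whose foobar attribute equals "1".
ones = root.findall("bar/type[@foobar='1']")

values = [t.get("foobar") for t in all_types]
print(len(all_types), len(ones), values)
```

findall returns a plain list, so len() gives the count directly.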

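【Discussion】:

You seem to be ignoring xml.etree.cElementTree, which comes with Python and in some respects is faster than lxml ("lxml's iterparse() is slightly slower than the one in cET", from an e-mail by the lxml author).

ElementTree works and is included with Python. XPath support is limited, though, and you can't traverse up to the parent of an element, which can slow development down (especially if you don't know this). See python xml query get parent for details.

lxml adds more than speed. It provides easy access to information such as the parent node, the line number in the XML source, and so on, which can be very useful in several scenarios.

It seems that ElementTree has some vulnerability issues; this is a quote from the docs:

Warning The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

@Cristik This seems to be the case with most XML parsers, see the XML vulnerabilities page.

【Answer 2】:

minidom is the quickest and pretty straightforward.

XML: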

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

Output:

4
item1
item1
item2
item3
item4
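
getElementsByTagName searches the whole subtree of the node it is called on, which is why the code above finds item from the top of the document. If only one group of items is wanted, one option is to look up the container element first and search below it. This is a rough sketch; the secondSetOfItems element and the shortened sample are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical document with two groups that both contain <item> nodes.
xml_text = """<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
    </items>
    <secondSetOfItems>
        <item name="other1"></item>
    </secondSetOfItems>
</data>"""

xmldoc = minidom.parseString(xml_text)

# Narrow the search to the first <items> element, then look for <item> below it.
items_node = xmldoc.getElementsByTagName("items")[0]
scoped = items_node.getElementsByTagName("item")

names = [node.getAttribute("name") for node in scoped]
print(names)
```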

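【Discussion】:

How do you get the value of "item1"? For example: Value1

Where is the documentation for minidom? I only found this, but it doesn't help: docs.python.org/2/library/xml.dom.minidom.html

I am also confused about why it finds item straight from the top level of the document. Wouldn't it be cleaner if you supplied the path (data->items)? What if you also had data->secondSetOfItems that also had nodes named item, and you wanted to list only one of the two sets of item?

Please see ***.com/questions/21124018/…

The syntax won't work here, you need to remove the parentheses: for s in itemlist: print(s.attributes['name'].value)

【Answer 3】:

You can use BeautifulSoup: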

from bs4 import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

>>> y = BeautifulSoup(x, features="xml")  # the "xml" feature requires lxml to be installed
>>> y.foo.bar.type["foobar"]
'1'

>>> y.foo.bar.find_all("type")
[<type foobar="1"/>, <type foobar="2"/>]

>>> y.foo.bar.find_all("type")[0]["foobar"]
'1'
>>> y.foo.bar.find_all("type")[1]["foobar"]
'2'

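【Discussion】:

Three years later, with bs4, this is a great solution. It is very flexible, especially if the source is not well formed.

@YOU BeautifulStoneSoup is depreciated. Just use BeautifulSoup(source_xml, features="xml")

Another 3 years later, I just tried to load XML using ElementTree. Unfortunately it is unable to parse it unless I adjust the source in places, but BeautifulSoup worked right away without any changes!

@andi You mean "deprecated". "Depreciated" means it decreased in value, usually due to age or wear and tear from normal use.

Another 3 years, and now BS4 is not fast enough. It takes ages. Looking for any faster solutions.

【Answer 4】:

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the following table, copied from the cElementTree website: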

library                         time      space
xml.dom.minidom (Python 2.1)    6.3 s     80000K
gnosis.objectify                2.0 s     22000K
xml.dom.minidom (Python 2.4)    1.4 s     53000K
ElementTree 1.2                 1.6 s     14500K
ElementTree 1.2.4/1.3           1.1 s     14500K
cDomlette (C extension)         0.540 s   20500K
PyRXPU (C extension)            0.175 s   10850K
libxml2 (C extension)           0.098 s   16000K
readlines (read as utf-8)       0.093 s   8850K
cElementTree (C extension)  --> 0.047 s   4900K <--
readlines (read as ascii)       0.032 s   5050K
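
Figures like these depend on the machine and on library versions, so it can be worth repeating the comparison on your own data. Below is a rough sketch using timeit with a synthetic document; the document size, the repetition count and the function names are arbitrary assumptions, and only the two standard-library parsers are compared.

```python
import timeit
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Synthetic document with 2000 <type> elements; the size is chosen arbitrarily.
xml_text = "<foo><bar>" + "".join(
    '<type foobar="%d"/>' % i for i in range(2000)
) + "</bar></foo>"

def with_elementtree():
    # Parse and count with ElementTree.
    return len(ET.fromstring(xml_text).findall("bar/type"))

def with_minidom():
    # Parse and count with minidom.
    return len(minidom.parseString(xml_text).getElementsByTagName("type"))

et_seconds = timeit.timeit(with_elementtree, number=5)
dom_seconds = timeit.timeit(with_minidom, number=5)
print("ElementTree: %.4f s, minidom: %.4f s" % (et_seconds, dom_seconds))
```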

As pointed out by @jfs, cElementTree comes bundled with Python:

Python 2: from xml.etree import cElementTree as ElementTree.
Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).
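
When memory use is the main concern, ElementTree also offers incremental parsing through iterparse. The sketch below is an illustration under assumptions: the in-memory BytesIO stands in for a large file on disk, and clearing each element after it has been handled is one common way to limit memory growth, not the only one.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file; iterparse accepts a filename or a file object.
source = io.BytesIO(b"""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>""")

count = 0
for event, elem in ET.iterparse(source, events=("end",)):
    # The "end" event fires once the element has been read completely.
    if elem.tag == "type" and elem.get("foobar") is not None:
        count += 1
    # Discard the element's attributes and children once it has been handled.
    elem.clear()

print(count)
```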

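【Discussion】:

Are there any downsides to using cElementTree? It seems to be a no-brainer.

Apparently they don't want the library to be used on OS X, as I have spent over 15 minutes trying to figure out where to download it from and no link works. Lack of documentation prevents good projects from thriving; I wish more people would realize that.

@Stunner: it is in the stdlib, i.e. you don't need to download anything. On Python 2: from xml.etree import cElementTree as ElementTree. On Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).

@mayhewsw It takes more effort to figure out how to use ElementTree efficiently for a particular task. For documents that fit in memory, it is a lot easier to use minidom, and it works fine for smaller XML documents.

【Answer 5】:

For simplicity, I suggest xmltodict.

It parses your XML into an OrderedDict: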

>>> e = '''<foo>
             <bar>
                 <type foobar="1"/>
                 <type foobar="2"/>
             </bar>
        </foo> '''

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

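【Discussion】:

Agreed. If you don't need XPath or anything complicated, this is much simpler to use (especially in the interpreter); it is handy for REST APIs that publish XML instead of JSON.

Remember that OrderedDict does not support duplicate keys. Most XML is chock-full of multiple siblings of the same type (say, all the paragraphs in a section, or all the types in your bar). So this will only work for very limited special cases.

@TextGeek In this case, result["foo"]["bar"]["type"] is a list of all <type> elements, so it still works (even though the structure may be a bit unexpected).

I just realized there have been no updates since 2019. We need to find an active fork.

【Answer 6】:

lxml.objectify is really simple.

Taking your sample text (assumed here to be held in a variable named text):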

from lxml import objectify
from collections import defaultdict

count = defaultdict(int)

root = objectify.fromstring(text)

for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))

Output:

{'1': 1, '2': 1}
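
The same tally can be written with collections.Counter. The sketch below swaps lxml.objectify for the standard-library ElementTree so that it runs without extra installs; the sample text has one extra <type> element added (an assumption for illustration) so that the counts differ.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Sample text from the question, with one extra <type foobar="1"/> added.
text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
      <type foobar="1"/>
   </bar>
</foo>"""

root = ET.fromstring(text)

# Counter tallies each distinct attribute value in one pass.
count = Counter(item.get("foobar") for item in root.iter("type"))
print(dict(count))
```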

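【Discussion】:

count stores the count of each item in a dictionary with default keys, so you don't have to check for membership. You can also try looking at collections.Counter.

【Answer 7】:

Python has an interface to the expat XML parser.

xml.parsers.expat

It is a non-validating parser, so bad XML will not be caught. But if you know your file is correct, this works well: you will probably get the exact information you want, and you can discard the rest on the fly.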

from xml.parsers import expat

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
count = 0
def start(name, attr):
    # Called once for every opening tag
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)

print(count)  # prints 4
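
The start handler also receives the element's attributes as a dict, so the same approach can count attribute values rather than tags. The following is a minimal sketch under assumptions: the foobar attribute comes from the question's sample, and a Counter replaces the global integer.

```python
from collections import Counter
from xml.parsers import expat

# Sample modelled on the question; the attribute values are illustrative.
stringofxml = """<foo>
    <bar>
        <type foobar="1" />
        <type foobar="2" />
    </bar>
    <bar>
        <type foobar="1" />
    </bar>
</foo>"""

counts = Counter()

def start(name, attrs):
    # attrs maps attribute names to values for the element being opened.
    if name == "type" and "foobar" in attrs:
        counts[attrs["foobar"]] += 1

parser = expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse(stringofxml, True)  # True marks this as the final chunk of input

print(dict(counts))
```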

【Discussion】:

【Answer 8】:

Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here is an example:

Installation:

pip install untangle

Usage:

Your XML file (slightly changed):

<foo>
   <bar name="bar_name">
      <type foobar="1"/>
   </bar>
</foo>

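Accessing the attributes with untangle:

import untangle

obj = untangle.parse('/path_to_xml_file/file.xml')

print(obj.foo.bar['name'])
print(obj.foo.bar.type['foobar'])

The output will be:

bar_name
1

More information about untangle can be found in "untangle".

Also, if you are curious, you can find a list of tools for working with XML and Python in "Python and XML". You will also see that the most common ones were mentioned in previous answers.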

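【Discussion】:

What makes untangle different from minidom?

I can't tell you the difference between those two, as I have not worked with minidom.

【Answer 9】:

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward: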

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.dictionary('bar', [
        xml.array(xml.integer('type', attribute='foobar'))
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML:

data = {'bar': {
    'foobar': [7, 3, 21, 16, 11]
}}

xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output:

<?xml version="1.0" ?>
<foo>
    <bar>
        <type foobar="7"/>
        <type foobar="3"/>
        <type foobar="21"/>
        <type foobar="16"/>
        <type foobar="11"/>
    </bar>
</foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

import declxml as xml

class Bar:

    def __init__(self):
        self.foobars = []

    def __repr__(self):
        return 'Bar(foobars={})'.format(self.foobars)


xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.user_object('bar', Bar, [
        xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the following output:

{'bar': Bar(foobars=[1, 2])}

【Discussion】:

【Answer 10】:

Here is very simple but effective code using cElementTree.

import sys

try:
    # Python 2: the C accelerator is a separate module
    import xml.etree.cElementTree as ET
except ImportError:
    # Python 3 picks the C accelerator automatically; cElementTree was removed in 3.9
    import xml.etree.ElementTree as ET

def exit_err(message):
    # Print the message to stderr and stop with a non-zero exit status
    sys.exit(message)

def find_in_tree(tree, node):
    found = tree.find(node)
    if found is None:
        print("No %s in file" % node)
        found = []
    return found

# Parse an XML file (specify the path)
def_file = "xml_file_name.xml"
try:
    root = ET.parse(def_file).getroot()
except (IOError, ET.ParseError):
    exit_err("Unable to open and parse input definition file: " + def_file)

# Find the node 'myNode'; iterating over the result yields its child nodes
fwdefs = find_in_tree(root, "myNode")

This is from "python xml parse".

【Discussion】:

【Answer 11】:

XML:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

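Python code:

import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

tree = ET.parse("foo.xml")
root = tree.getroot()
root_tag = root.tag
print(root_tag)

# Print every attribute value of each <type> element under <bar>
for form in root.findall("./bar/type"):
    for value in form.attrib.values():
        print(value)

Output: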

foo
1
2

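【Discussion】:

【Answer 12】:

If you use python-benedict, there is no need to use a library-specific API. Just initialize a new instance from your XML and manage it easily, since it is a dict subclass.

Installation is easy: pip install python-benedict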

from benedict import benedict as bdict

# data-source can be an url, a filepath or data-string (as in this example)
data_source = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

data = bdict.from_xml(data_source)
t_list = data['foo.bar.type'] # yes, keypath supported; this is the list of <type> entries
for t in t_list:
   print(t['@foobar'])

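It supports and normalizes I/O operations with many formats: Base64, CSV, JSON, TOML, XML, YAML and query-string.

It is well tested and open source on GitHub. Disclosure: I am the author.

【Discussion】:

【Answer 13】:

xml.etree.ElementTree vs. lxml

These are some pros of the two most used libraries that I would have benefited from knowing before choosing between them.

xml.etree.ElementTree:

    Standard library

lxml:

    XML declaration, for example standalone="no"
    Pretty printing (indentation)
    Objectify (.node)
    sourceline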
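
For comparison, recent versions of the standard library cover part of this list. The sketch below assumes Python 3.9 or later (ET.indent was added in 3.9, and the xml_declaration argument of ET.tostring in 3.8). It only shows pretty printing and the XML declaration; it does not address standalone, objectify or sourceline.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

# Pretty printing: ET.indent inserts whitespace into the tree in place.
ET.indent(root, space="   ")

# XML declaration: xml_declaration=True prepends it to the serialized output.
pretty = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
print(pretty)
```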

【Discussion】:

【Answer 14】:

import xml.etree.ElementTree as ET
data = '''<foo>
           <bar>
               <type foobar="1"/>
               <type foobar="2"/>
          </bar>
       </foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
    print(item.get('foobar'))

This will print the value of the foobar attribute.
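
Because the question mentions XML stored in database rows, the same parsing step can be run once per row and the results accumulated. This is only a sketch under assumptions: the in-memory SQLite database, the table name docs and the column name body are all made up for illustration, and any DB-API driver could take their place.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical schema: a table "docs" holding one XML document per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>',),
        ('<foo><bar><type foobar="1"/></bar></foo>',),
    ],
)

totals = Counter()
for (body,) in conn.execute("SELECT body FROM docs"):
    root = ET.fromstring(body)
    # Add this row's attribute values to the running totals.
    totals.update(item.get("foobar") for item in root.findall("bar/type"))

conn.close()
print(dict(totals))
```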

【Discussion】:

【Answer 15】:

simplified_scrapy: a new lib. I fell in love with it after using it, and I recommend it to you.

from simplified_scrapy import SimplifiedDoc
xml = '''
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
'''

doc = SimplifiedDoc(xml)
types = doc.selects('bar>type')
print (len(types)) # 2
print (types.foobar) # ['1', '2']
print (doc.selects('bar>type>foobar()')) # ['1', '2']

Here are more examples. This lib is easy to use.

【Discussion】:

【Answer 16】:

# If the XML is available as a (byte) string, as shown below:
from lxml import etree

# Sample XML that uses the namespace http://xmlns.abc.com
message = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n'

print('************message conversion and parsing starts*************')

message = message.decode('utf-8')
# lxml rejects str input that carries an encoding declaration, so strip it
message = message.replace('<?xml version="1.0" encoding="UTF-8"?>\r\n', '')
message = message.replace('pa:Process>\r\n', 'pa:Process>')
print(message)

print('******Parsing starts*************')
parser = etree.XMLParser(remove_blank_text=True)  # drops ignorable whitespace; the namespace is kept
root = etree.fromstring(message, parser)  # parsing of the XML happens here
print('******Parsing completed************')

result = {}
for child in root:  # iterate over the parsed tree and store values in a dictionary
    print(child.tag, child.text)
    print('****Deriving from xml tree*****')
    # lxml reports namespaced tags in the form {namespace}localname
    if child.tag == "{http://xmlns.abc.com}firsttag":
        result["FIRST_TAG"] = child.text
        print(result)


### output
'''************message conversion and parsing starts*************
<pa:Process xmlns:pa="http://xmlns.abc.com">

    <pa:firsttag>SAMPLE</pa:firsttag></pa:Process>
******Parsing starts*************
******Parsing completed************
{http://xmlns.abc.com}firsttag SAMPLE
****Deriving from xml tree*****
{'FIRST_TAG': 'SAMPLE'}'''

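【Discussion】:

Please also include some context explaining how your answer solves the problem. Code-only answers are discouraged.

【Answer 17】:

If you don't want to use any external libraries or third-party tools, please try the code below.

This will parse XML into a Python dictionary. It will parse XML attributes as well. It will also parse empty tags like <tag/> and tags with only attributes like <tag var=val/>.

Code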

import re

def getdict(content):
    # Each match is a tuple: (tag name, raw attribute text, inner content)
    res=re.findall(r"<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))",content)
    if len(res)>=1:
        attreg=r"(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"
        if len(res)>1:
            return [{i[0]:[{"@attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,i[1].strip())]},{"$values":getdict(i[2])}]} for i in res]
        else:
            tag,attrs,value=res[0]  # single match: unpack its one tuple
            return {tag:[{"@attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,attrs.strip())]},{"$values":getdict(value)}]}
    else:
        return content

with open("test.xml","r") as f:
    print(getdict(f.read().replace('\n','')))

Sample input

<details class="4b" count=1 boy>
    <name type="firstname">John</name>
    <age>13</age>
    <hobby>Coin collection</hobby>
    <hobby>Stamp collection</hobby>
    <address>
        <country>USA</country>
        <state>CA</state>
    </address>
</details>
<details empty="True"/>
<details/>
<details class="4a" count=2 girl>
    <name type="firstname">Samantha</name>
    <age>13</age>
    <hobby>Fishing</hobby>
    <hobby>Chess</hobby>
    <address current="no">
        <country>Australia</country>
        <state>NSW</state>
    </address>
</details>

Output (beautified)

[
  {"details": [
    {"@attributes": [{"class": "4b"}, {"count": "1"}, {"boy": ""}]},
    {"$values": [
      {"name": [{"@attributes": [{"type": "firstname"}]}, {"$values": "John"}]},
      {"age": [{"@attributes": []}, {"$values": "13"}]},
      {"hobby": [{"@attributes": []}, {"$values": "Coin collection"}]},
      {"hobby": [{"@attributes": []}, {"$values": "Stamp collection"}]},
      {"address": [
        {"@attributes": []},
        {"$values": [
          {"country": [{"@attributes": []}, {"$values": "USA"}]},
          {"state": [{"@attributes": []}, {"$values": "CA"}]}
        ]}
      ]}
    ]}
  ]},
  {"details": [{"@attributes": [{"empty": "True"}]}, {"$values": ""}]},
  {"details": [{"@attributes": []}, {"$values": ""}]},
  {"details": [
    {"@attributes": [{"class": "4a"}, {"count": "2"}, {"girl": ""}]},
    {"$values": [
      {"name": [{"@attributes": [{"type": "firstname"}]}, {"$values": "Samantha"}]},
      {"age": [{"@attributes": []}, {"$values": "13"}]},
      {"hobby": [{"@attributes": []}, {"$values": "Fishing"}]},
      {"hobby": [{"@attributes": []}, {"$values": "Chess"}]},
      {"address": [
        {"@attributes": [{"current": "no"}]},
        {"$values": [
          {"country": [{"@attributes": []}, {"$values": "Australia"}]},
          {"state": [{"@attributes": []}, {"$values": "NSW"}]}
        ]}
      ]}
    ]}
  ]}
]
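
For well-formed input, a similar dictionary can be built without regular expressions while still staying inside the standard library. The sketch below is an alternative technique, not the code above: it uses ElementTree, the function name element_to_dict and the exact output layout are assumptions, and the sample is trimmed because unquoted attributes such as count=1 and multiple top-level elements are not well-formed XML and would be rejected by the parser.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an Element into nested dicts and lists (illustrative layout)."""
    node = {"@attributes": dict(elem.attrib)}
    children = list(elem)
    if children:
        # One single-key dict per child keeps repeated tags such as <hobby>.
        node["$values"] = [{child.tag: element_to_dict(child)} for child in children]
    else:
        node["$values"] = (elem.text or "").strip()
    return node

# Trimmed, well-formed variant of the sample input.
xml_text = """<details class="4b">
    <name type="firstname">John</name>
    <age>13</age>
    <hobby>Coin collection</hobby>
    <hobby>Stamp collection</hobby>
</details>"""

root = ET.fromstring(xml_text)
result = {root.tag: element_to_dict(root)}
print(result["details"]["@attributes"]["class"])
```

Attributes come back as one plain dict per element, which may be easier to index than a list of single-key dicts.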

【Discussion】:

This approach is nice, but the result it returns is not convenient to use.

【Answer 18】:

If the source is an XML file, say like this sample:

<pa:Process xmlns:pa="http://sssss">
        <pa:firsttag>SAMPLE</pa:firsttag>
    </pa:Process>

you may try the following code:

from lxml import etree
metadata = 'C:\\Users\\PROCS.xml'  # sample XML file; its contents are shown above
parser = etree.XMLParser(remove_blank_text=True)  # drops ignorable whitespace; it does not remove the namespace
tree = etree.parse(metadata, parser)  # parse the XML file PROCS.xml
root = tree.getroot()  # the root element, pa:Process
# Strip the "{namespace}" prefix from every tag so plain names can be compared
for elem in root.iter():
    if not hasattr(elem.tag, 'find'): continue  # skip comments and processing instructions
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]

result = {}  # dictionary that collects the extracted values
for elem in tree.iter():  # walk the whole tree
    if elem.tag == "firsttag":  # store the text of the matching tag in the dictionary
        result["FIRST_TAG"] = str(elem.text)
        print(result)

The output would be:

{'FIRST_TAG': 'SAMPLE'}
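
Rewriting the tags is not the only option: the element can also be looked up with its namespace left in place. The sketch below uses the standard-library ElementTree rather than lxml and reads the sample from a string instead of a file (both are substitutions made for illustration); the prefix chosen in the namespaces mapping is arbitrary and only has to match the prefix used in the search path.

```python
import xml.etree.ElementTree as ET

xml_text = """<pa:Process xmlns:pa="http://sssss">
    <pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>"""

root = ET.fromstring(xml_text)

# Map a prefix of our choosing to the namespace URI used in the document.
namespaces = {"pa": "http://sssss"}

first = root.find("pa:firsttag", namespaces)
result = {"FIRST_TAG": first.text}
print(result)

# The parsed tag itself is stored in the form {uri}localname.
print(first.tag)
```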

【Discussion】:
