An Overview of xml.etree.ElementTree Features

Posted 2021-03-30 lambda派

Length: about 2,200 characters. Estimated reading time: 11 to 14 minutes.

Keywords: python, xml, Extensible Markup Language, xml.etree.ElementTree

The examples below assume that a file named data.xml exists in the current working directory.
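The later sections only pin down a few facts about data.xml: the root tag is data, each people element has a name attribute and gender, age and weight children, and there are three friend elements. A hypothetical file consistent with those outputs could be created as sketched below. The layout and the assignment of friends to people are assumptions inferred from the printed results, not the original listing.

```python
# Hypothetical sample data, reconstructed from the outputs shown later in
# this article. The real data.xml may have differed in details.
from pathlib import Path

SAMPLE_XML = """<?xml version="1.0"?>
<data>
    <people name="Jack">
        <gender>male</gender>
        <age>30</age>
        <weight>65</weight>
        <friend name="Mary" relationship="lover"/>
        <friend name="Peter" relationship="classmate"/>
    </people>
    <people name="Anna">
        <gender>female</gender>
        <age>20</age>
        <weight>50</weight>
        <friend name="White" relationship="neighbour"/>
    </people>
</data>
"""

# Write the sample into the current working directory so that
# ET.parse('data.xml') in the following sections can find it.
Path("data.xml").write_text(SAMPLE_XML, encoding="utf-8")
```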

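1. Parsing XML files

(1) Reading an XML file

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')   # returns an ElementTree object
root = tree.getroot()         # returns the root Element object
# Alternatively, ET.fromstring(xml_string) parses XML from a string and returns the root Element directly.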

(2) Getting the tag of an Element

root.tag
# returns 'data'

(3) Getting the attribute dictionary of an Element

root.attrib
# returns {}

(4) Accessing child nodes by nested index

root[1][2].text
# returns '50'

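Note: not everything in the XML input ends up in the parsed tree. By default the parser skips comments, processing instructions and document type declarations. A tree that is built through the API, rather than parsed from XML text, can contain comments and processing instructions, and they are included when the tree is written out.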
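The difference between parsing text and building through the API can be seen in a minimal sketch. The element names and the comment text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# 1) Parsing from text: the comment in the input is dropped by default.
parsed = ET.fromstring("<data><!-- a comment --><item>1</item></data>")
print([child.tag for child in parsed])  # only the <item> element is a child

# 2) Building through the API: a comment node can be added explicitly
#    and is included when the tree is serialized.
built = ET.Element("data")
built.append(ET.Comment(" a comment "))
ET.SubElement(built, "item").text = "1"
serialized = ET.tostring(built, encoding="unicode")
print(serialized)
```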

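2. Pull API for non-blocking parsing

Most parsing functions in xml.etree.ElementTree have to read the whole document before they return a result. Sometimes we want to parse a document incrementally, without blocking, while still working with fully constructed Element objects. The most capable tool for this is XMLPullParser. It does not need a blocking read to obtain the XML data; instead, data is supplied piece by piece through the feed() method, and the parsed elements are retrieved with the read_events() method.

If blocking reads are acceptable but you still want to process the document incrementally, use iterparse(). It is useful for large XML documents that you do not want to hold in memory as a whole.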
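As a rough illustration of iterparse(), the sketch below scans a small in-memory document and clears each element after use. The log and entry element names and the BytesIO stand-in for a large file are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# An in-memory stand-in for a large file (hypothetical data).
source = io.BytesIO(
    b"<log>"
    b"<entry level='info'>started</entry>"
    b"<entry level='error'>disk full</entry>"
    b"<entry level='info'>stopped</entry>"
    b"</log>"
)

errors = []
# With events=("end",) each element is yielded once it has been read
# completely, so its text and attributes are final.
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "entry":
        if elem.get("level") == "error":
            errors.append(elem.text)
        # Drop the element's content once processed to keep memory use low.
        elem.clear()

print(errors)
```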

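parser = ET.XMLPullParser(['start', 'end'])
parser.feed('<a>text1 ')
parser.feed('text2</a>')
for event, elem in parser.read_events():
    print(event, elem.tag, elem.text)

# prints (the events are read only after both chunks were fed, so elem.text is already complete;
# with Expat 2.6.0 or later, events may be delayed until parser.flush() or parser.close() is called):
start a text1 text2
end a text1 text2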
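In a more realistic setting the data arrives in arbitrary chunks and events are drained after every feed() call. The following is a minimal sketch of that loop; the chunk boundaries and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Chunks as they might arrive from a socket (hypothetical data).
chunks = ["<root><it", "em>one</item><item>t", "wo</item></root>"]

parser = ET.XMLPullParser(["end"])
finished = []

for chunk in chunks:
    parser.feed(chunk)
    # read_events() yields only what the parser has completed so far
    # and never waits for more input.
    for event, elem in parser.read_events():
        finished.append((elem.tag, elem.text))

# close() signals the end of the data; any remaining events can still be read.
parser.close()
for event, elem in parser.read_events():
    finished.append((elem.tag, elem.text))

print(finished)
```

Only 'end' events are requested here, so every element is complete by the time it is inspected, regardless of how the input was split.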

3. Finding elements of interest

(1) Iterating over matching elements in the whole subtree

for friend in root.iter('friend'):
    print(friend.attrib)

# prints:
{'name': 'Mary', 'relationship': 'lover'}
{'name': 'Peter', 'relationship': 'classmate'}
{'name': 'White', 'relationship': 'neighbour'}

(2) Finding elements among the direct children

for people in root.findall('people'):
    name = people.get('name')
    gender = people.find('gender').text
    age = people.find('age').text
    weight = people.find('weight').text
    print(name, gender, age, weight)

# prints:
Jack male 30 65
Anna female 20 50
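The loop above assumes that every people element has all three children. When a child may be missing, find() returns None and .text fails. One possible way to guard against that is findtext() with a default, sketched here on an inline document in which the height element is a made-up optional field.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for data.xml; <height> is deliberately missing for Jack.
root = ET.fromstring(
    "<data>"
    "<people name='Jack'><age>30</age></people>"
    "<people name='Anna'><age>20</age><height>165</height></people>"
    "</data>"
)

rows = []
for people in root.findall("people"):
    name = people.get("name")
    age = people.findtext("age")
    # find() returns None when no child matches, so .text would raise
    # AttributeError; findtext() with a default avoids that.
    height = people.findtext("height", default="unknown")
    rows.append((name, age, height))

print(rows)
```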

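4. Modifying an XML file

ElementTree offers a simple way to build XML documents and write them to a file with the write() method. Once an Element exists, it can be manipulated directly: change its text by assigning to Element.text, add or modify attributes with Element.set(), add child nodes with Element.append(), and remove child nodes with Element.remove().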

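Changing the text of an Element

for age in root.iter('age'):
    # use a separate name so the loop variable (the Element) is not overwritten
    new_age = int(age.text) + 1
    age.text = str(new_age)

Adding an attribute to an Element

for age in root.iter('age'):
    age.set('is_modified', 'yes')

Removing child nodes from an Element

for people in root.findall('people'):
    gender = people.find('gender').text
    if gender == 'female':
        root.remove(people)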

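Note: modifying a tree while iterating over it causes the same kind of problems as modifying a Python list or dict during iteration. For that reason the matching elements are first collected with findall(), which returns a list, and only then removed.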
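Putting the pieces of this section together, a minimal end-to-end sketch might look like this. It works on an inline document instead of data.xml, and the output file name, temporary directory and the extra Tom entry are assumptions for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<people name='Jack'><gender>male</gender><age>30</age></people>"
    "<people name='Anna'><gender>female</gender><age>20</age></people>"
    "</data>"
)
tree = ET.ElementTree(root)

# Change text and attributes in place.
for age in root.iter("age"):
    age.text = str(int(age.text) + 1)
    age.set("is_modified", "yes")

# findall() returns a list, so removing from root while looping is safe.
for people in root.findall("people"):
    if people.findtext("gender") == "female":
        root.remove(people)

# Add a new child element.
ET.SubElement(root, "people", {"name": "Tom"})

# Write the result to a temporary file (hypothetical output location).
out_path = os.path.join(tempfile.mkdtemp(), "output.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True)

result = ET.parse(out_path).getroot()
print([p.get("name") for p in result.findall("people")])
```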

5. Generating an XML document

The SubElement() function is a convenient way to create child elements.

x = ET.Element('x')
y = ET.SubElement(x, 'y', attrib={'yy': 'attr_y'})
z = ET.SubElement(x, 'z')
m = ET.SubElement(z, 'm', attrib={'mm': 'attr_m'})
ET.dump(x)
# prints <x><y yy="attr_y" /><z><m mm="attr_m" /></z></x>
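ET.dump() is meant for debugging. To obtain the generated XML as a string, tostring() can be used instead, as in this small sketch with made-up element names.

```python
import xml.etree.ElementTree as ET

root = ET.Element("data")
# Attributes can also be passed as keyword arguments.
jack = ET.SubElement(root, "people", name="Jack")
ET.SubElement(jack, "age").text = "30"

# tostring() returns bytes by default; encoding="unicode" returns a str.
xml_str = ET.tostring(root, encoding="unicode")
print(xml_str)
```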

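6. Parsing XML with namespaces

Namespaces exist to avoid naming conflicts when vocabularies from different documents are mixed. Tags and attributes are written with a prefix, such as prefix:tag or prefix:attr, and the prefix stands for a full URI (Uniform Resource Identifier).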

xml_text = '''<?xml version="1.0"?>
<marks xmlns:other="http://specific.example.com"
       xmlns="http://base.example.com">
    <mark>
        <location>left</location>
        <other:example>something is best</other:example>
    </mark>
    <mark>
        <location>right</location>
        <other:example>something is better</other:example>
        <other:example>something is good</other:example>
    </mark>
</marks>
'''

root_s = ET.fromstring(xml_text)

# map our own prefixes to the namespace URIs used in the document
ns = {"base": "http://base.example.com",
      "specific": "http://specific.example.com"}

for mark in root_s.findall('base:mark', ns):
    location = mark.find('base:location', ns).text
    for example in mark.findall('specific:example', ns):
        print(f'{location}->{example.text}')

# prints:
left->something is best
right->something is better
right->something is good

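# Note: the prefixes used as keys in ns are arbitrary, but the URIs must match the xmlns declarations in the document exactly; otherwise nothing is found.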
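Internally, ElementTree stores each namespaced tag as {uri}tag, and that notation can be used directly in searches instead of a prefix dictionary. A short sketch on a trimmed-down version of the document above:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<marks xmlns="http://base.example.com" '
    'xmlns:other="http://specific.example.com">'
    "<mark><other:example>something is best</other:example></mark>"
    "</marks>"
)

# After parsing, prefixes are gone: tags carry the full URI in braces.
print(doc.tag)

# The same notation can be used in searches instead of a prefix mapping.
path = "{http://base.example.com}mark/{http://specific.example.com}example"
texts = [e.text for e in doc.findall(path)]
print(texts)
```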

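7. XPath expressions

(1) XPath syntax

tag: selects all child elements named tag

tag/tag1: selects all grandchildren named tag1 below children named tag

{namespace}*: selects all elements in the namespace namespace

{*}tag: selects elements named tag in any namespace or in no namespace

{}*: selects elements that are not in a namespace

Note: support for these {namespace}*, {*}tag and {}* wildcards was added in Python 3.8

*: selects all child elements, including comments and processing instructions

.: selects the current node; mostly useful at the start of a relative path

//: selects all descendant elements, on all levels below the current element

..: selects the parent element; returns None if the path tries to reach the ancestors of the start element

[@attr]: selects all elements that have an attribute named attr

[@attr='value']: selects all elements whose attribute attr has the value value; value must not contain quotes

[tag]: selects all elements that have a child named tag

[.='text']: selects all elements whose complete text content, including that of descendants, equals text

[tag='text']: selects all elements that have a child named tag whose complete text content, including that of descendants, equals text

[position]: selects all elements located at the given position; position is an integer (1 is the first position), the expression last(), or a position relative to the last one such as last()-1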

(2) Examples

for child in root.findall('./people/friend'):
    print(child.get('name'))

# prints:
Mary
Peter
White

# friend[1] selects the first friend child of each parent
for child in root.findall('.//friend[1]'):
    print(child.get('name'))

# prints:
Mary
White
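A few of the predicates from the syntax list can be tried out in the same way. The sketch below uses an inline document that mirrors the structure assumed for data.xml; the friend and people values are the ones printed above, while their grouping is an assumption.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<data>
    <people name="Jack">
        <gender>male</gender>
        <friend name="Mary" relationship="lover"/>
        <friend name="Peter" relationship="classmate"/>
    </people>
    <people name="Anna">
        <gender>female</gender>
        <friend name="White" relationship="neighbour"/>
    </people>
</data>
""")

# [@attr='value']: friends anywhere in the tree with a given attribute value.
classmates = [f.get("name")
              for f in root.findall(".//friend[@relationship='classmate']")]

# [tag='text']: people that have a <gender> child whose text is 'female'.
women = [p.get("name") for p in root.findall("./people[gender='female']")]

# [last()]: the last <friend> child of each parent.
last_friends = [f.get("name") for f in root.findall(".//friend[last()]")]

# '..': step from a matching friend up to its parent element.
parents = [p.get("name") for p in root.findall(".//friend[@name='White']/..")]

print(classmates, women, last_friends, parents)
```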

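8. Include directives

The xml.etree.ElementInclude module provides limited support for XInclude directives. It can be used to insert subtrees and text into a tree. To include an XML or plain-text file in the current document, use the {http://www.w3.org/2001/XInclude}include element, set its parse attribute to xml or text, and name the file to include with the href attribute.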

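xml_text1 = '''<?xml version="1.0"?>
<document xmlns:other="http://www.w3.org/2001/XInclude">
    <other:include href="include.xml" parse="xml" />
</document>
'''

from xml.etree import ElementInclude

root_s1 = ET.fromstring(xml_text1)
# replaces each include element with the root element of the referenced file
ElementInclude.include(root_s1)

Assuming include.xml contains <para>This is included content.</para>, root_s1 then corresponds to the following document:

<?xml version="1.0"?>
<document>
    <para>This is included content.</para>
</document>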
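By default the href is read as a file name, so the example above needs include.xml on disk. For experimenting, ElementInclude.include() also accepts a loader callable. The sketch below uses a hypothetical in_memory_loader that serves the included content from a string; the function name and its behaviour for unknown resources are choices made for this example.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

document = ET.fromstring(
    '<document xmlns:other="http://www.w3.org/2001/XInclude">'
    '<other:include href="include.xml" parse="xml" />'
    "</document>"
)

def in_memory_loader(href, parse, encoding=None):
    """Hypothetical loader that serves include.xml from a string, not disk."""
    if href == "include.xml" and parse == "xml":
        return ET.fromstring("<para>This is included content.</para>")
    return None  # unknown resources make include() raise an error

ElementInclude.include(document, loader=in_memory_loader)
included = ET.tostring(document, encoding="unicode")
print(included)
```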

(End)
