Parsing multiple XML files in Python
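I have a problem here. I want to parse several XML files that share the same structure. I have already managed to collect the location of every file and store the paths in three different lists, because there are three different types of XML structure. Now I want to write three functions (one per list) that loop through a list and parse out the information I need. Somehow I can't get it to work. Can anyone give me a hint on how to do this?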
import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys

#### Get the location of each XML file and save them into a list ####
all_xml_list = []

def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)

for files in locate('*.xml', r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)

#### Create lists by GameDay Events ####
xml_GameDay_Player = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match = [x for x in all_xml_list if 'Match' in x]
The XML files look like this:
<sports-content xmlns:imp="url">
<sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
<sports-title>player-statistics-165483</sports-title>
</sports-metadata>
<sports-event>
<event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
<team>
<team-metadata id="O_17" team-key="17">
<name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
</team-metadata>
<player>
<player-metadata player-key="33201" uniform-number="1">
<name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
</player-metadata>
<player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
<rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
<rating rating-type="grade" rating-value="2.2" />
<rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
<rating rating-type="bemeister" rating-value="16.04086" />
<player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
<stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
<stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
<stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
</player-stats-soccer>
</player-stats>
</player>
</team>
</sports-event>
</sports-content>
I want to extract everything inside the "player-metadata", "player-stats" (the element carrying stats-coverage) and "player-stats-soccer" tags.
Improving on @Gnudiff's answer, here is a more resilient approach:
import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)
    for player in tree.xpath('.//player'):
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
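The question also asks for the attributes of player-stats and player-stats-soccer. The following is only a sketch of one way to collect them, assuming the structure of the sample file above. It uses the standard library's xml.etree.ElementTree rather than lxml, and the helper names local_name and extract_players are made up for illustration. ElementTree reports namespaced attributes such as imp:duels-won as {url}duels-won, so the sketch strips the namespace part.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for one of the Player files (assumed structure).
SAMPLE = """<sports-content xmlns:imp="url">
  <sports-event>
    <team>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" />
        </player-metadata>
        <player-stats stats-coverage="standard" minutes-played="90">
          <player-stats-soccer imp:duels-won="1" imp:passes="32">
            <stats-soccer-defensive saves="5" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>"""


def local_name(name):
    """Drop a '{namespace}' prefix, e.g. '{url}passes' -> 'passes'."""
    return name.rsplit('}', 1)[-1]


def extract_players(root):
    """Return one flat dict per <player> with metadata and stats attributes."""
    players = []
    for player in root.iter('player'):
        row = {}
        for path in ('player-metadata', 'player-metadata/name', 'player-stats'):
            element = player.find(path)
            if element is not None:
                row.update({local_name(k): v for k, v in element.attrib.items()})
        soccer = player.find('player-stats/player-stats-soccer')
        if soccer is not None:
            # the soccer element itself plus all of its descendants
            for element in soccer.iter():
                row.update({local_name(k): v for k, v in element.attrib.items()})
        players.append(row)
    return players


players = extract_players(ET.fromstring(SAMPLE))
print(players[0]['last'], players[0]['passes'], players[0]['saves'])
# prints: Mustermann 32 5
```

For real files you would call extract_players(ET.parse(filename).getroot()) inside the loop over the Player list. A flat dict is the simplest shape but silently overwrites attributes that share a name, so the rating elements, which repeat the same attribute names, are left out here on purpose.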
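XML files can come in a variety of byte encodings and are prefixed with the XML declaration, which declares the encoding of the rest of the file.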
<?xml version="1.0" encoding="UTF-8"?>
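UTF-8 is a common encoding for XML files (it is also the default), but in practice it can be anything. You cannot predict it, and hard-coding your program to expect a particular encoding is very bad practice.
XML parsers are designed to handle this peculiarity transparently, so you don't have to worry about it, unless you do it wrong.
Here is a good example of doing it wrong: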
# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))
Here is what happens:
- Python opens filename as a text file f
- f.read() returns a string
- etree.XML() parses that string and creates a DOM object tree
Doesn't sound so wrong, does it? But if the XML looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>
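then the DOM you end up with will be:
Player
    @nickname="MÃ¤xchen"
You just destroyed the data. And unless the XML contains an "extended" character like ä, you would not even notice that this approach is broken. That can easily slip by unnoticed.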
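The mechanism behind that damage can be reproduced without any XML at all. The snippet below is a small illustration, assuming the text-mode default encoding happens to be Windows-1252 (cp1252), as it often is on a Western European Windows machine. On other systems the wrong decoding may look different or fail outright.

```python
original = 'Mäxchen'
raw_bytes = original.encode('utf-8')   # the bytes stored in a UTF-8 file

# Assumption: open() in text mode decodes with cp1252 instead of UTF-8.
misread = raw_bytes.decode('cp1252')
print(misread)
# prints: MÃ¤xchen

# Pure ASCII content survives the wrong decoding, which hides the bug.
ascii_only = 'Max'.encode('utf-8').decode('cp1252')
print(ascii_only == 'Max')
# prints: True
```

This is why the problem tends to surface only once a file with non-ASCII names shows up.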
There is only one correct way to open an XML file (and it is also simpler than the code above): give the file name to the parser.
tree = etree.parse('some_filename.xml')
This way the parser can figure out the file encoding before it reads any data, and you don't have to care about those details.
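To see the parser honoring the declaration, here is a hedged sketch that writes a small file in ISO-8859-1 (not UTF-8) and then parses it by file name. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages. The call has the same shape as etree.parse in lxml, and the temporary file is purely an illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# A small XML document that is NOT UTF-8; its declaration says so.
xml_text = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            '<Player nickname="Mäxchen" />')

with tempfile.NamedTemporaryFile('wb', suffix='.xml', delete=False) as f:
    f.write(xml_text.encode('iso-8859-1'))
    path = f.name

try:
    # The parser reads raw bytes and decodes them as declared in the file.
    nickname = ET.parse(path).getroot().get('nickname')
finally:
    os.remove(path)

print(nickname == 'Mäxchen')
# prints: True
```

No encoding is named anywhere in the parsing code, which is the point: the file itself carries that information.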
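This is not a complete solution for your particular case, because it is a fair amount of work and I am on a tablet without a keyboard.
In general, you can do this in several ways, depending on whether you really need all the data or only a specific subset, and on whether you know all possible structures in advance.
For example, one way: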
from lxml import etree

Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}  # a dict, so values can be stored under a key
        # an attribute XPath returns a list of strings; attribute names are case-sensitive
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            # do stuff with player data
            pass
        Playerdata.append(row)
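This is adapted from an existing script of mine, and it is better suited to extracting only a specific subset of the XML. If you need all of the data, you are probably better off using some kind of XML tree walker.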
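As a rough idea of what such a tree walker could look like, here is a minimal sketch using the standard library's xml.etree.ElementTree. The function name walk and the shortened sample document are assumptions made for illustration. It visits every element recursively and yields each attribute together with the path of the element it belongs to.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a real file (assumed structure).
SAMPLE = """<team>
  <team-metadata team-key="17" />
  <player>
    <player-metadata player-key="33201" />
    <player-stats minutes-played="90" score="0" />
  </player>
</team>"""


def walk(element, path=''):
    """Yield (path, attribute, value) for every attribute in the subtree."""
    here = f'{path}/{element.tag}'
    for name, value in element.attrib.items():
        yield here, name, value
    for child in element:
        yield from walk(child, here)


rows = list(walk(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

Because nothing about the structure is hard-coded, the same function works for Player, Team and Match files alike. The price is that you get everything and have to filter afterwards.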
file_get_contents is a small helper function:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
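XPath is a powerful language for finding nodes within XML. Note that, depending on the XPath you use, the result may be XML nodes, as in the "for player in ..." statement, or strings, as in the "row['player'] =" statement.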