How can I use Python to extract data from XML that has no sub-root elements?

There are 3 objects, each with a different number of attributes. How can I scrape the attributes of each one into a dataframe?

<movie>'ABC'</movie>
<meta name="actor" content="Joseph"></meta>
<meta name="actor_ATTR" content="Male"></meta>
<meta name="actor_ATTR" content="32 Yrs"></meta>

<meta name="actor" content="Alex"></meta>
<meta name="actor_ATTR" content="Male"></meta>

<meta name="actor" content="John"></meta>
<meta name="actor_ATTR" content="Male"></meta>
<meta name="actor_ATTR" content="32 Yrs"></meta>
<meta name="actor_ATTR" content="3 awards"></meta>

and so on.
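Note that the fragment above has several top-level elements and no common parent, so an XML parser will not accept it as a document on its own. The following is a minimal sketch of one way around that, using the standard library's `xml.etree.ElementTree`; the helper `parse_fragment` and the wrapper element name `root` are assumptions made for illustration and do not come from the original post.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment above: several top-level elements, no common parent.
fragment = """
<movie>'ABC'</movie>
<meta name="actor" content="Joseph"></meta>
<meta name="actor_ATTR" content="Male"></meta>
"""


def parse_fragment(text, wrapper="root"):
    """Wrap a multi-element fragment in one parent so it parses as a document.

    The wrapper name "root" is an arbitrary choice for this sketch.
    """
    return ET.fromstring(f"<{wrapper}>{text}</{wrapper}>")


try:
    ET.fromstring(fragment)
    parsed_without_wrapper = True
except ET.ParseError:
    # A well-formed document may contain only one top-level element.
    parsed_without_wrapper = False

root = parse_fragment(fragment)
tags = [child.tag for child in root]
print(parsed_without_wrapper, tags)
```

Once wrapped, `<movie>` and every `<meta>` element are siblings under a single parent, which is the arrangement that sibling-based XPath expressions rely on.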

Required output dataframe:

Name    Attributes
Joseph  [Male, 32 Yrs]
Alex    [Male]
John    [Male, 32 Yrs, 3 awards]

Answer

This is a fairly complex operation, so I'll explain as much as I can, but it won't be easy to digest:

from lxml import etree
# The core of this approach is the XPath operator `intersect`, which belongs to XPath 2.0.
# lxml only supports XPath 1.0, so we need elementpath, a library that supports XPath 2.0.
import elementpath
import pandas as pd

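# The fragment has several top-level elements, so it must sit inside a single
# wrapper element (any name works) before it can be parsed.
movie = """<root>[your xml above]</root>"""
root = etree.XML(movie)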

columns = ['Name','Sex','Age','Awards'] #prepare the dataframe columns
rows = [] #initialize the collection of information about actors
anchor = '@name="actor"' #this isn't strictly necessary, but because the xpath expressions get progressively convoluted, I believe this will make it more readable
actor_count = elementpath.select(root,f'count(//meta[{anchor}])') # how many actors are there? Note that this is the first, but not last, use of f-strings; you should read up on those as well
meta_count = elementpath.select(root,'count(//meta)') #how many items are there?

for c in range(actor_count): #for each actor
    row = [] #initialize a list containing data about this actor
    # The intersect operator combines two node sets: one starts at this actor and looks down,
    # the other starts at the next boundary and looks up; the result is what lies in between.
    # This is where it gets really complex, so you'll just have to read up on it.
    # Note {c+1} instead of {c}: Python's range() counts from 0, while XPath positions count from 1.
    top_down = f'//meta[{anchor}][{c+1}]/(self::meta,following-sibling::meta)'

    # Triple quotes are needed because the expression spans several lines.
    bottom_up = f'''(//meta[{anchor}][preceding-sibling::meta[1][not({anchor})]]
    ,//meta[count(./preceding-sibling::*) = {meta_count}])[{c+1}]/(self::meta[not({anchor})],
    preceding-sibling::meta)'''

    src_exp = f'{top_down} intersect {bottom_up}'
    entries = elementpath.select(root,src_exp) 
    #if everything works, this should have separated the actors' data into separate groups
    for entry in entries:
        row.append(entry.attrib['content']) #add this actor's data to the actor's row

    if len(entries)<4:
        row += ['NA'] * (4 - len(entries)) #since some actors don't have all data items, the row for such an actor needs to be padded with 'NA's.
    rows.append(row) #add this actor's data to the general data pool
df = pd.DataFrame(rows,columns=columns) #load the whole thing into a dataframe
print(df) #in a notebook, evaluating df on its own also displays it

Output:

    Name    Sex     Age     Awards
0   Joseph  Male    32 Yrs  NA
1   Alex    Male    NA      NA
2   John    Male    32 Yrs  3 awards
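
For comparison, the same grouping can be sketched without XPath 2.0 at all. This is a different technique from the `intersect` approach above: a single pass in document order that starts a new group at every `name="actor"` element. It is a minimal sketch using only the standard library, and it assumes the fragment has been wrapped in a parent element (here named `root`, an arbitrary choice); the function name `group_actors` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample data from the question, wrapped in a parent element so it parses.
xml_text = """<root>
<movie>'ABC'</movie>
<meta name="actor" content="Joseph"></meta>
<meta name="actor_ATTR" content="Male"></meta>
<meta name="actor_ATTR" content="32 Yrs"></meta>
<meta name="actor" content="Alex"></meta>
<meta name="actor_ATTR" content="Male"></meta>
<meta name="actor" content="John"></meta>
<meta name="actor_ATTR" content="Male"></meta>
<meta name="actor_ATTR" content="32 Yrs"></meta>
<meta name="actor_ATTR" content="3 awards"></meta>
</root>"""


def group_actors(root):
    """Return [(actor_name, [attribute, ...]), ...] in document order."""
    groups = []
    for meta in root.iter("meta"):
        name, content = meta.get("name"), meta.get("content")
        if name == "actor":
            # Each actor element opens a new group.
            groups.append((content, []))
        elif name == "actor_ATTR" and groups:
            # Attribute elements belong to the most recent actor.
            groups[-1][1].append(content)
    return groups


groups = group_actors(ET.fromstring(xml_text))
for actor, attrs in groups:
    print(actor, attrs)
# Joseph ['Male', '32 Yrs']
# Alex ['Male']
# John ['Male', '32 Yrs', '3 awards']
```

The resulting list of pairs matches the Name/Attributes layout requested in the question, and it should be possible to load it with something like `pd.DataFrame(groups, columns=['Name', 'Attributes'])`. Unlike the fixed four-column version, this keeps a variable-length list per actor, so no 'NA' padding is needed.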
