Parsing all elements in XML File to CSV without hardcoding values

【Title】: Parsing all elements in XML File to CSV without hardcoding values 【Posted】: 2021-02-10 06:57:06 【Question】:

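I'd like to know whether there is a way to parse the XML below and pull most of its tags, including nested ones, into columns and rows without hardcoding tag names.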

<?xml version="1.0" encoding="UTF-8"?>

<faults version="1" xmlns="urn:nortel:namespaces:mcp:faults" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:nortel:namespaces:mcp:faults NortelFaultSchema.xsd ">
    <family longName="1OffMsgr" shortName="OOM"/>
    <family longName="ACTAGENT" shortName="ACAT">
        <logs>
           <log>
                <eventType>RES</eventType>
                <number>1</number>
                <severity>INFO</severity>
                <descTemplate>
                     <msg>Accounting is enabled upon this NE.</msg>
               </descTemplate>
               <note>This log is generated when setting a Session Manager's AM from &lt;none&gt; to a valid AM.</note>
               <om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the  StdRecordStream group will appear and start counting the recording units sent to the configured AM.
                   On the configured AM, the &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will appear and start counting the recording units received from this Session Manager's instances.
               </om>
            </log>
           <log>
                <eventType>RES</eventType>
                <number>2</number>
                <severity>ALERT</severity>
                <descTemplate>
                     <msg>Accounting is disabled upon this NE.</msg>
               </descTemplate>
               <note>This log is generated when setting a Session Manager's AM from a valid AM to &lt;none&gt;.</note>
               <action>If you do not intend for the Session Manager to produce accounting records, then no action is required.  If you do intend for the Session Manager to produce accounting records, then you should set the Session Manager's AM to a valid AM.</action>
               <om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the StdRecordStream group that matched the previous datafilled AM will disappear.
                   On the previously configured AM, the  &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will disappear.
               </om>
            </log>
        </logs>
    </family>
    <family longName="ACODE" shortName="AC">
        <alarms>
            <alarm>
                <eventType>ADMIN</eventType>
                <number>1</number>
                <probableCause>INFORMATION_MODIFICATION_DETECTED</probableCause>
                <descTemplate>
                    <msg>Configured data for audiocode server updated: $1</msg>
                     <param>
                         <num>1</num>
                         <description>AudioCode configuration data got updated</description>
                         <exampleValue>acgwy1</exampleValue>
                     </param>
               </descTemplate>
               <manualClearable></manualClearable>
               <correctiveAction>None. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
               <alarmName>Audiocode Server Updated</alarmName>
               <severities>
                     <severity>MINOR</severity>
               </severities>               
            </alarm>
            <alarm>
                <eventType>ADMIN</eventType>
                <number>2</number>
                <probableCause>CONFIG_OR_CUSTOMIZATION_ERROR</probableCause>
                <descTemplate>
                    <msg>Deployment for audiocode server failed: $1. Reason: $2.</msg>
                     <param>
                         <num>1</num>
                         <description>AudioCode Name</description>
                         <exampleValue>audcod</exampleValue>
                     </param>
                     <param>
                         <num>2</num>
                         <description>AudioCode Deployment failed reason</description>
                         <exampleValue>Failed to parse audiocode configuration data</exampleValue>
                     </param>
               </descTemplate>
               <manualClearable></manualClearable>
               <correctiveAction>Check the configuration of audiocode server. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
               <alarmName>Audiocode Server Deploy Failed</alarmName>
               <severities>
                     <severity>MINOR</severity> 
                     <severity>MAJOR</severity>
               </severities>               
            </alarm>
            <alarm>
                <eventType>COMM</eventType>
                <number>2</number>
                <probableCause>LOSS_OF_FRAME</probableCause>
                <descTemplate>
                    <msg>Far end LOF (a.k.a., Yellow Alarm). Trunk (DS1 Number): $1.</msg>
                     <param>
                         <num>1</num>
                         <description>Trunk Number of Trunk with configuration problem</description>
                         <exampleValue>2</exampleValue>
                     </param>
               </descTemplate>
               <clearCondition>Far end is correctly configured for proper framing.</clearCondition>
               <correctiveAction>Check that the far end is configured for the proper framing.</correctiveAction>
               <alarmName>Far end LOF</alarmName>
               <severities>
                     <severity>CRITICAL</severity>
               </severities>
               <note>This alarm indicates the Trunk Framing settings on the connected PSTN switch do not match those provisioned on the Audiocodes Mediant 2k.</note>
            </alarm>
            <alarm>
                <eventType>COMM</eventType>
                <number>3</number>
                <probableCause>LOSS_OF_FRAME</probableCause>
                <descTemplate>
                    <msg>Near end sending LOF Indication. Trunk (DS1 Number): $1.</msg>
                     <param>
                         <num>1</num>
                         <description>Trunk Number of Trunk with configuration problem</description>
                         <exampleValue>2</exampleValue>
                     </param>
               </descTemplate>
               <clearCondition>Gateway is correctly configured for proper framing.</clearCondition>
               <correctiveAction>Check that the Audiocodes gateway is configured for the proper framing.</correctiveAction>
               <alarmName>Near end sending LOF Indication</alarmName>
               <severities>
                     <severity>CRITICAL</severity>
               </severities>               
            </alarm>
        </alarms>
        <logs>
           <log>
                <eventType>ABNORMAL</eventType>
                <number>1</number>
                <severity>ALERT</severity>
                <descTemplate>
                     <msg>Failed to deploy audiocode server. Server Name: $1, Failed At: $2</msg>
                     <param>
                         <num>1</num>
                         <description>IP address of gateway which failed.</description>
                         <exampleValue>192.168.0.1</exampleValue>
                     </param>
                     <param>
                         <num>2</num>
                         <description>One of the following: "Parse Configuration Data","Upload Tone File","Upload Load File" and "Upload Configuration File"</description>
                         <exampleValue>Parse Configuration Data</exampleValue>
                     </param>
               </descTemplate>
               <note>There was a problem during the commissioning/upgrade of a gateway.  Either the configuration file was corrupt or files could not be uploaded to the gateway.</note>
               <action>Examine the MCS logs as well as the syslogs from the gateway to determine what is causing the problem.</action>
            </log>
           <log>
                <eventType>ABNORMAL</eventType>
                <number>2</number>
                <severity>ALERT</severity>
                <descTemplate>
                     <msg>Failed to restart audiocode server. Server Name: $1. Exception caught: $2</msg>
                     <param>
                         <num>1</num>
                         <description>Server Long Name</description>
                         <exampleValue>audiocode_gateway_1</exampleValue>
                     </param>
                     <param>
                         <num>2</num>
                         <description>Exception occured during restarting the server.</description>
                         <exampleValue>[example Java exception traceback not given]</exampleValue>
                     </param>
               </descTemplate>
               <note>The AudioCodes Gateway was unable to be restarted due to a problem found in the INI file.</note>
               <action>Examine the configuration file and the syslogs of the gateway to determine what the configuration error is.  Correct this, then restart the server.</action>
            </log>
     </logs>
    </family>
</faults>

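The code below essentially does this, but it does not pick up the nested elements inside the descTemplate tag. I'd like a working solution that parses every element, nested ones included, with no hardcoding (or as little as possible).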

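To elaborate on what the program does: take the eventType tag in my XML as an example. The program creates a column named "eventType" and places each value found in that tag under the column. Every "eventType" tag it parses goes into the same column.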
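A minimal sketch of that column-discovery behaviour, assuming each parsed log or alarm has already been collected as a plain dict. The helper name `rows_to_csv` and the two sample rows are illustrative and are not part of the original code; the sketch uses only the standard library:

```python
import csv
import io

def rows_to_csv(rows):
    """Write dict rows to CSV text; columns appear in first-seen order."""
    header = []
    for row in rows:
        for name in row:
            if name not in header:  # a new tag becomes a new column
                header.append(name)
    buf = io.StringIO()
    # restval fills the cell when a row lacks one of the discovered columns
    writer = csv.DictWriter(buf, fieldnames=header, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical rows, shaped like one <log> and one <alarm> from the XML above
rows = [
    {"eventType": "RES", "number": "1", "severity": "INFO"},
    {"eventType": "ADMIN", "number": "1", "alarmName": "Audiocode Server Updated"},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Both eventType values land in the same column, and a row that lacks a tag simply gets an empty cell there.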

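In an earlier, very similar question, tdelaney generously provided this code. I haven't figured out how to extend it to solve my problem, so I thought I'd ask again. Thanks, tdelaney: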

import csv
import lxml.etree
from lxml.etree import QName
import operator

class ExpandingTable:
    """A 2 dimensional table where columns are expanded as new column
    types are discovered"""

    def __init__(self):
        """Create table that can expand rows and columns"""
        self.name_to_col = {}
        self.table = []
    
    def add_column(self, name):
        """Add column named `name` unless already included"""
        if name not in self.name_to_col:
            self.name_to_col[name] = len(self.name_to_col)
            for row in self.table:
                row.append('')
    
    def add_cell(self, name, value):
        """Add value to named column in the current row"""
        if value:
            self.add_column(name)
            self.table[-1][self.name_to_col[name]] = value.strip().replace("\r\n", " ")
            
    def new_row(self):
        """Create a new row and make it current"""
        self.table.append([''] * len(self.name_to_col))

    def header(self):
        """Gather discovered column names into a header list"""
        idx_1 = operator.itemgetter(1)
        return [name for name, _ in sorted(self.name_to_col.items(), key=idx_1)]

    def prepend_header(self):
        """Gather discovered column names into a header and
        prepend it to the list"""
        self.table.insert(0, self.header())

def events_to_table(elem):
    """ Builds table from <family> child elements and their contained alarms and
    logs."""
    ns = {"f": "urn:nortel:namespaces:mcp:faults"}
    table = ExpandingTable()
    for family in elem.xpath("f:family", namespaces=ns):
        longName = family.get("longName")
        shortName = family.get("shortName")
        for event in family.xpath("*/*[f:eventType]", namespaces=ns):
            table.new_row()
            table.add_cell("longName", longName)
            table.add_cell("shortName", shortName)
            for cell in event:
                tag = QName(cell.tag).localname
                if tag == "severities":
                    tag = "severity"
                    text = ",".join(severity.text for severity in cell.xpath("*"))
                    print("severities", repr(text))
                else:
                    text = cell.text
                table.add_cell(tag, text)
    table.prepend_header()
    return table.table
    
def main(filename):
    doc = lxml.etree.parse(filename)
    table = events_to_table(doc.getroot())
    with open('test.csv', 'w', newline='', encoding='utf-8') as fileobj:
        csv.writer(fileobj).writerows(table)

main('OMGroups.xml')

Any help would be greatly appreciated.

【Answer 1】:

Try this.

from simplified_scrapy import SimplifiedDoc, utils


def getKeyValues(nodeCols, dic, header):
    for nodeCol in nodeCols:
        childCols = nodeCol.children
        if childCols:
            getKeyValues(childCols, dic, header)
        else:
            tag = nodeCol.tag
            v = dic.get(tag)
            if v:  # Cases with multiple values
                dic[tag] = v + '|' + nodeCol.text # Splicing into 1 column
                # i = 1
                # while True:
                #     tag = tag + str(i)
                #     v = dic.get(tag)
                #     if v == None:
                #         dic[tag] = nodeCol.text
                #         break
                #     i = i + 1
            else:
                dic[tag] = nodeCol.text

            if tag not in header:
                header.append(tag)


xml = utils.getFileContent('OMGroups.xml')
doc = SimplifiedDoc(xml)  # create doc
header = ['longName','shortName','nodeType'] # add column
dicRow = []
# nodes = doc.faults.children.child
parentNodes = doc.faults.children.children # add
for nodes in parentNodes: # add
   for node in nodes:  # logs,alarms...
      if not node:
         continue
      family = node.parent
      longName = family['longName'] # get the value
      shortName = family['shortName']
      nodeRows = node.children
      for nodeRow in nodeRows:  # log,log...
         dicCol = {'longName': longName, 'shortName': shortName, 'nodeType': nodeRow.tag}
         nodeCols = nodeRow.children  # eventType,number
         getKeyValues(nodeCols, dicCol, header)
         dicRow.append(dicCol)

# Prepare the data and store it in the csv file
rows = [header]
for dic in dicRow:
    rows.append([dic.get(k) for k in header])

utils.save2csv('test.csv', rows, newline='')
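For readers who would rather not add a third-party dependency, the same recursive idea can be sketched with the standard library's xml.etree.ElementTree. This is an alternative sketch and not the simplified_scrapy code above: the helper names `local_name` and `flatten_leaves` and the trimmed SAMPLE fragment are assumptions made for illustration. Repeated leaf tags are joined with '|', as in the answer.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def flatten_leaves(elem, row):
    """Recursively copy every leaf element's text into row, keyed by tag."""
    for child in elem:
        if len(child):  # element has children: descend instead of reading text
            flatten_leaves(child, row)
            continue
        text = " ".join((child.text or "").split())  # collapse whitespace
        if not text:
            continue  # skip empty leaves
        name = local_name(child.tag)
        # Repeated tags share one cell, separated by '|'
        row[name] = row[name] + "|" + text if name in row else text
    return row

# Hypothetical fragment modelled on one <alarm> from the question's XML
SAMPLE = """<alarm xmlns="urn:nortel:namespaces:mcp:faults">
  <eventType>ADMIN</eventType>
  <descTemplate>
    <msg>Deployment failed: $1. Reason: $2.</msg>
    <param><num>1</num><exampleValue>audcod</exampleValue></param>
    <param><num>2</num><exampleValue>bad data</exampleValue></param>
  </descTemplate>
  <severities><severity>MINOR</severity><severity>MAJOR</severity></severities>
</alarm>"""

row = flatten_leaves(ET.fromstring(SAMPLE), {})
print(row)
```

No tag name is hardcoded: any leaf at any depth becomes a column, so descTemplate's msg, num and exampleValue are picked up alongside the top-level tags.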

【Comments】:

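Thank you so much, but I was wondering: do you think it's possible to put everything inside each descTemplate into 1 cell? So all of the tags (msg, num, description, exampleValue) would be condensed into 1 cell for each alarm/log? If I could get the original code in my question to do that, it would be perfect.

@marcorivera8 I changed the answer; have a look and see whether it's what you want.

Hey @Yazz, very close, thank you! But how can I get into the family tag, grab longName and shortName, and put them in the first 2 columns?

@marcorivera8 dicCol = 'longName': longName + nodeRow.select('eventType>text()') ...

@marcorivera8 Never mind. I updated the answer.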
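The two requests in these comments (one cell per descTemplate, and longName/shortName as the first two columns) could be sketched as follows. This is a standard-library sketch under stated assumptions, not the updated answer itself: the helpers `clean`, `cell_text` and `family_rows`, the 'tag=text; ...' cell layout, and the trimmed SAMPLE document are all illustrative choices.

```python
import xml.etree.ElementTree as ET

NS = {"f": "urn:nortel:namespaces:mcp:faults"}

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def clean(text):
    """Collapse runs of whitespace; treat missing text as empty."""
    return " ".join((text or "").split())

def cell_text(elem):
    """Leaf: its own text. Container: all leaf descendants as 'tag=text'."""
    if len(elem) == 0:
        return clean(elem.text)
    return "; ".join(
        f"{local_name(node.tag)}={clean(node.text)}"
        for node in elem.iter()
        if len(node) == 0 and clean(node.text)
    )

def family_rows(root):
    """One dict per log/alarm, with the family attributes first."""
    rows = []
    for family in root.findall("f:family", NS):
        for event in family.findall("*/*"):  # grandchildren: log, alarm
            row = {"longName": family.get("longName"),
                   "shortName": family.get("shortName")}
            for cell in event:
                row[local_name(cell.tag)] = cell_text(cell)
            rows.append(row)
    return rows

# Hypothetical, trimmed-down version of the question's document
SAMPLE = """<faults xmlns="urn:nortel:namespaces:mcp:faults">
  <family longName="1OffMsgr" shortName="OOM"/>
  <family longName="ACODE" shortName="AC">
    <alarms>
      <alarm>
        <eventType>ADMIN</eventType>
        <descTemplate>
          <msg>Configured data updated: $1</msg>
          <param><num>1</num><exampleValue>acgwy1</exampleValue></param>
        </descTemplate>
      </alarm>
    </alarms>
  </family>
</faults>"""

rows = family_rows(ET.fromstring(SAMPLE))
print(rows[0]["descTemplate"])
```

Because Python dicts keep insertion order, longName and shortName stay in the first two columns when the header is built from the row keys. The same `cell_text` treatment also collapses severities into a single cell.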
