How to convert an XML file with multiple row tags into a DataFrame
Original title: how to convert multiple row tag xml files to dataframe
Asked: 2018-05-20 09:41:10
Question:
I have an XML file that contains multiple row tags. I need to convert this XML into a proper DataFrame. I used spark-xml, but it only handles a single row tag.
The XML data is below:
<?xml version='1.0' encoding='UTF-8' ?>
<generic
xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">
<HEADER compName="ABGROUP" dateCreated="2018-03-09T09:38:51"/>
<COVERSHEET>
<ESTIMATE_INFO estimateName="2016-09-28-133907" priceList="YHTRDF" laborEff="Restoration/Service/Remodel" claimNumber="Hdchtdhtdh" policyNumber="Utfhtdhtd" typeOfLoss="Collapse" causeOfLoss="Collapse" roofDamage="0" deprMat="1" deprNonMat="1" deprRemoval="1" deprOandP="1" deprTaxes="1" estimateType="Mixed"/>
<ADDRESSES>
<ADDRESS type="Property" street="Pkwy" city="Lehi" state="UT" zip="0000" primary="1"/>
</ADDRESSES>
<CONTACTS>
<CONTACT type="ClaimRep" name="Vytvyfv"/>
<CONTACT type="Estimator" name="Vytvyfv"/>
</CONTACTS>
<DATES loss="2016-09-28T19:38:23Z" inspected="2016-09-28T19:39:27Z" completed="2018-03-09T09:38:49Z" received="2016-09-28T19:39:24Z" entered="2016-09-28T19:39:07Z" contacted="2016-09-28T19:39:26Z"/>
</COVERSHEET>
<COVERAGES>
<COVERAGE coverageName="Dwelling" coverageType="0" id="1"/>
<COVERAGE coverageName="Other Structures" coverageType="1" id="2"/>
<COVERAGE coverageName="Contents" coverageType="2" id="3"/>
</COVERAGES>
<LINE_ITEM_DETAIL>
<COV_BREAKDOWN>
<COV_AMOUNTS desc="Dwelling"/>
<COV_AMOUNTS desc="Other Structures"/>
<COV_AMOUNTS desc="Contents"/>
</COV_BREAKDOWN>
</LINE_ITEM_DETAIL>
<RECAP_BY_ROOM>
<RECAP_GROUP desc="2016-09-28-133907"/>
</RECAP_BY_ROOM>
</generic>
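Comments:
What do you want your DataFrame to look like? Can you provide a sample?
I'm not sure what it should look like. I'm trying to figure it out with spark-xml, but it only handles a single row tag. I want a proper DataFrame without any data loss.
Did you go through the link I gave you in my previous answer?
Yes, I found the parsing options, but no way to handle multiple row tags.
Shouldn't it be on one line? Let me answer with what I tried.
Answer 1:
I suggest reading it with a single rowTag (the generic element) and then exploding the result as needed.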
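First, the start tag of the row element should not be split across lines. XML itself allows a line break between the tag name and its attributes, but spark-xml (at least the version used here) did not seem to recognize the row tag in that form. So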
<generic
xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">
should be
<generic xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">
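If editing the file by hand is impractical, the line breaks inside start tags can also be removed with a small script before handing the file to Spark. The following is a minimal sketch, not part of the original answer: `join_start_tags` is a hypothetical helper name, and the regular expression assumes that no attribute value, comment, or CDATA section contains a literal `<` or `>`.

```python
import re

# Matches one complete tag, from "<" to the next ">".
# Assumption: attribute values never contain "<" or ">".
_TAG = re.compile(r"<[^<>]+>")
# Matches a line break together with the whitespace around it.
_LINE_BREAK = re.compile(r"\s*[\r\n]+\s*")


def join_start_tags(xml_text):
    """Replace line breaks that occur inside a tag with a single space.

    Line breaks between tags are left untouched.
    """
    return _TAG.sub(lambda m: _LINE_BREAK.sub(" ", m.group(0)), xml_text)


raw = (
    '<generic\n'
    'xmlns="http://xactware.com/generic.xsd" majorVersion="28">\n'
    '<HEADER compName="ABGROUP"/>\n'
    '</generic>'
)
fixed = join_start_tags(raw)
print(fixed)
```

The cleaned text can then be written to a new file and that file passed to `load`. For large inputs a streaming approach would be preferable, since this sketch holds the whole document in memory.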
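After this change, you can read it with the Databricks spark-xml package:
# Read the whole document as one row, using the root element as the row tag.
# valueTag is set to False, which is why the value field appears as "false" in the schema.
df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "generic") \
    .option("valueTag", False) \
    .load("path to xml file")
df.show(truncate=False)
df.printSchema()
This should give you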
+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------+----------------------+-------------+-------------+--------------+-------------------------------+
|COVERAGES |COVERSHEET |HEADER |LINE_ITEM_DETAIL |RECAP_BY_ROOM |_majorVersion|_minorVersion|_transactionId|_xmlns |
+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------+----------------------+-------------+-------------+--------------+-------------------------------+
|[[[Dwelling, 0, 1,], [Other Structures, 1, 2,], [Contents, 2, 3,]]]|[[[Lehi, 1, UT, Pkwy, Property, 0,]], [[[Vytvyfv, ClaimRep,], [Vytvyfv, Estimator,]]], [2018-03-09T09:38:49Z, 2016-09-28T19:39:26Z, 2016-09-28T19:39:07Z, 2016-09-28T19:39:27Z, 2016-09-28T19:38:23Z, 2016-09-28T19:39:24Z,], [Collapse, Hdchtdhtdh, 1, 1, 1, 1, 1, 2016-09-28-133907, Mixed, Restoration/Service/Remodel, Utfhtdhtd, YHTRDF, 0, Collapse,]]|[ABGROUP, 2018-03-09T09:38:51,]|[[[[Dwelling,], [Other Structures,], [Contents,]]]]|[[2016-09-28-133907,]]|28 |300 |0 |http://xactware.com/generic.xsd|
+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------+----------------------+-------------+-------------+--------------+-------------------------------+
root
|-- COVERAGES: struct (nullable = true)
| |-- COVERAGE: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- _coverageName: string (nullable = true)
| | | |-- _coverageType: long (nullable = true)
| | | |-- _id: long (nullable = true)
| | | |-- false: string (nullable = true)
|-- COVERSHEET: struct (nullable = true)
| |-- ADDRESSES: struct (nullable = true)
| | |-- ADDRESS: struct (nullable = true)
| | | |-- _city: string (nullable = true)
| | | |-- _primary: long (nullable = true)
| | | |-- _state: string (nullable = true)
| | | |-- _street: string (nullable = true)
| | | |-- _type: string (nullable = true)
| | | |-- _zip: long (nullable = true)
| | | |-- false: string (nullable = true)
| |-- CONTACTS: struct (nullable = true)
| | |-- CONTACT: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _name: string (nullable = true)
| | | | |-- _type: string (nullable = true)
| | | | |-- false: string (nullable = true)
| |-- DATES: struct (nullable = true)
| | |-- _completed: string (nullable = true)
| | |-- _contacted: string (nullable = true)
| | |-- _entered: string (nullable = true)
| | |-- _inspected: string (nullable = true)
| | |-- _loss: string (nullable = true)
| | |-- _received: string (nullable = true)
| | |-- false: string (nullable = true)
| |-- ESTIMATE_INFO: struct (nullable = true)
| | |-- _causeOfLoss: string (nullable = true)
| | |-- _claimNumber: string (nullable = true)
| | |-- _deprMat: long (nullable = true)
| | |-- _deprNonMat: long (nullable = true)
| | |-- _deprOandP: long (nullable = true)
| | |-- _deprRemoval: long (nullable = true)
| | |-- _deprTaxes: long (nullable = true)
| | |-- _estimateName: string (nullable = true)
| | |-- _estimateType: string (nullable = true)
| | |-- _laborEff: string (nullable = true)
| | |-- _policyNumber: string (nullable = true)
| | |-- _priceList: string (nullable = true)
| | |-- _roofDamage: long (nullable = true)
| | |-- _typeOfLoss: string (nullable = true)
| | |-- false: string (nullable = true)
|-- HEADER: struct (nullable = true)
| |-- _compName: string (nullable = true)
| |-- _dateCreated: string (nullable = true)
| |-- false: string (nullable = true)
|-- LINE_ITEM_DETAIL: struct (nullable = true)
| |-- COV_BREAKDOWN: struct (nullable = true)
| | |-- COV_AMOUNTS: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _desc: string (nullable = true)
| | | | |-- false: string (nullable = true)
|-- RECAP_BY_ROOM: struct (nullable = true)
| |-- RECAP_GROUP: struct (nullable = true)
| | |-- _desc: string (nullable = true)
| | |-- false: string (nullable = true)
|-- _majorVersion: long (nullable = true)
|-- _minorVersion: long (nullable = true)
|-- _transactionId: long (nullable = true)
|-- _xmlns: string (nullable = true)
Having inspected the DataFrame above, you can simplify it as follows:
from pyspark.sql import functions as f

# Struct columns are flattened with ".*"; array columns are selected as they are.
df.select(
    f.col('COVERAGES.COVERAGE'),
    f.col('COVERSHEET.ADDRESSES.ADDRESS.*'),
    f.col('COVERSHEET.CONTACTS.CONTACT'),
    f.col('COVERSHEET.DATES.*'),
    f.col('COVERSHEET.ESTIMATE_INFO.*'),
    f.col('HEADER.*'),
    f.col('LINE_ITEM_DETAIL.COV_BREAKDOWN.COV_AMOUNTS'),
    f.col('RECAP_BY_ROOM.RECAP_GROUP.*'),
    f.col('_majorVersion'),
    f.col('_minorVersion'),
    f.col('_transactionId'),
    f.col('_xmlns')
).show(truncate=False)
This should give you a DataFrame with the output and schema shown below:
+-----------------------------------------------------------------+-----+--------+------+-------+--------+----+-----+---------------------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+------------+------------+--------+-----------+----------+------------+----------+-----------------+-------------+---------------------------+-------------+----------+-----------+-----------+-----+---------+-------------------+-----+-----------------------------------------------+-----------------+-----+-------------+-------------+--------------+-------------------------------+
|COVERAGE |_city|_primary|_state|_street|_type |_zip|false|CONTACT |_completed |_contacted |_entered |_inspected |_loss |_received |false|_causeOfLoss|_claimNumber|_deprMat|_deprNonMat|_deprOandP|_deprRemoval|_deprTaxes|_estimateName |_estimateType|_laborEff |_policyNumber|_priceList|_roofDamage|_typeOfLoss|false|_compName|_dateCreated |false|COV_AMOUNTS |_desc |false|_majorVersion|_minorVersion|_transactionId|_xmlns |
+-----------------------------------------------------------------+-----+--------+------+-------+--------+----+-----+---------------------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+------------+------------+--------+-----------+----------+------------+----------+-----------------+-------------+---------------------------+-------------+----------+-----------+-----------+-----+---------+-------------------+-----+-----------------------------------------------+-----------------+-----+-------------+-------------+--------------+-------------------------------+
|[[Dwelling, 0, 1,], [Other Structures, 1, 2,], [Contents, 2, 3,]]|Lehi |1 |UT |Pkwy |Property|0 |null |[[Vytvyfv, ClaimRep,], [Vytvyfv, Estimator,]]|2018-03-09T09:38:49Z|2016-09-28T19:39:26Z|2016-09-28T19:39:07Z|2016-09-28T19:39:27Z|2016-09-28T19:38:23Z|2016-09-28T19:39:24Z|null |Collapse |Hdchtdhtdh |1 |1 |1 |1 |1 |2016-09-28-133907|Mixed |Restoration/Service/Remodel|Utfhtdhtd |YHTRDF |0 |Collapse |null |ABGROUP |2018-03-09T09:38:51|null |[[Dwelling,], [Other Structures,], [Contents,]]|2016-09-28-133907|null |28 |300 |0 |http://xactware.com/generic.xsd|
+-----------------------------------------------------------------+-----+--------+------+-------+--------+----+-----+---------------------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+------------+------------+--------+-----------+----------+------------+----------+-----------------+-------------+---------------------------+-------------+----------+-----------+-----------+-----+---------+-------------------+-----+-----------------------------------------------+-----------------+-----+-------------+-------------+--------------+-------------------------------+
root
|-- COVERAGE: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _coverageName: string (nullable = true)
| | |-- _coverageType: long (nullable = true)
| | |-- _id: long (nullable = true)
| | |-- false: string (nullable = true)
|-- _city: string (nullable = true)
|-- _primary: long (nullable = true)
|-- _state: string (nullable = true)
|-- _street: string (nullable = true)
|-- _type: string (nullable = true)
|-- _zip: long (nullable = true)
|-- false: string (nullable = true)
|-- CONTACT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _name: string (nullable = true)
| | |-- _type: string (nullable = true)
| | |-- false: string (nullable = true)
|-- _completed: string (nullable = true)
|-- _contacted: string (nullable = true)
|-- _entered: string (nullable = true)
|-- _inspected: string (nullable = true)
|-- _loss: string (nullable = true)
|-- _received: string (nullable = true)
|-- false: string (nullable = true)
|-- _causeOfLoss: string (nullable = true)
|-- _claimNumber: string (nullable = true)
|-- _deprMat: long (nullable = true)
|-- _deprNonMat: long (nullable = true)
|-- _deprOandP: long (nullable = true)
|-- _deprRemoval: long (nullable = true)
|-- _deprTaxes: long (nullable = true)
|-- _estimateName: string (nullable = true)
|-- _estimateType: string (nullable = true)
|-- _laborEff: string (nullable = true)
|-- _policyNumber: string (nullable = true)
|-- _priceList: string (nullable = true)
|-- _roofDamage: long (nullable = true)
|-- _typeOfLoss: string (nullable = true)
|-- false: string (nullable = true)
|-- _compName: string (nullable = true)
|-- _dateCreated: string (nullable = true)
|-- false: string (nullable = true)
|-- COV_AMOUNTS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _desc: string (nullable = true)
| | |-- false: string (nullable = true)
|-- _desc: string (nullable = true)
|-- false: string (nullable = true)
|-- _majorVersion: long (nullable = true)
|-- _minorVersion: long (nullable = true)
|-- _transactionId: long (nullable = true)
|-- _xmlns: string (nullable = true)
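The flattened schema above contains several top-level columns with the same name (`false` appears five times), which makes them awkward to reference later. One possible way around this is to derive the select list from the schema and alias each column by its full path. The sketch below is an illustration rather than part of the original answer: `leaf_paths` is a hypothetical helper, and it works on the dictionary form of a schema, which PySpark returns from `df.schema.jsonValue()`. The sample dictionary is a hand-written fragment of the schema shown earlier.

```python
def leaf_paths(struct_json, prefix=""):
    """Return dotted paths to every non-struct field under a struct schema.

    `struct_json` is the dict form of a StructType, for example the result
    of df.schema.jsonValue(). Arrays are treated as leaves and are not
    descended into, matching the manual select in the answer.
    """
    paths = []
    for field in struct_json["fields"]:
        path = prefix + field["name"]
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            # Nested struct: recurse and extend the dotted prefix.
            paths.extend(leaf_paths(field_type, path + "."))
        else:
            # Primitive or array: keep the column as a leaf.
            paths.append(path)
    return paths


# Hand-written fragment of the schema, in jsonValue() form (illustrative only).
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "HEADER", "type": {"type": "struct", "fields": [
            {"name": "_compName", "type": "string"},
            {"name": "false", "type": "string"},
        ]}},
        {"name": "COVERAGES", "type": {"type": "struct", "fields": [
            {"name": "COVERAGE", "type": {
                "type": "array",
                "elementType": {"type": "struct", "fields": []},
            }},
        ]}},
        {"name": "_majorVersion", "type": "long"},
    ],
}

paths = leaf_paths(sample_schema)
print(paths)
```

With a real DataFrame, the paths could be used as `df.select([f.col(p).alias(p.replace(".", "__")) for p in leaf_paths(df.schema.jsonValue())])`, so that `HEADER.false` and `COVERSHEET.DATES.false` end up as distinct columns. The `__` separator is an arbitrary choice.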
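You can now turn this into multiple rows based on the COVERAGE, CONTACT, or COV_AMOUNTS column, since these are the only columns that can be exploded into multiple rows.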
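The answer stops short of showing the explode step, so here is a rough sketch of how it might look. It assumes `df` is the DataFrame read with rowTag `generic`; `ARRAY_PATHS`, `explode_array`, and `explode_all` are hypothetical names introduced for illustration. Each array is exploded into its own DataFrame, because chaining several explodes on one DataFrame would produce every combination of the array elements.

```python
# Dotted paths of the three array columns in the schema printed earlier.
ARRAY_PATHS = {
    "coverage": "COVERAGES.COVERAGE",
    "contact": "COVERSHEET.CONTACTS.CONTACT",
    "cov_amounts": "LINE_ITEM_DETAIL.COV_BREAKDOWN.COV_AMOUNTS",
}


def explode_array(df, array_path, alias):
    """Return one row per element of the array found at `array_path`.

    `df` is the DataFrame read with rowTag "generic"; `alias` is a
    temporary name for the exploded struct column.
    """
    # Imported here so the definitions load even where PySpark is absent.
    from pyspark.sql import functions as f

    exploded = df.select(f.explode(f.col(array_path)).alias(alias))
    # Expand the exploded struct into ordinary top-level columns.
    return exploded.select(alias + ".*")


def explode_all(df):
    """Build one exploded DataFrame per array column, keyed by a short name."""
    return {
        name: explode_array(df, path, name)
        for name, path in ARRAY_PATHS.items()
    }
```

For the sample file, `explode_all(df)["coverage"].show()` should list one row per COVERAGE element, with the columns `_coverageName`, `_coverageType`, `_id`, and `false`.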
I hope the answer is helpful.
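Discussion:
Great, Ramesh, but I can't read it. It gives me an empty DataFrame. I have already removed the line separators in the XML file manually. Did you use a program to remove them?
I removed them manually too. Try tracing the error piece by piece; for example, try reading only the first part.
That error is solved. Two things: what does valueTag do? And why do we use df.select followed by the columns? What exactly is happening there?
Quoting the documentation: "valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE." It was creating null values, so I set it to false. The df.select with those columns simply flattens the struct columns. Array columns cannot be flattened as easily, so I left them as they are; all the struct columns are flattened.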