How to create a table from each nested node of an XML using pyspark

Posted 2023-04-15

Asked: 2020-11-24 11:52:26

Question:

I have a nested XML structure as follows:

<parent>
<root1 detail = "something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root1>

<root2 detail = "something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root2>
</parent>

I want to create 2 tables, each with the following columns and data:

Schema:

detail string
ID string
type string

Records:

detail        ID     type
something     id1   typeA
something     id2   typeB
something     id3   typeC

I have tried using

   spark.read.format(file_type) \
      .option("rootTag", "root1") \
      .option("rowTag", "ID") \
      .load(file_location)

But this only yields description (string) and ID (array) as columns.

Thanks in advance!

Comments:

I assume you are working in Python because the question has the pyspark tag, but it is also tagged scala, so are you flexible about the language of the answer?

Answer 1:

It looks like the trick is to extract ID and type by name (_VALUE and _TYPE) from the StructField inside the array that results from reading the xml file:

from pyspark.sql.functions import explode, col

dfs = []

# Number of rootN elements to read (root1, root2, ...)
n = 2

for i in range(1, n + 1):

    # Read every <rootN> element as one row
    df = spark.read.format('xml') \
              .option("rowTag", "root{}".format(i)) \
              .load('file.xml')

    # One output row per <ID> element; explode() names its output column 'col'
    df = df.select([explode('ID'), '_detail']) \
           .withColumn('ID', col('col').getItem('_VALUE')) \
           .withColumn('type', col('col').getItem('_TYPE')) \
           .drop('col') \
           .withColumnRenamed('_detail', 'detail')

    dfs.append(df)

    df.show()

# +---------+---+-----+
# |   detail| ID| type|
# +---------+---+-----+
# |something|id1|typeA|
# |something|id2|typeB|
# |something|id3|typeC|
# +---------+---+-----+
# 
# +---------+---+-----+
# |   detail| ID| type|
# +---------+---+-----+
# |something|id1|typeA|
# |something|id2|typeB|
# |something|id3|typeC|
# +---------+---+-----+
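
To sanity-check the expected rows without a Spark session, the same flattening can be sketched with the standard library alone. This is only an illustration for small files, not a replacement for the Spark code: the function name tables_from_xml and the SAMPLE_XML constant are made up for this example, and the sample is the question's document written as well-formed XML.

```python
from xml.etree import ElementTree

# Sample document from the question, written as well-formed XML.
SAMPLE_XML = """<parent>
<root1 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root1>
<root2 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root2>
</parent>"""


def tables_from_xml(xml_text):
    """Return {child tag: [(detail, ID, type), ...]} for each child of the root."""
    root = ElementTree.fromstring(xml_text)
    tables = {}
    for child in root:
        # The 'detail' attribute is repeated on every row of that child's table.
        detail = child.get("detail")
        tables[child.tag] = [
            (detail, id_node.text, id_node.get("type"))
            for id_node in child.findall("ID")
        ]
    return tables


for name, rows in tables_from_xml(SAMPLE_XML).items():
    print(name, rows)
```

Each key of the returned dictionary corresponds to one of the tables, and each tuple to one (detail, ID, type) record.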

If you don't want to specify the number of tables manually (controlled by the variable n in the code above), you can run this code first:

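from xml.etree import ElementTree

tree = ElementTree.parse("file.xml")
root = tree.getroot()

# Element.getchildren() was removed in Python 3.9; list(root) returns the direct children
children = list(root)

n = 0

# Print each child element and count them
for child in children:
    ElementTree.dump(child)
    n += 1

print("n = {}".format(n))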

# <root1 detail="something">
#     <ID type="typeA">id1</ID>
#     <ID type="typeB">id2</ID>
#     <ID type="typeC">id3</ID>
# </root1>
# 
# <root2 detail="something">
#     <ID type="typeA">id1</ID>
#     <ID type="typeB">id2</ID>
#     <ID type="typeC">id3</ID>
# </root2>
# n = 2
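
Counting the children only helps when the elements are named root1, root2 and so on. A possible variation, sketched here under the assumption that the child names may be arbitrary, is to collect the tag names themselves; the helper child_tags and the shortened sample string are hypothetical and not part of the original answer.

```python
from xml.etree import ElementTree


def child_tags(xml_text):
    """Return the distinct tag names of the root's direct children, in document order."""
    root = ElementTree.fromstring(xml_text)
    seen = []
    for child in root:
        if child.tag not in seen:
            seen.append(child.tag)
    return seen


# Shortened stand-in for the question's document (ID elements omitted).
sample = '<parent><root1 detail="something"/><root2 detail="something"/></parent>'
row_tags = child_tags(sample)
print(row_tags)  # ['root1', 'root2']
```

Each collected name could then be passed as the rowTag option in the read loop, in place of a name built from a counter.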

Comments:

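It throws a "Can't extract value from col#1225: need struct type but got string;" error.

What do you get when you execute df = spark.read.format('xml').option("rowTag","root1").load('file.xml') (replacing 'file.xml' with your file name) and then df.printSchema()? I get:

    root
     |-- ID: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- _VALUE: string (nullable = true)
     |    |    |-- _type: string (nullable = true)
     |-- _detail: string (nullable = true)

Also, I used .option("rowTag","root").option("rootTag","root"). I needed to add rootTag; without it I get the error "root tag must be present".

I got:

    root
     |-- ID: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- detail: string (nullable = true)

Found the issue. I had to enter the maven coordinates and install com.databricks.cml-spark, otherwise some other version of spark gets installed. Now it gives the same schema as yours.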
