Selecting nested columns from a PySpark DataFrame using spark-xml

Posted 2023-04-15

【Title】Selecting nested columns from pyspark dataframe using spark-xml
【Posted】2018-06-15 05:47:33
【Question】:

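I am trying to select a nested ArrayType column from a PySpark DataFrame.

I want to select only the items column from this DataFrame, and I can't tell what I am doing wrong.

XML: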

<?xml version="1.0" encoding="utf-8"?>
<shiporder orderid="str1234">
  <orderperson>ABC</orderperson>
  <shipto>
    <name>XYZ</name>
    <address>305, Ram CHowk</address>
    <city>Pune</city>
    <country>IN</country>
  </shipto>
  <items>
  <item>
    <title>Clothing</title>
    <notes>
        <note>Brand:CK</note>
        <note>Size:L</note>
    </notes>
    <quantity>6</quantity>
    <price>208</price>
  </item>
  </items>
</shiporder>

Schema of the DataFrame:

root
 |-- _orderid: string (nullable = true)
 |-- items: struct (nullable = true)
 |    |-- item: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- notes: struct (nullable = true)
 |    |    |    |    |-- note: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- quantity: long (nullable = true)
 |    |    |    |-- title: string (nullable = true)
 |-- orderperson: string (nullable = true)
 |-- shipto: struct (nullable = true)
 |    |-- address: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- name: string (nullable = true)
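As a cross-check that does not depend on Spark, the nested items can be read straight from the XML with Python's standard library. This is only a minimal sketch: the `SAMPLE` string is a trimmed copy of the XML above, and `parse_items` is a hypothetical helper name, not part of spark-xml. It returns one dict per `<item>`, shaped like an element of the `items.item` array in the schema.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question (shipto omitted).
SAMPLE = """<shiporder orderid="str1234">
  <orderperson>ABC</orderperson>
  <items>
    <item>
      <title>Clothing</title>
      <notes>
        <note>Brand:CK</note>
        <note>Size:L</note>
      </notes>
      <quantity>6</quantity>
      <price>208</price>
    </item>
  </items>
</shiporder>"""


def parse_items(xml_text):
    """Return one dict per <item>, mirroring the items.item struct fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("./items/item"):
        rows.append({
            # notes.note is an array of strings in the schema
            "notes": [note.text for note in item.findall("./notes/note")],
            "price": float(item.findtext("price")),      # double in the schema
            "quantity": int(item.findtext("quantity")),  # long in the schema
            "title": item.findtext("title"),
        })
    return rows


if __name__ == "__main__":
    print(parse_items(SAMPLE))
```

If this prints non-empty rows while Spark shows null for the same file, the data itself is fine and the problem lies in how the column is read.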




df.show(truncate=False)
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
|_orderid|items                                                                                        |orderperson  |shipto                         |
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
|str1234 |[[[[[color:Brown, Size:12]], 82.0, 1, Footwear], [[[Brand:CK, Size:L]], 208.0, 6, Clothing]]]|Vikrant Chand|[305, Giotto, Irvine, US, Amit]|
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+

When I select the items column, it returns null.

df.select([ 'items']).show()
+-----+
|items|
+-----+
| null|
+-----+

However, selecting the same column together with shipto (another nested column) returns the values as expected.

df.select([ 'items','shipto']).show()
+--------------------+--------------------+
|               items|              shipto|
+--------------------+--------------------+
|[[[[[color:Brown,...|[305, Giotto, Irv...|
+--------------------+--------------------+

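【Comments】:

Just use df.select('items').show() without the square brackets.

@RameshMaharjan I tried that. No luck.

I tried both ways and it worked for me, so I can't say what is wrong.

I have updated the question with the XML dataset. Can you try with the same data? I am unable to see the column values using either pyspark or scala-spark.

Fixed it by upgrading spark-xml to version 0.4.1.

【Answer 1】:

This was a bug in spark-xml that has been fixed in version 0.4.1.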

Issue-193
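One way to pick up the fixed release from PySpark is to request it through `spark.jars.packages` when the session is created. The sketch below rests on several assumptions: the Scala suffix `2.11` must match your Spark build, `spark_xml_coordinate` and `select_items` are hypothetical helper names, and the `rowTag` value `shiporder` is taken from the sample XML in the question.

```python
def spark_xml_coordinate(version="0.4.1", scala_binary="2.11"):
    """Build the Maven coordinate for spark-xml.

    The Scala binary suffix is an assumption; use the one your Spark
    distribution was built with.
    """
    return "com.databricks:spark-xml_{}:{}".format(scala_binary, version)


def select_items(xml_path, version="0.4.1"):
    """Read shiporder rows from xml_path and return only the items column."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark-xml-items")
        # Ask Spark to fetch the requested spark-xml release at startup.
        .config("spark.jars.packages", spark_xml_coordinate(version))
        .getOrCreate()
    )
    df = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "shiporder")  # each <shiporder> element becomes one row
        .load(xml_path)
    )
    return df.select("items")
```

A call such as `select_items("shiporder.xml").show(truncate=False)` would then display the column, where the file name is a placeholder. Note that `spark.jars.packages` only takes effect if no Spark session is already running in the process.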
