Create nested data after a join in Spark Scala

Posted 2023-03-23


【Title】Create a nested data after join in Spark Scala 【Posted】2018-05-23 09:50:52 【Question】:

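My goal is to prepare a DataFrame in Spark/Hadoop that I will then index in Elasticsearch.

I have two ORC tables: client and person. The relationship between them is one-to-many:

one client can have many persons.

I will be using Spark/Spark SQL, so let's describe the data in terms of DataFrames:

Client DataFrame schema: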

root
 |-- client_id: string (nullable = true)
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

Person DataFrame schema:

root
 |-- person_id: string (nullable = true)
 |-- p1: string (nullable = true)
 |-- p2: string (nullable = true)
 |-- p3: string (nullable = true)
 |-- client_id: string (nullable = true)

My goal is to generate a DataFrame with this schema:

root
 |-- client_id: string (nullable = true)
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- person_id: string (nullable = true)
 |    |    |-- p1: string (nullable = true)
 |    |    |-- p2: string (nullable = true)
 |    |    |-- p3: string (nullable = true)
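One way to read this target schema is as a pair of Scala case classes: each client row carries a sequence of person entries, and client_id no longer appears inside the nested element because it lives on the parent. The sketch below is only an illustration in plain Scala (no Spark); the names PersonEntry, ClientDoc and TargetShape, and the sample values, are hypothetical and not part of the question.

```scala
// Hypothetical case classes mirroring the target schema (names are illustrative).
final case class PersonEntry(person_id: String, p1: String, p2: String, p3: String)

final case class ClientDoc(
  client_id: String,
  c1: String,
  c2: String,
  c3: String,
  persons: Seq[PersonEntry] // the nested "persons" array of structs
)

object TargetShape {
  // Made-up sample document: one client with two nested persons.
  val example: ClientDoc = ClientDoc(
    "1", "a", "b", "c",
    Seq(PersonEntry("p1", "a", "b", "c"), PersonEntry("p2", "a", "b", "c"))
  )

  def main(args: Array[String]): Unit =
    println(example)
}
```

Each ClientDoc corresponds to one document that would be indexed, with its persons embedded rather than stored as separate rows.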

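How can I achieve this?

Thanks in advance for your help.

【Comments】:

Join, then ***.com/questions/43357727/…

【Answer 1】:

You can group the person DataFrame by client_id, collecting all of its other columns into a list, and then join the result with the client DataFrame, as shown below.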

// Imports needed for toDF, collect_list and struct
// (assumes an existing SparkSession named `spark`, as in spark-shell)
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

//client data 
val client = Seq(
  ("1", "a", "b", "c"),
  ("2", "a", "b", "c"),
  ("3", "a", "b", "c")
).toDF("client_id", "c1", "c2", "c3")

//person data 
val person = Seq(
  ("p1", "a", "b", "c", "1"),
  ("p2", "a", "b", "c", "1"),
  ("p1", "a", "b", "c", "2")
).toDF("person_id", "p1", "p2", "p3", "client_id")

//Group the person data by client_id and collect the remaining columns
//into a list of structs, one struct per person
val groupedPerson = person.groupBy("client_id")
  .agg(collect_list(struct("person_id", "p1", "p2", "p3")).as("persons"))


//Left join so that clients without any person are kept (persons is null for them)
val resultDF = client.join(groupedPerson, Seq("client_id"), "left")

resultDF.show(false)

Schema:

root
 |-- client_id: string (nullable = true)
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- person_id: string (nullable = true)
 |    |    |-- p1: string (nullable = true)
 |    |    |-- p2: string (nullable = true)
 |    |    |-- p3: string (nullable = true)

Output:

+---------+---+---+---+------------------------+
|client_id|c1 |c2 |c3 |persons                 |
+---------+---+---+---+------------------------+
|1        |a  |b  |c  |[[p1,a,b,c], [p2,a,b,c]]|
|2        |a  |b  |c  |[[p1,a,b,c]]            |
|3        |a  |b  |c  |null                    |
+---------+---+---+---+------------------------+
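To see why client 3 ends up with null, the same two steps (group the persons by client_id, then left join onto the clients) can be sketched on plain Scala collections, without Spark. This is only an analogue of the logic, not the Spark API: the names Client, Person, ClientWithPersons and NestAfterJoin are hypothetical, and Option stands in for a nullable column.

```scala
// Plain-Scala analogue of groupBy + collect_list followed by a left join.
final case class Client(clientId: String, c1: String, c2: String, c3: String)
final case class Person(personId: String, p1: String, p2: String, p3: String, clientId: String)
final case class ClientWithPersons(client: Client, persons: Option[Seq[Person]])

object NestAfterJoin {
  def nest(clients: Seq[Client], persons: Seq[Person]): Seq[ClientWithPersons] = {
    // Step 1: group persons by their foreign key (like groupBy + collect_list).
    val byClient: Map[String, Seq[Person]] = persons.groupBy(_.clientId)
    // Step 2: left join; every client is kept, and None plays the role of null.
    clients.map(c => ClientWithPersons(c, byClient.get(c.clientId)))
  }

  def main(args: Array[String]): Unit = {
    val clients = Seq(
      Client("1", "a", "b", "c"),
      Client("2", "a", "b", "c"),
      Client("3", "a", "b", "c")
    )
    val persons = Seq(
      Person("p1", "a", "b", "c", "1"),
      Person("p2", "a", "b", "c", "1"),
      Person("p1", "a", "b", "c", "2")
    )
    nest(clients, persons).foreach(println)
  }
}
```

Client 3 has no entry in the grouped map, so the lookup yields None, just as the left join yields null for persons. An inner join would instead drop that client from the result.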

Hope this helps!
