Apache Spark concatenate multiple rows into list in single row [duplicate]

Posted: 2017-09-29 04:57:27

Question:

I need to create a table (Hive table / Spark DataFrame) from a source table, collecting each user's multiple rows into a list stored in a single row.

User table:
Schema: userid: string | transactiondate: string | charges: string | events: array<struct<name:string,value:string>>
----|------------|-------| ---------------------------------------
123 | 2017-09-01 | 20.00 | ["name":"chargeperiod","value":"this"]
123 | 2017-09-01 | 30.00 | ["name":"chargeperiod","value":"last"]
123 | 2017-09-01 | 20.00 | ["name":"chargeperiod","value":"recent"]
123 | 2017-09-01 | 30.00 | ["name":"chargeperiod","value":"0"]
456 | 2017-09-01 | 20.00 | ["name":"chargeperiod","value":"this"]
456 | 2017-09-01 | 30.00 | ["name":"chargeperiod","value":"last"]
456 | 2017-09-01 | 20.00 | ["name":"chargeperiod","value":"recent"]
456 | 2017-09-01 | 30.00 | ["name":"chargeperiod","value":"0"]

The output table should be:

userid:String | concatenatedlist :List[Row]
-------|-----------------
123    | [[2017-09-01,20.00,["name":"chargeperiod","value":"this"]],[2017-09-01,30.00,["name":"chargeperiod","value":"last"]],[2017-09-01,20.00,["name":"chargeperiod","value":"recent"]], [2017-09-01,30.00, ["name":"chargeperiod","value":"0"]]]
456    | [[2017-09-01,20.00,["name":"chargeperiod","value":"this"]],[2017-09-01,30.00,["name":"chargeperiod","value":"last"]],[2017-09-01,20.00,["name":"chargeperiod","value":"recent"]], [2017-09-01,30.00, ["name":"chargeperiod","value":"0"]]]

Spark version: 1.6.2

Comments:

Which language are you using? Scala?

@moe That is not a solution to this question.

Answer 1:
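import org.apache.spark.sql.functions.{collect_list, concat_ws}
import sqlContext.implicits._  // assumes a sqlContext in scope (e.g. spark-shell); enables toDF and $"col"

// Sample data: one row per transaction, with a plain string standing in for the array column
val df = Seq(("1", "2017-02-01", "20.00", "abc"),
  ("1", "2017-02-01", "30.00", "abc2"),
  ("2", "2017-02-01", "20.00", "abc"),
  ("2", "2017-02-01", "30.00", "abc"))
.toDF("id", "date", "amt", "array")

// Join each row's fields into one string, then collect and join those strings per id.
// Note: on Spark 1.6, collect_list is only available when sqlContext is a HiveContext.
df.withColumn("new", concat_ws(",", $"date", $"amt", $"array"))
  .select("id", "new")
  .groupBy("id")
  .agg(concat_ws(",", collect_list("new")))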

Comments:

Thanks @hd16. concat_ws works for Array[String] but not for array<struct<name:string,value:string>>.
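
Since the comment above notes that concat_ws cannot handle the array-of-struct column, here is a minimal sketch of one alternative: pack the per-row columns into a struct and collect the structs per user. This is not from the original thread. It assumes Spark 2.0 or later (a SparkSession and a native collect_list); as far as I know, the Hive-backed collect_list in Spark 1.6 accepts only primitive types, so this would likely not run unchanged on the asker's 1.6.2. The names Event, Txn, CollectStructs and collectPerUser are hypothetical and only mirror the schema shown in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical case classes mirroring the schema in the question.
case class Event(name: String, value: String)
case class Txn(userid: String, transactiondate: String, charges: String, events: Seq[Event])

object CollectStructs {
  // Pack the per-row columns into one struct, then collect the structs per user.
  // The order of elements inside each collected list is not guaranteed.
  def collectPerUser(df: DataFrame): DataFrame =
    df.groupBy(col("userid"))
      .agg(
        collect_list(struct(col("transactiondate"), col("charges"), col("events")))
          .as("concatenatedlist"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session, not the asker's 1.6.2 environment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-structs")
      .getOrCreate()
    import spark.implicits._

    // A few rows in the shape of the question's user table.
    val df = Seq(
      Txn("123", "2017-09-01", "20.00", Seq(Event("chargeperiod", "this"))),
      Txn("123", "2017-09-01", "30.00", Seq(Event("chargeperiod", "last"))),
      Txn("456", "2017-09-01", "20.00", Seq(Event("chargeperiod", "this")))
    ).toDF()

    val result = collectPerUser(df)
    result.printSchema()           // concatenatedlist: array<struct<...>>
    result.show(truncate = false)
    spark.stop()
  }
}
```

The result keeps one row per userid, with concatenatedlist typed as an array of structs rather than a flattened string, which is closer to the List[Row] shape the question asks for.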
