Transposing DataFrame columns in Spark Scala [duplicate]

【Posted】: 2019-02-23 03:48:59 【Question】:

I'm having trouble transposing columns in a DataFrame. The base DataFrame and the expected output are given below.

Student    Class         Subject      Grade    
Sam        6th Grade     Maths        A
Sam        6th Grade     Science      A
Sam        7th Grade     Maths        A-
Sam        7th Grade     Science      A
Rob        6th Grade     Maths        A
Rob        6th Grade     Science      A-
Rob        7th Grade     Maths        A-
Rob        7th Grade     Science      B
Rob        7th Grade     AP           A

Expected output:

Student Class        Math_Grade  Science_Grade  AP_Grade
Sam     6th Grade    A           A  
Sam     7th Grade    A-          A  
Rob     6th Grade    A           A- 
Rob     7th Grade    A-          B               A

Please suggest the best way to solve this.

【Answer 1】:

You can group the DataFrame by Student and Class, then pivot on Subject, as follows:

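import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF; assumes a SparkSession named `spark`, as in spark-shell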

val df = Seq(
  ("Sam", "6th Grade", "Maths", "A"),
  ("Sam", "6th Grade", "Science", "A"),
  ("Sam", "7th Grade", "Maths", "A-"),
  ("Sam", "7th Grade", "Science", "A"),
  ("Rob", "6th Grade", "Maths", "A"),
  ("Rob", "6th Grade", "Science", "A-"),
  ("Rob", "7th Grade", "Maths", "A-"),
  ("Rob", "7th Grade", "Science", "B"),
  ("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")

df.
  groupBy("Student", "Class").pivot("Subject").agg(first("Grade")).
  orderBy("Student", "Class").
  show
// +-------+---------+----+-----+-------+
// |Student|    Class|  AP|Maths|Science|
// +-------+---------+----+-----+-------+
// |    Rob|6th Grade|null|    A|     A-|
// |    Rob|7th Grade|   A|   A-|      B|
// |    Sam|6th Grade|null|    A|      A|
// |    Sam|7th Grade|null|   A-|      A|
// +-------+---------+----+-----+-------+
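
The pivoted columns above are named after the raw Subject values, while the question asks for names ending in _Grade and blank cells instead of null. A minimal sketch of one way to get closer to that layout is shown here. It assumes the list of subjects is known in advance (the `subjects` value is an illustrative name, not part of the original answer) and it creates a local SparkSession only so the snippet is self-contained. Note that it yields Maths_Grade, not the Math_Grade spelling used in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Sam", "6th Grade", "Maths", "A"),
  ("Sam", "6th Grade", "Science", "A"),
  ("Sam", "7th Grade", "Maths", "A-"),
  ("Sam", "7th Grade", "Science", "A"),
  ("Rob", "6th Grade", "Maths", "A"),
  ("Rob", "6th Grade", "Science", "A-"),
  ("Rob", "7th Grade", "Maths", "A-"),
  ("Rob", "7th Grade", "Science", "B"),
  ("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")

// Assumption: the subjects are known up front. Listing them fixes the
// column order and spares Spark the extra job that discovers distinct values.
val subjects = Seq("Maths", "Science", "AP")

val pivoted = df
  .groupBy("Student", "Class")
  .pivot("Subject", subjects)
  .agg(first("Grade"))

// Append "_Grade" to every pivoted column name.
val renamed = subjects.foldLeft(pivoted) { (acc, subject) =>
  acc.withColumnRenamed(subject, subject + "_Grade")
}

// Replace null with an empty string in the string columns.
val result = renamed.na.fill("").orderBy("Student", "Class")

result.show()
```

If the subjects are not known in advance, dropping the second argument to pivot and renaming every column other than Student and Class should work the same way, at the cost of the extra pass over the data.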

【Answer 2】:

You just need to pivot and group based on the columns.

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF; assumes a SparkSession named `spark`, as in spark-shell

case class StudentRecord(Student: String, `Class`: String, Subject: String, Grade: String)

val rows = Seq(
  StudentRecord("Sam", "6th Grade", "Maths", "A"),
  StudentRecord("Sam", "6th Grade", "Science", "A"),
  StudentRecord("Sam", "7th Grade", "Maths", "A-"),
  StudentRecord("Sam", "7th Grade", "Science", "A"),
  StudentRecord("Rob", "6th Grade", "Maths", "A"),
  StudentRecord("Rob", "6th Grade", "Science", "A-"),
  StudentRecord("Rob", "7th Grade", "Maths", "A-"),
  StudentRecord("Rob", "7th Grade", "Science", "B"),
  StudentRecord("Rob", "7th Grade", "AP", "A")
).toDF()

// Group by student and class, turn each distinct Subject into a column,
// and keep the grade found for each (Student, Class, Subject) cell.
rows
  .groupBy("Student", "Class")
  .pivot("Subject")
  .agg(first("Grade"))
  .orderBy(desc("Student"), asc("Class"))
  .show()


 /**
  * +-------+---------+----+-----+-------+
  * |Student|    Class|  AP|Maths|Science|
  * +-------+---------+----+-----+-------+
  * |    Sam|6th Grade|null|    A|      A|
  * |    Sam|7th Grade|null|   A-|      A|
  * |    Rob|6th Grade|null|    A|     A-|
  * |    Rob|7th Grade|   A|   A-|      B|
  * +-------+---------+----+-----+-------+
  */
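
Both answers aggregate with first("Grade"), which keeps a single value per cell. That is safe for the sample data, where each (Student, Class, Subject) combination appears once, but it would silently drop grades if a combination were repeated. The sketch below shows one way to check for such repeats before pivoting. The duplicate row and the `grades` and `duplicates` names are hypothetical, added only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("duplicate-check-sketch").getOrCreate()
import spark.implicits._

val grades = Seq(
  ("Sam", "6th Grade", "Maths", "A"),
  ("Sam", "6th Grade", "Maths", "B"),  // hypothetical repeat, not in the original data
  ("Sam", "6th Grade", "Science", "A"),
  ("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")

// Count rows per would-be pivot cell and keep only the cells with more than one row.
val duplicates = grades
  .groupBy("Student", "Class", "Subject")
  .count()
  .filter(col("count") > 1)

duplicates.show()
```

An empty result means first("Grade") loses nothing. Otherwise an aggregate such as collect_list("Grade") may be a better fit than first.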
