Transposing DataFrame columns in Spark Scala [duplicate]
【Posted】: 2019-02-23 03:48:59
【Question】: I am having trouble transposing columns in a DataFrame. The base DataFrame and the expected output are given below.
Student  Class      Subject  Grade
Sam      6th Grade  Maths    A
Sam      6th Grade  Science  A
Sam      7th Grade  Maths    A-
Sam      7th Grade  Science  A
Rob      6th Grade  Maths    A
Rob      6th Grade  Science  A-
Rob      7th Grade  Maths    A-
Rob      7th Grade  Science  B
Rob      7th Grade  AP       A
Expected output:
Student  Class      Math_Grade  Science_Grade  AP_Grade
Sam      6th Grade  A           A
Sam      7th Grade  A-          A
Rob      6th Grade  A           A-
Rob      7th Grade  A-          B              A
Please suggest the best way to solve this.
【Answer 1】: You can group the DataFrame by Student and Class and pivot on Subject, as shown below:
import org.apache.spark.sql.functions._
// Needed for .toDF outside spark-shell; assumes a SparkSession named `spark` is in scope.
import spark.implicits._

val df = Seq(
  ("Sam", "6th Grade", "Maths", "A"),
  ("Sam", "6th Grade", "Science", "A"),
  ("Sam", "7th Grade", "Maths", "A-"),
  ("Sam", "7th Grade", "Science", "A"),
  ("Rob", "6th Grade", "Maths", "A"),
  ("Rob", "6th Grade", "Science", "A-"),
  ("Rob", "7th Grade", "Maths", "A-"),
  ("Rob", "7th Grade", "Science", "B"),
  ("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")

// One output row per (Student, Class); each distinct Subject becomes a column.
df.groupBy("Student", "Class")
  .pivot("Subject")
  .agg(first("Grade"))
  .orderBy("Student", "Class")
  .show()

// +-------+---------+----+-----+-------+
// |Student|    Class|  AP|Maths|Science|
// +-------+---------+----+-----+-------+
// |    Rob|6th Grade|null|    A|     A-|
// |    Rob|7th Grade|   A|   A-|      B|
// |    Sam|6th Grade|null|    A|      A|
// |    Sam|7th Grade|null|   A-|      A|
// +-------+---------+----+-----+-------+
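The pivoted columns above are named after the raw Subject values and come out in alphabetical order, while the question asks for Math_Grade, Science_Grade, AP_Grade with blanks for missing grades. A minimal sketch of one way to get closer to that layout follows. It assumes a local SparkSession (in spark-shell, `spark` already exists), and the `renames` mapping is a hypothetical helper introduced here, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Sam", "6th Grade", "Maths", "A"),
  ("Sam", "6th Grade", "Science", "A"),
  ("Sam", "7th Grade", "Maths", "A-"),
  ("Sam", "7th Grade", "Science", "A"),
  ("Rob", "6th Grade", "Maths", "A"),
  ("Rob", "6th Grade", "Science", "A-"),
  ("Rob", "7th Grade", "Maths", "A-"),
  ("Rob", "7th Grade", "Science", "B"),
  ("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")

// Hypothetical mapping from Subject values to the headers in the expected output.
val renames = Seq(
  "Maths"   -> "Math_Grade",
  "Science" -> "Science_Grade",
  "AP"      -> "AP_Grade"
)

// Passing the pivot values explicitly fixes the column order and lets Spark
// skip the extra pass that discovers the distinct Subject values.
val pivoted = df
  .groupBy("Student", "Class")
  .pivot("Subject", renames.map(_._1))
  .agg(first("Grade"))

// Rename each pivoted column, then replace nulls in string columns with blanks.
val result = renames
  .foldLeft(pivoted) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }
  .na.fill("")

result.orderBy("Student", "Class").show()
```

Listing the pivot values by hand only works when the set of subjects is known up front; a subject missing from `renames` would be silently dropped from the result.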
【Answer 2】: You can simply use pivot and group by the relevant columns.
import org.apache.spark.sql.functions._
// Needed for .toDF() outside spark-shell; assumes a SparkSession named `spark` is in scope.
import spark.implicits._

case class StudentRecord(Student: String, `Class`: String, Subject: String, Grade: String)

val rows = Seq(
  StudentRecord("Sam", "6th Grade", "Maths", "A"),
  StudentRecord("Sam", "6th Grade", "Science", "A"),
  StudentRecord("Sam", "7th Grade", "Maths", "A-"),
  StudentRecord("Sam", "7th Grade", "Science", "A"),
  StudentRecord("Rob", "6th Grade", "Maths", "A"),
  StudentRecord("Rob", "6th Grade", "Science", "A-"),
  StudentRecord("Rob", "7th Grade", "Maths", "A-"),
  StudentRecord("Rob", "7th Grade", "Science", "B"),
  StudentRecord("Rob", "7th Grade", "AP", "A")
).toDF()

// Sort students in descending order so that Sam appears before Rob.
rows.groupBy("Student", "Class")
  .pivot("Subject")
  .agg(first("Grade"))
  .orderBy(desc("Student"), asc("Class"))
  .show()

/**
 * +-------+---------+----+-----+-------+
 * |Student|    Class|  AP|Maths|Science|
 * +-------+---------+----+-----+-------+
 * |    Sam|6th Grade|null|    A|      A|
 * |    Sam|7th Grade|null|   A-|      A|
 * |    Rob|6th Grade|null|    A|     A-|
 * |    Rob|7th Grade|   A|   A-|      B|
 * +-------+---------+----+-----+-------+
 */
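To make it clearer what groupBy, pivot and first do together, here is a rough sketch of the same reshaping using plain Scala collections, with no Spark involved. It only illustrates the semantics on the question's sample rows and is not how Spark executes a pivot; the name `pivoted` is introduced here for illustration.

```scala
// Same sample rows as in the answers: (Student, Class, Subject, Grade).
val rows = Seq(
  ("Sam", "6th Grade", "Maths", "A"),
  ("Sam", "6th Grade", "Science", "A"),
  ("Sam", "7th Grade", "Maths", "A-"),
  ("Sam", "7th Grade", "Science", "A"),
  ("Rob", "6th Grade", "Maths", "A"),
  ("Rob", "6th Grade", "Science", "A-"),
  ("Rob", "7th Grade", "Maths", "A-"),
  ("Rob", "7th Grade", "Science", "B"),
  ("Rob", "7th Grade", "AP", "A")
)

// Group by (Student, Class); within each group, keep the first Grade per Subject.
val pivoted: Map[(String, String), Map[String, String]] =
  rows
    .groupBy { case (student, cls, _, _) => (student, cls) }
    .map { case (key, group) =>
      key -> group.groupBy(_._3).map { case (subject, rs) => subject -> rs.head._4 }
    }

// A Subject with no row for a given key is simply absent from the inner map,
// which corresponds to the null cells in the Spark output.
println(pivoted(("Rob", "7th Grade")).get("AP"))
println(pivoted(("Sam", "7th Grade")).get("AP"))
```

In this data each (Student, Class, Subject) combination occurs only once, so taking the first grade is unambiguous. If duplicates were possible, Spark's `first` would depend on row order, and a different aggregate might be a better fit.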