How to calculate sum and count in a single groupBy?
Posted: 2016-11-06 12:05:17
Question: Given the following DataFrame:
val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
|  1|    A|  10|
|  2|    A|   5|
|  3|    B|  56|
+---+-----+----+
I want to get the number of IDs and the total amount per category:
+-----+-----+---------+
|Categ|count|sum(Amnt)|
+-----+-----+---------+
|    B|    1|       56|
|    A|    2|       15|
+-----+-----+---------+
Is it possible to compute the count and the sum without a join like the one below?
client.groupBy("Categ").count
  .join(client.withColumnRenamed("Categ","cat")
    .groupBy("cat")
    .sum("Amnt"), 'Categ === 'cat)
  .drop("cat")
Maybe something like this:
client.createOrReplaceTempView("client")
spark.sql("SELECT Categ, count(Categ), sum(Amnt) FROM client GROUP BY Categ").show()
Answer 1: My example differs from yours, but several aggregate functions can be combined in one agg call like this. Adapt it to your data accordingly.
// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
val spark: SparkSession = SparkSession
  .builder.master("local")
  .appName("MyGroup")
  .getOrCreate()
import spark.implicits._
val client: DataFrame = spark.sparkContext.parallelize(
  Seq((1,"A",10),(2,"A",5),(3,"B",56))
).toDF("ID","Categ","Amnt")
client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()
+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    B|       56|        1|
|    A|       15|        2|
+-----+---------+---------+
Comments:
Can we use reduceBy instead of groupBy?
Answer 2: You can aggregate the given table as follows:
client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()
+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    A|       15|        2|
|    B|       56|        1|
+-----+---------+---------+
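On the reduceBy question raised in the comments above: as a rough sketch (not part of the original answers), the same per-category count and sum can be computed with reduceByKey on the underlying RDD. It assumes the client data from the question; the app name and the names countAndSum, cnt and total are placeholders. It is meant to be pasted into spark-shell or run as a script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder.master("local").appName("ReduceByKeySketch").getOrCreate()
import spark.implicits._

val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

val countAndSum = client
  .select($"Categ", $"Amnt").as[(String, Int)]
  .rdd
  // Turn every row into (category, (1, amount)) so one reduce yields both values.
  .map { case (categ, amnt) => (categ, (1L, amnt.toLong)) }
  // Merge the partial (count, sum) pairs of each category.
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .map { case (categ, (cnt, total)) => (categ, cnt, total) }
  .toDF("Categ", "count", "sum")

countAndSum.show()
```

For DataFrames, groupBy followed by agg is usually the more idiomatic choice, since it stays inside the DataFrame API and lets Spark plan the aggregation itself; the RDD version is shown only to answer the question.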
Answer 3: There are several ways to apply aggregate functions in Spark:
val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
1.
val aggdf = client.groupBy('Categ).agg(Map("ID"->"count","Amnt"->"sum"))
+-----+---------+---------+
|Categ|count(ID)|sum(Amnt)|
+-----+---------+---------+
|B    |1        |56       |
|A    |2        |15       |
+-----+---------+---------+
//Rename and sort as needed.
aggdf.sort('Categ).withColumnRenamed("count(ID)","Count").withColumnRenamed("sum(Amnt)","sum")
+-----+-----+---+
|Categ|Count|sum|
+-----+-----+---+
|A    |2    |15 |
|B    |1    |56 |
+-----+-----+---+
2.
import org.apache.spark.sql.functions._
client.groupBy('Categ).agg(count("ID").as("count"),sum("Amnt").as("sum"))
+-----+-----+---+
|Categ|count|sum|
+-----+-----+---+
|B    |1    |56 |
|A    |2    |15 |
+-----+-----+---+
3.
import com.google.common.collect.ImmutableMap;
client.groupBy('Categ).agg(ImmutableMap.of("ID", "count", "Amnt", "sum"))
+-----+---------+---------+
|Categ|count(ID)|sum(Amnt)|
+-----+---------+---------+
|B    |1        |56       |
|A    |2        |15       |
+-----+---------+---------+
// Rename the columns if required.
4. If you are comfortable with SQL, you can also do this:
client.createOrReplaceTempView("df")
val aggdf = spark.sql("select Categ, count(ID),sum(Amnt) from df group by Categ")
aggdf.show()
+-----+---------+---------+
|Categ|count(ID)|sum(Amnt)|
+-----+---------+---------+
|    B|        1|       56|
|    A|        2|       15|
+-----+---------+---------+
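The result tables in this thread list the groups in different orders (B before A in some, A before B in others), because Spark does not guarantee row order after a groupBy. As a minimal sketch, assuming the same client data as in the question, the following produces the column layout the question asked for (Categ, count, sum(Amnt)) and sorts explicitly. The app name and the name result are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder.master("local").appName("CountAndSumSketch").getOrCreate()
import spark.implicits._

val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

val result = client
  .groupBy("Categ")
  // Alias only the count; sum("Amnt") keeps its default name "sum(Amnt)".
  .agg(count("ID").as("count"), sum("Amnt"))
  // Sort explicitly, since group order is otherwise unspecified.
  .orderBy("Categ")

result.show()
```

Note that count("ID") counts only rows where ID is not null, which matches the row count here because every row has an ID.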