How can we combine the values of two columns of the same data type in a DataFrame and get the count of each element?
Posted: 2020-05-15 16:31:28

Question:

val data = Seq(
("India","Pakistan","India"),
("Australia","India","India"),
("New Zealand","Zimbabwe","New Zealand"),
("West Indies", "Bangladesh","Bangladesh"),
("Sri Lanka","Bangladesh","Bangladesh"),
("Sri Lanka","Bangladesh","Bangladesh"),
("Sri Lanka","Bangladesh","Bangladesh")
)
val df = data.toDF("Team_1","Team_2","Winner")
I have this DataFrame. I want to know how many matches each team has played, i.e. how many times each team appears in either Team_1 or Team_2.
Answer 1:

The answers discussed above cover three approaches. I tried to evaluate them by the time taken/elapsed (for education/awareness purposes only)...
import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Katu_37 extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate()

  import spark.implicits._

  val data = Seq(
    ("India", "Pakistan", "India"),
    ("Australia", "India", "India"),
    ("New Zealand", "Zimbabwe", "New Zealand"),
    ("West Indies", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh")
  )
  val df = data.toDF("Team_1", "Team_2", "Winner")
  df.show()

  exec {
    println("METHOD 1")
    df.select("Team_1").union(df.select("Team_2")).groupBy("Team_1").agg(count("Team_1")).show()
  }

  exec {
    println("METHOD 2")
    df.select(array($"Team_1", $"Team_2").as("Team")).select("Team").withColumn("Team", explode($"Team")).groupBy("Team").agg(count("Team")).show()
  }

  exec {
    println("METHOD 3")
    val matchesCount = df.selectExpr("Team_1 as Teams").union(df.selectExpr("Team_2 as Teams"))
    matchesCount.groupBy("Teams").count().withColumnRenamed("count", "MatchesPlayed").show()
  }

  /**
    * Evaluates the call-by-name block `f`, prints its result,
    * and prints the elapsed wall-clock time in nanoseconds.
    */
  def exec[T](f: => T): Unit = {
    val startTime = System.nanoTime()
    println("t = " + f)
    val endTime = System.nanoTime()
    val elapsedTime = endTime - startTime
    // import java.util.concurrent.TimeUnit
    // val elapsedSeconds = TimeUnit.NANOSECONDS.toSeconds(elapsedTime)
    println("time Elapsed " + elapsedTime)
  }
}
Output:
+-----------+----------+-----------+
|     Team_1|    Team_2|     Winner|
+-----------+----------+-----------+
|      India|  Pakistan|      India|
|  Australia|     India|      India|
|New Zealand|  Zimbabwe|New Zealand|
|West Indies|Bangladesh| Bangladesh|
|  Sri Lanka|Bangladesh| Bangladesh|
|  Sri Lanka|Bangladesh| Bangladesh|
|  Sri Lanka|Bangladesh| Bangladesh|
+-----------+----------+-----------+
METHOD 1
+-----------+-------------+
|     Team_1|count(Team_1)|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+
t = ()
time Elapsed 2729302088
METHOD 2
+-----------+-----------+
|       Team|count(Team)|
+-----------+-----------+
|  Sri Lanka|          3|
|      India|          2|
|West Indies|          1|
| Bangladesh|          4|
|   Zimbabwe|          1|
|New Zealand|          1|
|  Australia|          1|
|   Pakistan|          1|
+-----------+-----------+
t = ()
time Elapsed 646513918
METHOD 3
+-----------+-------------+
|      Teams|MatchesPlayed|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+
t = ()
time Elapsed 988510662
I observed that the org.apache.spark.sql.functions.array approach took less time (646513918 ns) than the union approach...
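As a side note (not part of the original answer): Spark 2.1+ also ships a built-in spark.time helper that does roughly what the custom exec above does, except that it reports milliseconds. A minimal sketch, reusing the df defined earlier:

// spark.time evaluates the block, prints "Time taken: ... ms", and returns the result.
spark.time {
  df.select(explode(array($"Team_1", $"Team_2")).as("Team"))
    .groupBy("Team")
    .count()
    .show()
}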
Comments:

If this was useful, would you like to vote?
Just wondering, what is the difference between your exec and the built-in spark.time function?
They do almost the same thing, but spark.time reports milliseconds while this one reports nanoseconds, and you can customize it. I have been working with Spark since 1.3 (old club member :-)), when I didn't have the luxury of spark.time, so I got used to doing it this way.

Answer 2:

val matchesCount = df.selectExpr("Team_1 as Teams").union(df.selectExpr("Team_2 as Teams"))
matchesCount.groupBy("Teams").count().withColumnRenamed("count","MatchesPlayed").show()
+-----------+-------------+
|      Teams|MatchesPlayed|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+
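A small follow-up sketch (not from the original answer): if you want the busiest teams listed first, you can sort the same matchesCount result before showing it.

// Hypothetical extension: order teams by matches played, descending.
matchesCount.groupBy("Teams")
  .count()
  .withColumnRenamed("count", "MatchesPlayed")
  .orderBy(desc("MatchesPlayed"))
  .show()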
Answer 3:

You can either union the two columns with select statements, or build an array with org.apache.spark.sql.functions.array and explode it:

// METHOD 1
df.select("Team_1").union(df.select("Team_2")).groupBy("Team_1").agg(count("Team_1")).show()
// METHOD 2
df.select(array($"Team_1", $"Team_2").as("Team")).select("Team").withColumn("Team",explode($"Team")).groupBy("Team").agg(count("Team")).show()
Using select statements with union:
+-----------+-------------+
|     Team_1|count(Team_1)|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+
Time Elapsed : 1588835600
Using array:
+-----------+-----------+
|       Team|count(Team)|
+-----------+-----------+
|  Sri Lanka|          3|
|      India|          2|
|West Indies|          1|
| Bangladesh|          4|
|   Zimbabwe|          1|
|New Zealand|          1|
|  Australia|          1|
|   Pakistan|          1|
+-----------+-----------+
Time Elapsed : 342103600
Using org.apache.spark.sql.functions.array performs better.
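A plausible explanation (my assumption, not verified in the original post): the union approach scans the source twice, once per column, while array + explode scans it once. You can check this yourself by comparing the physical plans with explain():

// Compare the physical plans of the two approaches (sketch).
df.select("Team_1").union(df.select("Team_2"))
  .groupBy("Team_1").count()
  .explain() // union: typically shows two scans of the source

df.select(explode(array($"Team_1", $"Team_2")).as("Team"))
  .groupBy("Team").count()
  .explain() // array + explode: a single scan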
Comments:

The array approach takes less time than the union approach. Check my answer.