How can we combine values from two columns of a dataframe of the same data type and get the count of each element?


Posted: 2020-05-15 16:31:28

Question:
val data = Seq(
  ("India","Pakistan","India"),
  ("Australia","India","India"),
  ("New Zealand","Zimbabwe","New Zealand"),
  ("West Indies", "Bangladesh","Bangladesh"),
  ("Sri Lanka","Bangladesh","Bangladesh"),
  ("Sri Lanka","Bangladesh","Bangladesh"),
  ("Sri Lanka","Bangladesh","Bangladesh")
)
val df = data.toDF("Team_1","Team_2","Winner")

I have this dataframe, and I want to know how many matches each team has played.
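Before reaching for Spark, the expected result can be sanity-checked with a plain-Scala sketch over the same data: flatten both team columns into one sequence and count occurrences (the name `matchesPlayed` is illustrative, not from the question).

```scala
// Plain-Scala sketch of the desired result: count how many matches
// each team appears in across Team_1 and Team_2 (no Spark needed).
val data = Seq(
  ("India", "Pakistan", "India"),
  ("Australia", "India", "India"),
  ("New Zealand", "Zimbabwe", "New Zealand"),
  ("West Indies", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh")
)

// Flatten both team columns into one list, then count occurrences per team.
val matchesPlayed: Map[String, Int] =
  data.flatMap { case (t1, t2, _) => Seq(t1, t2) }
      .groupBy(identity)
      .map { case (team, occurrences) => team -> occurrences.size }

matchesPlayed.toSeq.sortBy(-_._2).foreach { case (team, n) => println(s"$team: $n") }
```

This mirrors what the union-based Spark answers below compute distributedly.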

Comments:

Answer 1:

The answers discussed here use 3 approaches, which I tried to evaluate by elapsed time (for education/awareness only)...


import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Katu_37 extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  import spark.implicits._

  val data = Seq(
    ("India", "Pakistan", "India"),
    ("Australia", "India", "India"),
    ("New Zealand", "Zimbabwe", "New Zealand"),
    ("West Indies", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh")
  )
  val df = data.toDF("Team_1", "Team_2", "Winner")
  df.show()

  exec {
    println("METHOD 1")
    df.select("Team_1").union(df.select("Team_2")).groupBy("Team_1").agg(count("Team_1")).show()
  }

  exec {
    println("METHOD 2")
    df.select(array($"Team_1", $"Team_2").as("Team")).select("Team").withColumn("Team", explode($"Team")).groupBy("Team").agg(count("Team")).show()
  }

  exec {
    println("METHOD 3")
    val matchesCount = df.selectExpr("Team_1 as Teams").union(df.selectExpr("Team_2 as Teams"))
    matchesCount.groupBy("Teams").count().withColumnRenamed("count", "MatchesPlayed").show()
  }

  /**
    * Evaluates the given block and prints the elapsed time in nanoseconds.
    *
    * @param f the block to time (by-name parameter)
    * @tparam T the block's result type
    */
  def exec[T](f: => T) = {
    val startTime = System.nanoTime()
    println("t = " + f)
    val endTime = System.nanoTime()
    val elapsedTime = endTime - startTime
//    import java.util.concurrent.TimeUnit
//    val convertToSeconds = TimeUnit.MINUTES.convert(elapsedTime, TimeUnit.NANOSECONDS)
    println("time Elapsed " + elapsedTime)
  }
}

Results:

+-----------+----------+-----------+
|     Team_1|    Team_2|     Winner|
+-----------+----------+-----------+
|      India|  Pakistan|      India|
|  Australia|     India|      India|
|New Zealand|  Zimbabwe|New Zealand|
|West Indies|Bangladesh| Bangladesh|
|  Sri Lanka|Bangladesh| Bangladesh|
|  Sri Lanka|Bangladesh| Bangladesh|
|  Sri Lanka|Bangladesh| Bangladesh|
+-----------+----------+-----------+

METHOD 1 
+-----------+-------------+
|     Team_1|count(Team_1)|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+

t = ()
time Elapsed 2729302088
METHOD 2 
+-----------+-----------+
|       Team|count(Team)|
+-----------+-----------+
|  Sri Lanka|          3|
|      India|          2|
|West Indies|          1|
| Bangladesh|          4|
|   Zimbabwe|          1|
|New Zealand|          1|
|  Australia|          1|
|   Pakistan|          1|
+-----------+-----------+

t = ()
time Elapsed 646513918
METHOD 3 
+-----------+-------------+
|      Teams|MatchesPlayed|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+

t = ()
time Elapsed 988510662

I observed that the org.apache.spark.sql.functions.array approach takes less time (646513918 ns) than the union approach...
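As the comments below note, spark.time reports milliseconds while the exec helper above reports nanoseconds. A plain-Scala variant (the name `timed` is hypothetical) that returns the block's result together with the elapsed milliseconds might look like this:

```scala
// Times a by-name block and returns (result, elapsed milliseconds).
// Unlike the exec helper above, the caller gets the result back
// instead of having it printed.
def timed[T](f: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = f // force the by-name argument exactly once
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

val (total, ms) = timed { (1 to 1000).map(_.toLong).sum }
println(s"result = $total, took $ms ms")
```

Wrapping each of the three methods in `timed { ... }` would give the same comparison in milliseconds.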

Comments:

If this was useful, do you want to upvote?

Just wondering, what is the difference between your exec and the built-in spark.time function?

They do almost the same thing, but spark.time reports milliseconds while this one reports nanoseconds and can be customized. I have been working with Spark since 1.3 (old-club member :-)), when I did not have the luxury of spark.time, so I used to follow this approach.

Answer 2:
val matchesCount = df.selectExpr("Team_1 as Teams").union(df.selectExpr("Team_2 as Teams"))
matchesCount.groupBy("Teams").count().withColumnRenamed("count","MatchesPlayed").show()

+-----------+-------------+
|      Teams|MatchesPlayed|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+

Comments:

Answer 3:

You can either use a union with select statements, or use an array from org.apache.spark.sql.functions.array:
// METHOD 1
df.select("Team_1").union(df.select("Team_2")).groupBy("Team_1").agg(count("Team_1")).show()

// METHOD 2
df.select(array($"Team_1", $"Team_2").as("Team")).select("Team").withColumn("Team", explode($"Team")).groupBy("Team").agg(count("Team")).show()

Using select statements with union:

+-----------+-------------+
|     Team_1|count(Team_1)|
+-----------+-------------+
|  Sri Lanka|            3|
|      India|            2|
|West Indies|            1|
| Bangladesh|            4|
|   Zimbabwe|            1|
|New Zealand|            1|
|  Australia|            1|
|   Pakistan|            1|
+-----------+-------------+

Time Elapsed : 1588835600

Using array:

+-----------+-----------+
|       Team|count(Team)|
+-----------+-----------+
|  Sri Lanka|          3|
|      India|          2|
|West Indies|          1|
| Bangladesh|          4|
|   Zimbabwe|          1|
|New Zealand|          1|
|  Australia|          1|
|   Pakistan|          1|
+-----------+-----------+

Time Elapsed : 342103600

Using org.apache.spark.sql.functions.array performs better.

Comments:

The array approach takes less time than the union approach. Check my answer.
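As a side note, the same grouping idea applies to the Winner column if win counts per team are also wanted. A plain-Scala sketch over the question's data (the name `wins` is illustrative; in Spark this would be `df.groupBy("Winner").count()`):

```scala
// Count wins per team by grouping on the third field (Winner).
val data = Seq(
  ("India", "Pakistan", "India"),
  ("Australia", "India", "India"),
  ("New Zealand", "Zimbabwe", "New Zealand"),
  ("West Indies", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh")
)

val wins: Map[String, Int] =
  data.groupBy(_._3).map { case (team, rows) => team -> rows.size }

wins.foreach { case (team, n) => println(s"$team won $n") }
```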
