Summary of aggregation functions in Spark

Posted by 通凡



Functions in PairRDDFunctions:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Aggregate the values of each key, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
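A minimal sketch of aggregateByKey, assuming a SparkContext named sc (as in spark-shell); the keys and values are made up for illustration. Here V = Int and U = (Int, Int), so the result type differs from the value type, which is exactly what aggregateByKey allows:

    // V = Int (a score), U = (Int, Int) (running sum and count).
    val scores = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5), ("b", 7), ("b", 9)))

    val sumCount = scores.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),        // seqOp: fold a V into a U within a partition
      (u1, u2) => (u1._1 + u2._1, u1._2 + u2._2)   // combOp: merge two U's across partitions
    )

    val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
    avgByKey.collect().foreach(println)            // (a,2.0), (b,7.0)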

def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.
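A minimal word-count sketch of reduceByKey under the same spark-shell assumption; `_ + _` is associative and commutative, so partial sums can be computed map-side before the shuffle:

    val words = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)   // (spark,3), (rdd,1), (shuffle,1)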

def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]

Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a Map. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
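A small sketch of reduceByKeyLocally, again assuming a SparkContext sc; note that the result is a plain scala.collection.Map on the driver rather than an RDD, so it only makes sense when the number of distinct keys fits comfortably in driver memory:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // Returns a local Map on the driver instead of a distributed RDD.
    val localMap: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
    println(localMap)   // Map(a -> 4, b -> 2)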

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.

  • Note

    As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.

    This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
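To make the note above concrete, a sketch (same spark-shell assumption, illustrative data, using the no-argument overload of groupByKey) contrasting groupByKey with reduceByKey for a per-key sum:

    val sales = sc.parallelize(Seq(("2024-01", 10), ("2024-01", 20), ("2024-02", 5)))

    // Expensive: every value is shuffled and all values of a key are held in an Iterable.
    val viaGroup = sales.groupByKey().mapValues(_.sum)

    // Preferable for an aggregation: partial sums are computed map-side before the shuffle.
    val viaReduce = sales.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)   // (2024-01,30), (2024-02,5)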

def combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]

Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the existing partitioner/parallelism level. This method is here for backward compatibility. It does not provide combiner classtag information to the shuffle.

  • See also

    combineByKeyWithClassTag

def combineByKeyWithClassTag[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C)(implicit ct: ClassTag[C]): RDD[(K, C)]

Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the existing partitioner/parallelism level.

  • Annotations

    @Experimental()
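A sketch of combineByKey under the same spark-shell assumption, with C = List[Int] so each of the three functions is visible: createCombiner builds a C from the first V seen for a key in a partition, mergeValue folds further V's into it, and mergeCombiners merges the per-partition C's:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    val combined = pairs.combineByKey(
      (v: Int) => List(v),                         // createCombiner: V => C
      (acc: List[Int], v: Int) => v :: acc,        // mergeValue: (C, V) => C
      (c1: List[Int], c2: List[Int]) => c1 ::: c2  // mergeCombiners: (C, C) => C
    )
    combined.collect().foreach(println)            // e.g. (a,List(1, 3)), (b,List(2, 4))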
