The difference between groupBy and groupByKey in Spark
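Let's first look at how the Spark API documentation describes these two methods. groupBy is a transformation available on JavaRDD, and it can operate on both plain RDDs and pair RDDs. groupByKey is also a transformation (not an action); it is a method of JavaPairRDD, so it only operates on pair RDDs. The concrete difference is in what each one returns.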
groupBy
<U> JavaPairRDD<U,java.lang.Iterable<T>> groupBy(Function<T,U> f)
Return an RDD of grouped elements. Each group consists of a key and a sequence of elements mapping to that key.
Parameters:
f - (undocumented)
Returns:
(undocumented)
groupByKey
public JavaPairRDD<K,java.lang.Iterable<V>> groupByKey()
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with the existing partitioner/parallelism level.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using JavaPairRDD.reduceByKey or JavaPairRDD.combineByKey will provide much better performance.
Returns:
(undocumented)
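To make the difference in return values concrete, here is a minimal local sketch in plain Python. It is not the Spark API: `group_by` and `group_by_key` are hypothetical stand-ins that only imitate the grouping semantics described above on an in-memory list, with no partitioning or lazy evaluation.

```python
from collections import defaultdict


def group_by(elements, f):
    """Local stand-in for groupBy(f): the key is computed from each element,
    and the whole element is kept inside its group."""
    groups = defaultdict(list)
    for element in elements:
        groups[f(element)].append(element)
    return dict(groups)


def group_by_key(pairs):
    """Local stand-in for groupByKey(): each element is already a (key, value)
    pair, and only the value is kept inside the group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


pairs = [("00001", "sku933"), ("00001", "sku022"), ("00002", "sku010")]

# Groups hold the original pairs, keyed by whatever the function returns.
by_func = group_by(pairs, lambda pair: pair[0])

# Groups hold only the values, keyed by the first item of each pair.
by_key = group_by_key(pairs)

print(by_func)
print(by_key)
```

With the same input, the groupBy-style result keeps the complete `(key, value)` pairs in each group, while the groupByKey-style result keeps only the values.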
[Spark][Python] groupByKey example
A continuation of the [Spark][Python] sortByKey example:
In [29]: mydata003.collect()
Out[29]:
[[u'00001', u'sku933'],
[u'00001', u'sku022'],
[u'00001', u'sku912'],
[u'00001', u'sku331'],
[u'00002', u'sku010'],
[u'00003', u'sku888'],
[u'00004', u'sku411']]
In [30]: mydata005=mydata003.groupByKey()
In [32]: mydata005.count()
Out[32]: 4
In [33]: mydata005.collect()
Out[33]:
[(u'00004', <pyspark.resultiterable.ResultIterable at 0x7fcebe436b10>),
(u'00001', <pyspark.resultiterable.ResultIterable at 0x7fcebe436850>),
(u'00003', <pyspark.resultiterable.ResultIterable at 0x7fcebe436050>),
(u'00002', <pyspark.resultiterable.ResultIterable at 0x7fcebe4361d0>)]
So, for pairs like these:
(00004,sku411)
(00003,sku888)
(00003,sku022)
(00003,sku010)
(00003,sku594)
(00002,sku912)
groupByKey conceptually turns them into something of this form:
(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
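How can we print all of them in the format below? I think it needs a function that treats the value of each row in the RDD as a list and iterates over it.
(To be written next time.)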
00002
sku912
sku331
00001
sku022
sku010
sku933
00003
sku888
sku022
sku010
sku594
00004
sku411
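One possible way to get the layout shown above is sketched here. It assumes the grouped result has already been brought to the driver as a list of `(key, values)` pairs, for example from `mydata005.collect()`. The helper name `format_groups` is hypothetical, and the sample list is only a plain-Python stand-in for the grouped data, so the sketch runs without Spark.

```python
def format_groups(grouped):
    """Return one line for each key, followed by one line per value in its group."""
    lines = []
    for key, values in grouped:
        lines.append(str(key))
        # The values only need to be iterable; a ResultIterable from
        # groupByKey() can be looped over in the same way as a list.
        for value in values:
            lines.append(str(value))
    return lines


# Stand-in for the collected groupByKey() result (assumed sample data).
grouped = [
    ("00002", ["sku912", "sku331"]),
    ("00001", ["sku022", "sku010", "sku933"]),
    ("00003", ["sku888", "sku022", "sku010", "sku594"]),
    ("00004", ["sku411"]),
]

print("\n".join(format_groups(grouped)))
```

Collecting to the driver is fine for a small data set like this one. For large data it would probably be better to format the groups inside a transformation and write the result out instead.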