04 Summary of Common RDD Operations

Posted 2020-09-30


Common transformations

Note: some functions exist only on pair RDDs and not on ordinary RDDs. Examples are groupByKey, reduceByKey, sortByKey, join and cogroup, which group by key or operate directly on keys.

RDD[U] map(f: T => U)

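T: element type of the source RDD

U: element type of the new RDD

The function converts each element of type T into a new element of type U.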

rdd.map(x => x + 1)

{1, 2, 3, 3}

=>{2, 3, 4, 4}

RDD[(K, U)] mapValues[U](f: V => U)

K: key type

V: value type

Converts each value into a new element of type U; the key is unchanged.

rdd.mapValues(_ + 1)

{("class1", 80), ("class2", 70)}

=>{("class1", 81), ("class2", 71)}

RDD[U] flatMap(f: T => TraversableOnce[U])

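TraversableOnce: the common parent type of collections and iterators

The function converts each element of type T into a collection of elements of the new type U; these collections are then flattened (two levels become one), and the resulting elements form the new RDD.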

rdd.flatMap(x => x.to(3))

{1, 2, 3, 3}

=>{1, 2, 3, 2, 3, 3, 3}
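To make the "two levels become one" step concrete, here is a minimal sketch that uses a plain Scala List as a stand-in for the RDD. It is a local model only and does not call Spark; the object name FlatMapModel is hypothetical.

```scala
object FlatMapModel {
  // A plain List stands in for the RDD (assumption: local model, not Spark).
  val input: List[Int] = List(1, 2, 3, 3)

  // map produces one collection per input element, so the result is nested.
  val nested: List[List[Int]] = input.map(x => x.to(3).toList)

  // flatMap applies the same function and then flattens one level.
  val flat: List[Int] = input.flatMap(x => x.to(3))

  def main(args: Array[String]): Unit = {
    println(nested)
    println(flat)
  }
}
```

Flattening the result of map gives the same elements as flatMap, which is the relationship the description above relies on.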

RDD[T] filter(f: T => Boolean)

The function tests each element; the elements that pass form the new RDD.

rdd.filter(x => x != 1)

{1, 2, 3, 3}

=>{2, 3, 3}

RDD[T] distinct()

Removes duplicate elements.

rdd.distinct()

{1, 2, 3, 3}

=>{1, 2, 3}

RDD[(K, Iterable[V])] groupByKey()

Groups by key. The values of each group form an Iterable<V>, and the new RDD's elements are (key, Iterable<V>) tuples.

rdd.groupByKey()

{("class1", 80), ("class2", 75), ("class1", 90), ("class2", 60)}

=>{("class1",[80,90]),("class2",[75,60])}

RDD[(K, Iterable[T])] groupBy(f: T => K)

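T: element type of the source RDD

K: key type of the elements in the new RDD

The function maps each element T to a key K, and the elements are then grouped by that K.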

rdd.groupBy({ case 1 => 1; case 2 => 2; case "二" => 2 })

{ 1, 2, "二" }

=>{(1,[1]),(2,[2, "二"])}

RDD[(K, V)] reduceByKey(func: (V, V) => V)

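Groups by key first, then reduces the values within each group. The first call of the function receives two values belonging to the same key; from the second call on, the first argument is the result of the previous call and the second argument is the next value of that key.

rdd.reduceByKey(_ + _)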

{("class1", 80), ("class2", 75), ("class1", 90), ("class2", 60)}

=>{("class1", 170),("class2", 135)}
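The group-then-reduce behaviour can be sketched with ordinary Scala collections. This is a local model under the assumption that a Seq of pairs stands in for the pair RDD; the helper name reduceByKey below is a hypothetical re-implementation, not Spark's method, and it ignores partitioning entirely.

```scala
object ReduceByKeyModel {
  // Local stand-in: group the pairs by key, then reduce each group's values.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (key, group) =>
      (key, group.map(_._2).reduce(func))
    }

  val scores: Seq[(String, Int)] =
    Seq(("class1", 80), ("class2", 75), ("class1", 90), ("class2", 60))

  val totals: Map[String, Int] = reduceByKey(scores)(_ + _)

  def main(args: Array[String]): Unit = println(totals)
}
```

Because the order in which values of one key are combined is not fixed in a distributed run, the function passed in should be associative and commutative, as `_ + _` is here.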

RDD[(K, V)] sortByKey()

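Sorts the elements by key. Note that it does not group by key and then sort within each group; it sorts directly by the key value. Passing false sorts in descending order.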

rdd.sortByKey(false)

{(65, "leo"), (50, "tom"),(100, "marry"), (85, "jack")}

=>{(100, "marry"),(85, "jack"),(65, "leo"),(50, "tom")}

RDD[T] sortBy(f: (T) => K, ascending: Boolean, numPartitions: Int)

Sorts by the value produced by the (T) => K conversion function that is passed in.

rdd.sortBy(_._2, false, 1)

Here the elements are sorted by value in descending order.

{("leo", 65), ("tom", 50), ("marry", 100), ("jack", 80)}

=>{("marry", 100),("jack", 80),("leo", 65),("tom", 50)}

RDD[(K, (V, W))] join(other: RDD[(K, W)])

W: value type of the other RDD's elements

Joins two RDDs of <key, value> pairs by key. Each result element has the form (key, (value, otherValue)).

rdd.join(otherRdd)

{(1, "leo"),(2, "jack"),(3, "tom")}

{(1, 100), (2, 90), (3, 60), (1, 70), (2, 80), (3, 50)}

=>{(1,("leo",100)),(1,("leo",70)),(2,("jack",90)),(2,("jack",80)),(3,("tom",60)),(3,("tom",50))}
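The pairing rule of join can be modelled locally with Scala collections. The sketch below is an assumption-laden stand-in (two Seqs instead of two pair RDDs, a hypothetical helper named join) that emits one output element for every matching left/right combination of a key.

```scala
object JoinModel {
  // Local stand-in for an inner join: for every left pair, emit one result
  // per right pair that carries the same key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    left.flatMap { case (k, v) =>
      right.collect { case (k2, w) if k2 == k => (k, (v, w)) }
    }

  val names: Seq[(Int, String)] = Seq((1, "leo"), (2, "jack"), (3, "tom"))
  val scores: Seq[(Int, Int)] =
    Seq((1, 100), (2, 90), (3, 60), (1, 70), (2, 80), (3, 50))

  val joined: Seq[(Int, (String, Int))] = join(names, scores)

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

A key that appears on only one side produces no output in this model, which matches inner-join semantics.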

RDD[(K, (Iterable[V], Iterable[W]))] cogroup(other: RDD[(K, W)])

Like join, it matches elements by key, but for each key the values from each RDD are collected into their own Iterable<value>.

rdd.cogroup(otherRdd)

{(1, "leo"),(2, "jack"),(3, "tom")}

{(1, 100), (2, 90), (3, 60), (1, 70), (2, 80), (3, 50)}

=>{(1,(["leo"],[100,70])),(2,(["jack"],[90,80])),(3,(["tom"],[60,50]))}
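For comparison with join, cogroup can be modelled locally as follows. This is a sketch on plain Scala Seqs; the helper name cogroup is a hypothetical re-implementation rather than Spark's method, and the result is returned as a Map purely for easy inspection.

```scala
object CogroupModel {
  // Local stand-in: for every key seen on either side, collect that key's
  // values from the left and from the right into two separate sequences.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftValues = left.filter(_._1 == k).map(_._2)
      val rightValues = right.filter(_._1 == k).map(_._2)
      (k, (leftValues, rightValues))
    }.toMap
  }

  val names: Seq[(Int, String)] = Seq((1, "leo"), (2, "jack"), (3, "tom"))
  val scores: Seq[(Int, Int)] =
    Seq((1, 100), (2, 90), (3, 60), (1, 70), (2, 80), (3, 50))

  val grouped: Map[Int, (Seq[String], Seq[Int])] = cogroup(names, scores)

  def main(args: Array[String]): Unit = grouped.foreach(println)
}
```

Unlike the join model, a key present on only one side still appears here, paired with an empty sequence for the other side.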

RDD[T] union(other: RDD[T])

Union of the two RDDs, keeping duplicate elements.

rdd.union(otherRdd)

{ 1, 2, 2, 3, 3}

{ 3, 4, 5}

=>{1, 2, 2, 3, 3, 3, 4, 5}

RDD[T] intersection(other: RDD[T])

Intersection of the two RDDs; the result contains no duplicates.

rdd.intersection(otherRdd)

{ 1, 2, 2, 3, 3}

{ 3, 4, 5}

=>{3}

RDD[T] subtract(other: RDD[T])

Difference of the two RDDs: the elements of this RDD that do not appear in the other one.

rdd.subtract(otherRdd)

{ 1, 2, 2, 3, 3}

{ 3, 4, 5}

=>{1, 2, 2}

RDD[(T, U)] cartesian(other: RDD[U])

Cartesian product of the two RDDs.

rdd.cartesian(otherRdd)

{ 1, 2 }

{ 3, 4}

=>{(1,3),(1,4),(2,3),(2,4)}

RDD[U] mapPartitions(f: Iterator[T] => Iterator[U])

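Like map, but the transformation works per partition: all elements of a partition are wrapped in an Iterator and passed to the function in a single call, instead of the function being called once per element as with map. The function is therefore called once per partition.

With N elements in M partitions, the function passed to map is called N times, while the function passed to mapPartitions is called M times.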

val arr = Array(1, 2, 3, 4, 5)

val rdd = sc.parallelize(arr, 2)

rdd.mapPartitions((it: Iterator[Int]) => it.map(_ * 2))

=>{2, 4, 6, 8, 10}
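The N-versus-M call counts can be checked with a small local model. The sketch below does not use Spark: the slice helper and its index arithmetic are assumptions meant to mirror how a local collection is split into partitions, and the counters are ordinary vars that only work because everything runs in one JVM.

```scala
object MapPartitionsModel {
  // Assumed slicing rule: contiguous chunks with boundaries at i * length / numSlices.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] =
    (0 until numSlices).map { i =>
      val start = (i.toLong * data.length / numSlices).toInt
      val end = ((i + 1).toLong * data.length / numSlices).toInt
      data.slice(start, end)
    }

  val partitions: Seq[Seq[Int]] = slice(Seq(1, 2, 3, 4, 5), 2)

  // map-style: the function runs once per element.
  var mapCalls = 0
  val viaMap: Seq[Seq[Int]] =
    partitions.map(p => p.map { x => mapCalls += 1; x * 2 })

  // mapPartitions-style: the function runs once per partition and
  // receives that partition's elements as an Iterator.
  var mapPartitionsCalls = 0
  val viaMapPartitions: Seq[Seq[Int]] = partitions.map { p =>
    mapPartitionsCalls += 1
    p.iterator.map(_ * 2).toList
  }

  def main(args: Array[String]): Unit =
    println(s"map calls = $mapCalls, mapPartitions calls = $mapPartitionsCalls")
}
```

Both variants yield the same elements; only the number of function invocations differs, which is why mapPartitions suits work with a per-call setup cost.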

RDD[U] mapPartitionsWithIndex(f: (Int, Iterator[T]) => Iterator[U])

Similar to mapPartitions; the difference is that the function takes the partition index as an additional parameter.
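Since no example accompanies this operation, here is a minimal local sketch of how the partition index reaches the function. It models partitions as nested Seqs instead of using Spark; the partition contents and the helper name mapPartitionsWithIndex are assumptions for illustration.

```scala
object MapPartitionsWithIndexModel {
  // Two hypothetical partitions standing in for an RDD split into 2 partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4, 5))

  // Local stand-in: call f once per partition with (partition index, iterator).
  def mapPartitionsWithIndex[T, U](parts: Seq[Seq[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): Seq[Seq[U]] =
    parts.zipWithIndex.map { case (part, index) => f(index, part.iterator).toList }

  // Tag every element with the index of the partition it came from.
  val tagged: Seq[Seq[String]] =
    mapPartitionsWithIndex(partitions)((index, it) => it.map(e => s"p$index:$e"))

  def main(args: Array[String]): Unit = tagged.foreach(println)
}
```

Tagging elements this way is a common trick for inspecting how data is distributed across partitions.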

RDD[Array[T]] glom()

Converts the elements of type T in each partition of the RDD into an array Array[T].

val arr = Array(1, 2, 3, 4, 5)

val rdd = sc.parallelize(arr, 2)

val arrRDD = rdd.glom()

arrRDD.foreach { (arr: Array[Int]) => println("[ " + arr.mkString(" ") + " ]") }

=>[ 1 2 ], [ 3 4 5 ]

Common actions

T reduce(f: (T, T) => T)

Reduces all elements with the given function.

rdd.reduce(_ + _)

{1, 2, 2, 3, 3, 3}

=>14

Array[T] collect()

Returns all elements of the RDD in an array.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

rdd.collect()

{1, 2, 3, 3}

=>[1, 2, 3, 3]

Map[K, V] collectAsMap()

Applies to key-value RDDs. Unlike collect, the result of collectAsMap contains no duplicate keys: for a duplicated key, later elements overwrite earlier ones.

rdd.collectAsMap()

{ ("leo", 65), ("tom", 50), ("tom", 100)}

=>{ ("leo", 65), ("tom", 100)}
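The "later element wins" rule for duplicate keys is the same one Scala's own toMap follows, so it can be illustrated without Spark. The sketch below is only that local analogy; the object name CollectAsMapModel is hypothetical.

```scala
object CollectAsMapModel {
  // A Seq of pairs stands in for the collected pair RDD (local model, not Spark).
  val pairs: Seq[(String, Int)] = Seq(("leo", 65), ("tom", 50), ("tom", 100))

  // Building a Map keeps one entry per key; a later pair replaces an earlier one.
  val asMap: Map[String, Int] = pairs.toMap

  def main(args: Array[String]): Unit = println(asMap)
}
```

If every value of a duplicated key matters, grouping by key is the safer choice than collecting into a map.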

Long count()

Counts the number of elements in the RDD.

rdd.count()

{1, 2, 3, 3}

=>4

Map[T, Long] countByValue()

Counts how many times each element occurs in the RDD.

Note: This method should only be used if the resulting map is expected to be small, as the whole thing is loaded into the driver's memory.

To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which returns an RDD[(T, Long)] instead of a map.

rdd.countByValue()

{1, 2, 3, 3}

=>Map(1 -> 1, 3 -> 2, 2 -> 1)

Map[K, Long] countByKey()

Groups by key first, then counts the number of values in each group.

Note: This method should only be used if the resulting map is expected to be small, as the whole thing is loaded into the driver's memory.

To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which returns an RDD[(K, Long)] instead of a map.

rdd.countByKey()

{ ("leo", 65), ("tom", 50), ("tom", 100), ("tom", 100) }

=>Map(leo -> 1, tom -> 3)
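Both counting actions boil down to "group, then take the group size". The following local sketch shows that on plain Scala Seqs; the helpers countByKey and countByValue are hypothetical re-implementations for illustration and do not involve Spark or the driver.

```scala
object CountModel {
  // Local stand-in for countByKey: group pairs by key and count each group.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (key, group) => (key, group.size.toLong) }

  // Local stand-in for countByValue: group equal elements and count each group.
  def countByValue[T](data: Seq[T]): Map[T, Long] =
    data.groupBy(identity).map { case (value, same) => (value, same.size.toLong) }

  val byKey: Map[String, Long] =
    countByKey(Seq(("leo", 65), ("tom", 50), ("tom", 100), ("tom", 100)))

  val byValue: Map[Int, Long] = countByValue(Seq(1, 2, 3, 3))

  def main(args: Array[String]): Unit = {
    println(byKey)
    println(byValue)
  }
}
```

Note that countByKey ignores the values themselves: ("tom", 100) appearing twice still contributes two to tom's count.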

T first()

Returns the first element; internally it is implemented by calling take(1).

rdd.first()

{3, 2, 1, 4}

=>3

Array[T] take(num: Int)

Returns the first num elements of the RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

rdd.take(2)

{3, 2, 1, 4}

=>[3, 2]

Array[T] top(num: Int ) (implicit ord: Ordering[T])

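If no ord argument is passed, the implicit one is used, and the default implicit Ordering is ascending. A custom Ordering can be passed to override the default. top is implemented by reversing the Ordering and then calling takeOrdered: takeOrdered(num)(ord.reverse)

By default, returns the num largest elements of the RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.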

rdd.top(2)

{3, 2, 1, 4}

=>[4, 3]

Array[T] takeOrdered(num: Int)(implicit ord: Ordering[T])

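If no ord argument is passed, the implicit one is used, and the default implicit Ordering is ascending. A custom Ordering can be passed to override the default.

The opposite of top: by default it returns the num smallest elements.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

rdd.takeOrdered(2)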

{3, 2, 1, 4}

=>[1, 2]
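The relationship between top and takeOrdered, including the effect of a custom Ordering, can be sketched on a local Seq. This is a simplified model that sorts the whole collection, which is an assumption made for clarity and not how a distributed implementation would work; the helper names are hypothetical.

```scala
object TakeOrderedModel {
  // Local stand-in: sort with the given Ordering and keep the first num elements.
  def takeOrdered[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    data.sorted(ord).take(num)

  // top is takeOrdered with the Ordering reversed.
  def top[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(data, num)(ord.reverse)

  val data: Seq[Int] = Seq(3, 2, 1, 4)

  val smallest: Seq[Int] = takeOrdered(data, 2) // default ascending Ordering
  val largest: Seq[Int] = top(data, 2)

  // Passing a custom Ordering explicitly overrides the implicit default.
  val customOrdered: Seq[Int] = takeOrdered(data, 2)(Ordering[Int].reverse)

  def main(args: Array[String]): Unit = {
    println(smallest)
    println(largest)
    println(customOrdered)
  }
}
```

With a reversed Ordering, takeOrdered returns the same elements as top with the default one.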

T fold(zeroValue: T)(op: (T, T) => T)

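The initial value must have the same type as the elements.

Same as reduce(), but an initial value must be provided. Note: within each partition, op is applied starting from zeroValue, and when the per-partition results are merged, zeroValue is again used as the starting value.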

rdd.fold(5)(_ + _)

val arr = Array(1, 2, 3, 4);

val rdd = sc.parallelize(arr, 2) // split into two partitions: [1,2] [3,4]

println(rdd.fold(5)((v1, v2) => { println("v1 = " + v1 + " ; v2 = " + v2); v1 + v2 }))

=>

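v1 = 5 ; v2 = 1 // computation in the first partition

v1 = 6 ; v2 = 2

=============

v1 = 5 ; v2 = 3 // computation in the second partition

v1 = 8 ; v2 = 4

=============

v1 = 5 ; v2 = 8 // merging the first partition's result

=============

v1 = 13 ; v2 = 12 // merging the second partition's result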

=============

25
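The trace above can be reproduced with a local model in which partitions are nested Seqs. The sketch below is such a model under that assumption (the helper name fold is a hypothetical re-implementation, not Spark's method); it shows that zeroValue is applied once per partition plus once more during the merge.

```scala
object FoldModel {
  // Local stand-in: fold each partition from zeroValue, then fold the
  // per-partition results, again starting from zeroValue.
  def fold[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T = {
    val partials = partitions.map(_.foldLeft(zeroValue)(op))
    partials.foldLeft(zeroValue)(op)
  }

  // Two partitions [1,2] and [3,4]: (5+1+2) and (5+3+4), then 5 + 8 + 12.
  val twoPartitions: Int = fold(Seq(Seq(1, 2), Seq(3, 4)), 5)(_ + _)

  // Same data in a single partition: (5+1+2+3+4), then 5 + 15.
  val onePartition: Int = fold(Seq(Seq(1, 2, 3, 4)), 5)(_ + _)

  def main(args: Array[String]): Unit = {
    println(twoPartitions)
    println(onePartition)
  }
}
```

Because the result changes with the number of partitions when zeroValue is not neutral for op, a neutral value such as 0 for addition is the usual choice.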

U aggregate (zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)

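The initial value's type may differ from the element type, and it determines the return type.

Like fold, an initial value is required. The difference is that the per-partition function (seqOp) and the partition-merging function (combOp) are separate, whereas fold uses the same function for both.

rdd.aggregate(5)(_ + _, _ + _)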

val arr = Array(1, 2, 3, 4);

val rdd = sc.parallelize(arr, 2)

println(rdd.aggregate(5)(

(v1, v2) => { println("v1 = " + v1 + " ; v2 = " + v2); v1 + v2 },

(v1, v2) => { println("v1 = " + v1 + " ; v2 = " + v2); v1 + v2 })

)

The process and the result are the same as for the fold example above.
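The case where the result type U differs from the element type T is easier to see with a (sum, count) accumulator. The sketch below is a local model with partitions as nested Seqs; the helper name aggregate is a hypothetical re-implementation, not Spark's method.

```scala
object AggregateModel {
  // Local stand-in: seqOp folds elements into an accumulator within each
  // partition; combOp merges the per-partition accumulators.
  def aggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zeroValue)(seqOp)).foldLeft(zeroValue)(combOp)

  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))

  // Elements are Int, the accumulator is (sum, count).
  val sumAndCount: (Int, Int) = aggregate(partitions, (0, 0))(
    (acc, x) => (acc._1 + x, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))

  val average: Double = sumAndCount._1.toDouble / sumAndCount._2

  def main(args: Array[String]): Unit = println(average)
}
```

An average cannot be computed with fold alone, because fold's result must have the element type; carrying the count in the accumulator is what the separate seqOp and combOp make possible.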

Unit saveAsTextFile(path: String)

Saves the RDD's elements to a file, calling toString on each element.

Unit foreach(f: T => Unit)

Iterates over every element of the RDD.

rdd.foreach(println(_))

(none)
