Spark-RDD算子

Posted areyouready

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Spark-RDD算子相关的知识,希望对你有一定的参考价值。

一、Spark-RDD算子简介

    RDD(Resilient Distributed DataSet)是分布式数据集。RDD是Spark最基本的数据的抽象。
    scala中的集合。RDD相当于一个不可变、可分区、里面的元素可以并行计算的集合。
    
    RDD特点:
    具有数据流模型的特点
    自动容错
    位置感知调度
    可伸缩性
    RDD允许用户在执行多个查询时将工作集缓存在内存中,可以重用工作集,大大的提升了查询速度。
    
    RDD类型分为:
    1)Transformation
    转换
    2)Action
    动作

二、RDD创建

    RDD分为两种类型:
    1)Transformation(lazy-》懒加载)
    2)Action(触发任务)

例子:

    scala> sc.textFile("/root/words.txt")
    res8: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25

    scala> sc.textFile("/root/words.txt").flatMap(_.split(" "))
    res9: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[31] at flatMap at <console>:25

    scala> res8
    res10: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25

    scala> res8.flatMap(_.split(" "))
    res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at flatMap at <console>:27

    scala> res8.flatMap(_.split(" ")).map((_,1))
    res12: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:27

    scala> res8.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    res13: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:27

    scala> res13.collect
    res14: Array[(String, Int)] = Array((is,1), (love,2), (capital,1), (Beijing,2), (China,2), (I,2), (of,1), (the,1))

    scala> val rdd1 = sc.textFile("/root/words.txt")
    rdd1: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[39] at textFile at <console>:24

    scala> rdd1.count
    res15: Long = 3    
    
    scala> val list = List(1,3,5,7)
    list: List[Int] = List(1, 3, 5, 7)

    scala> list.map(_ * 100)
    res16: List[Int] = List(100, 300, 500, 700)

    scala> val rdd1 = sc.parallelize(list)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at parallelize at <console>:26

    scala> rdd1.map(_ * 100)
    res17: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at map at <console>:29

    scala> res17.collect
    res18: Array[Int] = Array(100, 300, 500, 700)

    scala> val rdd2 = sc.makeRDD(list)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at makeRDD at <console>:26

    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[43] at parallelize at <console>:24

    scala> val rdd2 = rdd1.map(_ * 1000)
    rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[44] at map at <console>:26

    scala> rdd2.collect
    res19: Array[Int] = Array(1000, 2000, 3000, 4000, 5000)

三、常用Transformation

    Transformation特点:
    1)生成新的RDD
    2)lazy懒加载 等待处理
    3)并不会存储真正的数据,记录了转换关系

1、map(func)
2、flatMap(func)
3、sortby
4、reduceByKey

    scala> sc.textFile("/root/words.txt")
    res8: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25

    scala> res8.flatMap(_.split(" "))
    res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at flatMap at <console>:27

    scala> res8.flatMap(_.split(" ")).map((_,1))
    res12: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:27

    scala> res8.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    res13: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:27

5、filter  过滤

    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24

    scala> rdd1.filter(_ % 2 == 0)
    res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[49] at filter at <console>:27

    scala> res24.collect
    res25: Array[Int] = Array(2, 4)

6、union  并集

    scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7))
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[50] at parallelize at <console>:24

    scala> rdd1.union(rdd2)
    res26: org.apache.spark.rdd.RDD[Int] = UnionRDD[51] at union at <console>:29

    scala> res26.collect
    res27: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)

    scala> rdd1 union rdd2
    res28: org.apache.spark.rdd.RDD[Int] = UnionRDD[52] at union at <console>:29

    scala> res28.collect
    res29: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)

7、groupByKey  分组

    scala> val rdd3 = sc.parallelize(List(("Tom",18),("John",16),("Tom",20),("Mary",17),("John",23)))
    rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[53] at parallelize at <console>:24

    scala> rdd3.groupByKey
    res30: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[54] at groupByKey at <console>:27

    scala> res30.collect
    res31: Array[(String, Iterable[Int])] = Array((John,CompactBuffer(16, 23)), (Tom,CompactBuffer(18, 20)), (Mary,CompactBuffer(17)))

8、intersection  交集

    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:24

    scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7))
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24

    scala> rdd1.intersection(rdd2)
    res33: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[62] at intersection at <console>:29

    scala> res33.collect
    res34: Array[Int] = Array(3, 4, 1, 5, 2)

9、join  关联

    scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24

    scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24

    scala> rdd1.join(rdd2)
    res39: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[69] at join at <console>:29

    scala> res39.collect
    res40: Array[(String, (Int, Int))] = Array((John,(16,26)), (Tom,(18,28)))

10、leftOuterJoin  左连接

保留左侧RDD,右侧如果join上保留,没join上None

    scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24

    scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24

    scala> rdd1.leftOuterJoin(rdd2)
    res41: org.apache.spark.rdd.RDD[(String, (Int, Option[Int]))] = MapPartitionsRDD[72] at leftOuterJoin at <console>:29

    scala> res41.collect
    res42: Array[(String, (Int, Option[Int]))] = Array((John,(16,Some(26))), (Tom,(18,Some(28))), (Mary,(21,None)))

11、rightOuterJoin  右连接

    scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24

    scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24

    scala> rdd1.rightOuterJoin(rdd2).collect
    res43: Array[(String, (Option[Int], Int))] = Array((John,(Some(16),26)), (Tom,(Some(18),28)), (Cat,(None,17)))

12、cartesian  笛卡尔积

    scala> val rdd1 = sc.parallelize(List("Tom","Mary"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[76] at parallelize at <console>:24

    scala> val rdd2 = sc.parallelize(List("John","Joe"))
    rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[77] at parallelize at <console>:24

    scala> rdd1.cartesian(rdd2)
    res45: org.apache.spark.rdd.RDD[(String, String)] = CartesianRDD[78] at cartesian at <console>:29

    scala> res45.collect
    res46: Array[(String, String)] = Array((Tom,John), (Tom,Joe), (Mary,John), (Mary,Joe))

四、常用的Action

1、collect  收集

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

    scala> rdd1.collect
    res0: Array[Int] = Array(1, 2, 3, 4)   

2、saveAsTextFile(path)  存储文件
三份数据:5B 5B 600B
理想切分:5+5+600=610 610/3 = 203
5B一片
5B一片
203一片
203一片
203一片
1一片

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24                                         

    scala> rdd1.saveAsTextFile("/root/RDD1")
    
    //查看分区数
    scala> rdd1.partitions.length
    res3: Int = 4

3、count  计数

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
                                                                                    
    scala> rdd1.count
    res2: Long = 4

4、reduce  聚合

    scala> val rdd2 = sc.parallelize(List(1,2,3,4),2)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

    scala> rdd2.partitions.length
    res4: Int = 2

    scala> rdd2.reduce(_+_)
    res5: Int = 10

5、countByKey()  根据key计数

    scala> sc.parallelize(List(("Tom",18),("Tom",28),("John",14),("Mary",16)))
    res9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:25

    scala> res9.count
    res10: Long = 4

    scala> res9.countByKey()
    res11: scala.collection.Map[String,Long] = Map(Tom -> 2, Mary -> 1, John -> 1)  

    scala> res9.reduceByKey(_+_)
    res12: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:27

    scala> res9.reduceByKey(_+_).collect
    res13: Array[(String, Int)] = Array((Tom,46), (Mary,16), (John,14))

6、take(n)  取出多少个元素

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24                                         

    scala> rdd1.take(2)
    res15: Array[Int] = Array(1, 2)
    
    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

    scala> rdd3.take(2)
    res17: Array[Int] = Array(3, 2)

7、first  返回RDD的第一个元素

    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

    scala> rdd3.first
    res18: Int = 3

8、takeOrdered(n)  取出多少个元素 默认正序

    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

    scala> rdd3.takeOrdered(2)
    res19: Array[Int] = Array(1, 2)

9、top(n)  倒序排序 取出元素

    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

    scala> rdd3.top(2)
    res20: Array[Int] = Array(8, 7)

五、spark高级算子 

1、mapPartitionsWithIndex(func)
设置分区,并且查看每个分区中存放的元素
查看每个分区中元素
需要传递函数作为参数
val func = (index:Int,iter:Iterator[(Int)]) => {iter.toList.map(x => "partID:" + index + "," + "datas:" + x + "]").iterator}

    scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

    scala> val func = (index:Int,iter:Iterator[(Int)]) => {iter.toList.map(x => "partID:" + index + "," + "datas:" + x + "]").iterator}
    func: (Int, Iterator[Int]) => Iterator[String] = <function2>

    scala> rdd3.mapPartitionsWithIndex(func)
    res21: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at mapPartitionsWithIndex at <console>:29

    scala> rdd3.mapPartitionsWithIndex(func).collect
    res22: Array[String] = Array(partID:0,datas:1], partID:0,datas:2], partID:0,datas:3], partID:1,datas:4], partID:1,datas:5], partID:1,datas:6], partID:1,datas:7])

2、aggregate
聚合,先局部后全局
max 取最大值
min 取最小值

    scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

    scala> rdd3.aggregate(0)(_+_,_+_)
    res23: Int = 28

    scala> rdd3.max
    res24: Int = 7

    scala> rdd3.min
    res25: Int = 1
    
    
    scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24

    scala> rdd3.aggregate(0)(math.max(_,_),_+_)
    res29: Int = 10

    scala> rdd3.aggregate(10)(math.max(_,_),_+_)
    res31: Int = 30

    // 1+2+3+20 + 4+5+6+7+20 + 20 = 88
    scala> rdd3.aggregate(20)(_+_,_+_)
    res32: Int = 88

    
    scala> val rdd4 = sc.parallelize(List("a","b","c","d","e"),2)
    rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[17] at parallelize at <console>:24

    scala> rdd4.aggregate("|")(_+_,_+_)
    res33: String = ||ab|cde

    scala> rdd4.aggregate("|")(_+_,_+_)
    res34: String = ||cde|ab
    

    scala> val rdd5 = sc.parallelize(List("12","23","234","3456"),2)
    rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24

    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res35: String = 24

    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res36: String = 42

    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res37: String = 24

    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res38: String = 42
    
    
    scala> val rdd6 = sc.parallelize(List("12","23","345",""),2)
    rdd6: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24

    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res41: String = 01

    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res42: String = 10

    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res43: String = 01

    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res48: String = 10
    

    scala> val rdd7 = sc.parallelize(List("12","23","","456"),2)
    rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24                                                             

    scala> rdd7.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res1: String = 11

    scala> ("").length
    res2: Int = 0

    scala> 0.length
    <console>:24: error: value length is not a member of Int
           0.length
             ^

    scala> 0.toString.length
    res5: Int = 1

    scala> rdd7.aggregate("0")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res6: String = 011

    scala> rdd7.aggregate("0")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res7: String = 011

3、aggregateByKey
根据key聚合,先局部再全局

    scala> val rdd8 = sc.parallelize(List(("cat",3),("cat",8),("mouse",6),("dog",8)))
    rdd8: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24

    scala> def func(index:Int,iter:Iterator[(String,Int)]):Iterator[String] = {iter.toList.map(x => "partID:" + index + "," + "values:" + x + "]").iterator}
    func: (index: Int, iter: Iterator[(String, Int)])Iterator[String]

    scala> rdd8.mapPartitionsWithIndex(func).collect
    res34: Array[String] = Array(partID:0,values:(cat,3)], partID:1,values:(cat,8)], partID:2,values:(mouse,6)], partID:3,values:(dog,8)])

    scala> rdd8.aggregateByKey(0)(_+_,_+_).collect
    res35: Array[(String, Int)] = Array((dog,8), (mouse,6), (cat,11))

4、combineByKey
aggregateByKey和reduceByKey底层调用都是combineByKey
最底层的方法,先局部累加,再全局累加

    scala> val rdd1 = sc.textFile("hdfs://192.168.146.111:9000/words.txt").flatMap(_.split("	")).map((_,1)).reduceByKey(_+_).collect
    rdd1: Array[(String, Int)] = Array((haha,1), (heihei,1), (hello,3), (Beijing,1), (world,1), (China,1))

    scala> val rdd2 = sc.textFile("hdfs://192.168.146.111:9000/words.txt").flatMap(_.split("	")).map((_,1)).combineByKey(x => x,(m:Int,n:Int) => (m+n),(a:Int,b:Int) => (a+b)).collect
    rdd2: Array[(String, Int)] = Array((haha,1), (heihei,1), (hello,3), (Beijing,1), (world,1), (China,1))

5、coalesce
coalesce(4,true)
分区数4
是否shuffle

repartition的实现,已默认加了shuffle

    scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24

    scala> rdd2.partitions.length
    res42: Int = 2

    scala> val rdd3 = rdd2.coalesce(4,true)
    rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at coalesce at <console>:26

    scala> rdd3.partitions.length
    res43: Int = 4

    scala> val rdd4 = rdd3.repartition(5)
    rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at repartition at <console>:28

    scala> rdd4.partitions.length
    res44: Int = 5

6、filterByRange
过滤出指定范围的元素

    scala> val rdd6 = sc.parallelize(List(("a",3),("b",2),("d",5),("e",8)))
    rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:24

    scala> rdd6.filterByRange("b","d").collect
    res45: Array[(String, Int)] = Array((b,2), (d,5))

    scala> rdd6.filterByRange("b","e").collect
    res46: Array[(String, Int)] = Array((b,2), (d,5), (e,8))

7、flatMapValues
切分出每个元素

    scala> val rdd7 = sc.parallelize(List(("a","3 6"),("b","2 5"),("d","5 8")))
    rdd7: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at parallelize at <console>:24

    scala> rdd7.flatMapValues(_.split(" ")).collect
    res47: Array[(String, String)] = Array((a,3), (a,6), (b,2), (b,5), (d,5), (d,8))

8、foldByKey
需求:根据key来拼接字符串

    scala> val rdd8 = sc.parallelize(List("Tom","John","Mary","Joe"),2)
    rdd8: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[31] at parallelize at <console>:24

    scala> val rdd9 = rdd8.map(x => (x.length,x))
    rdd9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[32] at map at <console>:26

    scala> rdd9.collect
    res48: Array[(Int, String)] = Array((3,Tom), (4,John), (4,Mary), (3,Joe))

    scala> rdd9.foldByKey("")(_+_).collect
    res49: Array[(Int, String)] = Array((4,JohnMary), (3,JoeTom))

9、foreach
遍历元素

import org.apache.spark.{SparkConf, SparkContext}

object ForeachDemo {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //创建rdd
    val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
    rdd1.foreach(println(_))

    sc.stop()
  }
}

结果:

技术分享图片

10、keyBy
以什么为key
keys values
拿到key 拿到value

    scala> val rdd2 = sc.parallelize(List("Tom","John","Jack"),3)
    rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[37] at parallelize at <console>:24

    scala> val rdd3 = rdd2.keyBy(_.length)
    rdd3: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[38] at keyBy at <console>:26

    scala> rdd3.collect
    res60: Array[(Int, String)] = Array((3,Tom), (4,John), (4,Jack))

    scala> rdd3.keys.collect
    res61: Array[Int] = Array(3, 4, 4)

    scala> rdd3.values.collect
    res62: Array[String] = Array(Tom, John, Jack)

六、RDD并行化流程

技术分享图片

 



































以上是关于Spark-RDD算子的主要内容,如果未能解决你的问题,请参考以下文章

Spark-RDD

Spark-RDD创建并行分区

Spark-RDD 模型 以及运行原理

Flink 多流转换算子

Spark-RDD/DataFrame/DateSet

Spark——RDD算子