Spark RDD Demystified (DT Big Data Dream Factory)

Posted 2020-06-13


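Stage one: master Spark thoroughly.

Stage two: start from zero and work through real projects.

Hadoop is the infrastructure of big data: storage and so on.

Spark is where the computational core lies.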

1. RDD: an application abstraction based on working sets

2. RDD internals demystified

3. Reflections on RDD

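Nobody becomes a Spark expert without mastering RDD.

Truly mastering RDD greatly improves your ability to solve problems.

Every higher-level framework wraps RDD underneath; RDD provides the general-purpose foundation.

RDD is the cornerstone abstraction of Spark.

A top-level Spark expert:

1. can solve problems and tune performance;

2. takes Spark and modifies it as needed.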

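==========An application abstraction based on working sets============

MapReduce is dataset-based.

Working-set and dataset approaches share the same features:

location awareness, fault tolerance, and load balancing.

Dataset-based processing works by loading data from physical storage, operating on it, and then writing it back to physical storage.

Dataset-based operation does not suit two scenarios: 1. heavy iteration; 2. interactive queries. The key point is that this data-flow approach cannot reuse earlier results or intermediate results.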

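RDD is working-set based and adds one more property, resilience: Resilient Distributed Dataset.

Resilience means:
1. data is swapped between memory and disk automatically;
2. efficient fault tolerance based on lineage;
3. a failed Task is retried a set number of times;
4. a failed Stage is retried a set number of times, and only the failed partitions are recomputed;
5. checkpoint and persist;
6. the DAG and Task scheduling are independent of resource management;
7. partitioning itself is highly elastic through repartition (when partitions are small, efficiency may suffer, so several are merged; when partitions are too large, they may be split).

For example, 1,000,000 may be split into 10,000 partitions.

Going from 10,000 partitions to 100,000 partitions generally requires a shuffle.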

/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * Note: With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
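To see how the two methods above differ in practice, here is a minimal sketch that only inspects partition counts. It assumes a Spark dependency on the classpath and a `SparkContext` supplied by the caller; the object name `PartitionSketch` and the numbers are made up for illustration.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper, not part of Spark.
object PartitionSketch {
  /** Partition counts of (base, coalesce(10), coalesce(200), repartition(200)). */
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    val base = sc.parallelize(1 to 10000, 100)

    // Narrow dependency: each new partition claims 10 of the old ones, no shuffle.
    val merged = base.coalesce(10)

    // Without shuffle = true, coalesce cannot raise the partition count.
    val notGrown = base.coalesce(200)

    // repartition is coalesce(n, shuffle = true), so it can grow the count.
    val grown = base.repartition(200)

    (base.partitions.length, merged.partitions.length,
      notGrown.partitions.length, grown.partitions.length)
  }
}
```

With a local context, for example `new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))`, no job has to run: the partition counts are known from the RDD graph alone.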

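Location awareness: in MapReduce, location is no longer tracked after partitioning; in Spark, the location is still known after partitioning.

Working-set based: a Spark RDD can reuse earlier results or intermediate results. For example, if 100 people query the same thing and one of them has already run the query, the others can reuse that result or its intermediate results. Spark takes care of this for you.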
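The reuse described here can be sketched with `cache()`: the first action computes the RDD and keeps its partitions in memory, and later actions read those partitions instead of recomputing them (assuming they fit in memory). This is an illustrative sketch; `ReuseSketch` and the data are made up, and a `SparkContext` is assumed to be supplied by the caller.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object ReuseSketch {
  /** Runs two actions over the same cached RDD. */
  def sumAndMax(sc: SparkContext): (Long, Int, StorageLevel) = {
    // cache() only marks the RDD; nothing is computed yet.
    val squares = sc.parallelize(1 to 100, 4).map(x => x * x).cache()

    // First action: computes the squares and stores the partitions.
    val total = squares.map(_.toLong).reduce(_ + _)

    // Second action: can be served from the cached partitions.
    val largest = squares.max()

    (total, largest, squares.getStorageLevel)
  }
}
```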

If a Stage contains 1,000 steps, by default a result is produced only once.

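An RDD is a read-only, partitioned collection. It is an abstraction for distributed functional programming. The framework takes care of the distribution, so beginners only need to write the code; intermediate and advanced users must know the internals.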

==========lazy============

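The core of RDD is laziness: nothing is computed up front; each operation only records a marker of what is to be done.

f(x) = x + 1

x = y + 1

y = z + 1

For example: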

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

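The first argument to MapPartitionsRDD is this, i.e. the parent RDD, so each call only records one more layer. That is what lazy means here: everything is evaluated together at the end!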

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

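map likewise passes this, the parent RDD, as the first argument, so it too only records one more layer. It is lazy, and everything is evaluated together at the end!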
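The layering idea can be shown without Spark at all. The following is a toy model in plain Scala, not Spark's real classes: `ToyRDD`, `ToySource`, `ToyMapPartitions` and the `sourceReads` counter are all invented for illustration. Each transformation wraps its parent, and the source is only read when `collect()` is finally called.

```scala
// Toy model of lazy lineage; none of these names exist in Spark.
object LazyLineageSketch {
  // Counts how many times the source data is actually read.
  var sourceReads = 0

  abstract class ToyRDD[T] {
    def compute(): Iterator[T]

    // Transformations only build a new node that remembers `this` as parent.
    def map[U](f: T => U): ToyRDD[U] =
      new ToyMapPartitions[U, T](this, iter => iter.map(f))

    def flatMap[U](f: T => Iterable[U]): ToyRDD[U] =
      new ToyMapPartitions[U, T](this, iter => iter.flatMap(f))

    // The action: walks the chain from the last node back to the source.
    def collect(): List[T] = compute().toList
  }

  class ToySource[T](data: Seq[T]) extends ToyRDD[T] {
    def compute(): Iterator[T] = { sourceReads += 1; data.iterator }
  }

  class ToyMapPartitions[U, T](parent: ToyRDD[T], f: Iterator[T] => Iterator[U])
      extends ToyRDD[U] {
    def compute(): Iterator[U] = f(parent.compute())
  }
}
```

Building `new ToySource(Seq("a b", "c")).flatMap(_.split(" ").toList).map(_.toUpperCase)` leaves `sourceReads` at 0; only calling `collect()` on it reads the source.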

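The larger the data and the more computation steps there are, the more pronounced the advantage over MapReduce.

There is one drawback: because earlier results are constantly reused, memory consumption is real. Memory is generally not very expensive these days, though, and large amounts are easy to buy.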

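==========Fault tolerance============

Every Spark operation produces a new RDD derived from its parent RDD. At execution time Spark traces back step by step, from the last RDD towards the first. Because of this chain, the cost of fault tolerance is very low.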
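The parent chain can be inspected through the public `dependencies` member of an RDD. The sketch below walks it backwards and collects class names; `LineageSketch` and `firstParentChain` are made-up names, and the walk follows only the first parent at each step, which is enough for a linear chain. Spark's own `toDebugString` prints a fuller picture of the lineage.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark.
object LineageSketch {
  /** Class names from the given RDD back to its source, first parent only. */
  def firstParentChain(rdd: RDD[_]): List[String] = {
    val self = rdd.getClass.getSimpleName
    rdd.dependencies.headOption match {
      case Some(dep) => self :: firstParentChain(dep.rdd) // keep walking backwards
      case None      => List(self)                        // reached a source RDD
    }
  }
}
```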

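Conventional approaches to fault tolerance: data checkpointing, and logging updates to the data.

Data checkpointing keeps recording and checking the data.

Drawbacks of logging updates: 1. it is complex, since every small change to the data must be logged; 2. operating on global data easily gets out of control and costs performance; 3. recomputation is hard to handle.

Why, then, is RDD so efficient even though it works by logging updates?

1. RDDs are immutable, and lazy;

2. RDD writes are coarse-grained, for the sake of efficiency and simplicity: each operation applies to the whole dataset. If the granularity were too fine, efficiency would drop.

RDD reads can be either coarse-grained or fine-grained.

RDD is not suited to fine-grained or asynchronous applications.

The computation logic is the same across all partitions of an RDD, and it is carried out by compute: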

/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

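Every RDD computation returns an iterator. The benefit is that the framework does not need to care what the result is, so all frameworks integrate seamlessly. Machine learning can call Spark SQL directly, for example, and the same goes for a framework you write yourself: your own framework and Spark's frameworks can call each other.

So this is at the level of nuclear fission: whenever Spark ships a new framework, or you write one yourself, every framework gains capabilities.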
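To make the `compute` contract concrete, here is a minimal sketch of a custom RDD. `ToyRangePartition` and `ToyRangeRDD` are hypothetical names; the sketch relies only on the `getPartitions` and `compute` extension points of `RDD`, and passes `Nil` as the dependency list because it has no parent.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: the half-open integer range [start, end).
class ToyRangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD producing 0 until n, split across `slices` partitions.
class ToyRangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {

  // Describes the partitions; no data is produced here.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices) { i =>
      new ToyRangePartition(i, i * n / slices, (i + 1) * n / slices)
    }

  // The same logic runs for every partition and hands back an iterator.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[ToyRangePartition]
    (p.start until p.end).iterator
  }
}
```

Because the result is just an `RDD[Int]`, the usual operators compose with it, e.g. `new ToyRangeRDD(sc, 10, 3).map(_ * 2).collect()`.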

/**
* Optionally overridden by subclasses to specify placement preferences.
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
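Placement preferences can be observed from the outside through `preferredLocations`. The sketch below uses the `makeRDD` overload that takes a location list per element and creates one partition per element; the host names and the `LocationSketch` object are made up, and the hosts do not need to exist merely to read the preferences back.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper, not part of Spark.
object LocationSketch {
  /** Preferred hosts of each partition, in partition order. */
  def preferences(sc: SparkContext): Seq[Seq[String]] = {
    // One partition per element, each with its own (made-up) preferred hosts.
    val rdd = sc.makeRDD(Seq(
      ("block-1", Seq("host-a")),
      ("block-2", Seq("host-b", "host-c"))))

    rdd.partitions.toSeq.map(p => rdd.preferredLocations(p))
  }
}
```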

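It can handle every kind of data processing except real-time transactional processing.

To be precise: it cannot do real-time transactional processing, which is not the same as being unable to do real-time processing.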

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()

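this.type is a top-class piece of design: because the call returns the very object it was invoked on, with its concrete type intact, Spark calls can be chained freely.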
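What `this.type` buys can be shown in plain Scala with a toy class hierarchy. `ToyRDD` and `ToyPairRDD` below are invented for illustration and only mimic the shape of `persist()` and `cache()`: because the return type is `this.type` rather than the base class, a subclass instance is still known to be that subclass after the call.

```scala
// Toy classes; they only mimic the signatures shown above.
object ThisTypeSketch {
  class ToyRDD {
    var level: String = "NONE"

    // this.type means "exactly the object this was called on".
    def persist(newLevel: String): this.type = { level = newLevel; this }

    def cache(): this.type = persist("MEMORY_ONLY")
  }

  class ToyPairRDD extends ToyRDD {
    // A method that exists only on the subclass.
    def keysOnly: String = "keys"
  }
}
```

Had `cache()` been declared to return `ToyRDD`, the expression `new ToyPairRDD().cache().keysOnly` would not compile; with `this.type` it does.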

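Spark aims to be a unified, all-in-one data processing framework.

RDD limitations: no support for fine-grained write (update) operations or for incremental iterative computation.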

Instructor 王家林 (Wang Jialin):

China's leading Spark expert

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains


This article comes from the “一枝花傲寒” blog. Please do not repost.
