Spark高级排序彻底解密(DT大数据梦工厂)

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Spark高级排序彻底解密(DT大数据梦工厂)相关的知识,希望对你有一定的参考价值。

内容:

1、基础排序算法实战;

2、二次排序算法实战;

3、更高局级别排序算法;

4、排序算法内幕解密;

为啥讲排序?因为在应用的时候都有排序要求。

海量数据经常排序之后要我们想要的内容。

==========基础排序算法============

scala> sc.setLogLevel("WARN")

scala> val x = sc.textFile("/historyserverforSpark/README.md", 3).flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_,1)

x: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:27

scala> x.map(pair=>(pair._2,pair._1)).sortByKey(false).map(pair=>(pair._2,pair._1)).collect

16/02/08 20:34:00 WARN client.AppClient$ClientEndpoint: Connection to Worker2:7077 failed; waiting for master to reconnect...

16/02/08 20:34:01 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...

16/02/08 20:34:01 WARN client.AppClient$ClientEndpoint: Connection to Worker2:7077 failed; waiting for master to reconnect...

res11: Array[(String, Int)] = Array(("",67), (the,21), (to,14), (Spark,13), (for,11), (and,10), (##,8), (a,8), (run,7), (is,6), (can,6), (on,5), (of,5), (in,5), (if,4), (also,4), (you,4), (build,3), (including,3), (Please,3), (use,3), (or,3), (Hadoop,3), (documentation,3), (example,3), (an,3), (You,3), (with,3), (For,2), (This,2), (Hive,2), (To,2), (SparkPi,2), (refer,2), (Interactive,2), (be,2), (./bin/run-example,2), (1000:,2), (tests,2), (examples,2), (at,2), (using,2), (Shell,2), (class,2), (`examples`,2), (set,2), (Hadoop,,2), (cluster,2), (supports,2), (Python,2), (particular,2), (general,2), (locally,2), (following,2), (which,2), (should,2), ([project,2), (do,2), (how,2), (It,2), (Scala,2), (detailed,2), (return,2), (one,2), (Python,,2), (building,2), (that,2), (SQL,2), (guidance...

/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collector `saveon the resulting RDD will return or output an ordered list of records
 * (in the `savecase, they will be written to multiple `part-Xfiles in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn‘t work on P other than Tuple2!
def sortByKey(ascending: Boolean truenumPartitions: Int = self.partitions.length)
    : RDD[(KV)] = self.withScope
{
  val part = new RangePartitioner(numPartitionsselfascending)
  new ShuffledRDD[KVV](selfpart)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

reduceByKey(_+_,1),RDD的并发是继承的

技术分享

==========二次排序实战============

所谓二次排序就是排序的时候考虑两个纬度

建立HelloSort.txt,内容:

2 3

4 1

3 2

4 3

9 7

2 1

第一列有2个4,无法排了,这个时候就要借助第二列了,这个就是二次排序

二次排序知识入门,如果是10次排序,20次排序,100次排序呢?怎么办?所以二次排序要把这些都考虑了。。。

自定义Key才是以不变应万变!

JAVA版本代码

package com.dt.spark.SparkApps.cores;

import java.io.Serializable;

import scala.math.Ordered;

/**

 * 自定义二次排序

 * @author 威

 *

 */

public class SecondarySortKey implements Ordered<SecondarySortKey>,Serializable {

    //需要二次排序的key

    private int first;

    public int getFirst() {

        return first;

    }

    public void setFirst(int first) {

        this.first = first;

    }

    public int getSecond() {

        return second;

    }

    public void setSecond(int second) {

        this.second = second;

    }

    private int second;

    public SecondarySortKey(int first,int second){

        this.first = first;

        this.second = second;

    }

    @Override

    public boolean $greater(SecondarySortKey other) {

        if(this.first>other.getFirst()){

            return true;

        }else if(this.first == other.getFirst()&&this.second>other.getSecond()){

            return true;

        }

        return false;

    }

    @Override

    public boolean $greater$eq(SecondarySortKey other) {

        if(this.$greater(other)){

            return true;

        }else if(this.first==other.getFirst()&&this.second==other.getSecond()){

            return true;

        }

        return false;

    }

    @Override

    public boolean $less(SecondarySortKey other) {

        if(this.first<other.getFirst()){

            return true;

        }else if(this.first==other.getFirst()&&this.second<other.getSecond()){

            return true;

        }

        return false;

    }

    @Override

    public boolean $less$eq(SecondarySortKey other) {

        if(this.$less(other)){

            return true;

        }else if(this.first==other.getFirst()&&this.second==other.getSecond()){

            return true;

        }

        return false;

    }

    @Override

    public int compare(SecondarySortKey other) {

        if(this.first-other.getFirst() != 0){

            return this.first-other.getFirst();

        }else{

            return this.second-other.getSecond();

        }

    }

    @Override

    public int compareTo(SecondarySortKey other) {

        if(this.first-other.getFirst() != 0){

            return this.first-other.getFirst();

        }else{

            return this.second-other.getSecond();

        }

    }

    @Override

    public int hashCode() {

        final int prime = 31;

        int result = 1;

        result = prime * result + first;

        result = prime * result + second;

        return result;

    }

    @Override

    public boolean equals(Object obj) {

        if (this == obj)

            return true;

        if (obj == null)

            return false;

        if (getClass() != obj.getClass())

            return false;

        SecondarySortKey other = (SecondarySortKey) obj;

        if (first != other.first)

            return false;

        if (second != other.second)

            return false;

        return true;

    }

}

package com.dt.spark.SparkApps.cores;


import org.apache.spark.SparkConf;

import org.apache.spark.api.java.JavaPairRDD;

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.function.Function;

import org.apache.spark.api.java.function.PairFunction;

import org.apache.spark.api.java.function.VoidFunction;


import scala.Tuple2;


/**

 * 二次排序

 * @author

 * 第一步:按照ordered和 Serializable接口实现自定义排序的key

 * 第二步:将要进行二次排序的文件加载进来生成key,value类型的RDD

 * 第三步:基于自定义的使用sortByKey进行二次排序

 * 第四步:去除掉排序的key,只保留排序的结果

 */

public class SecondarySortApp {

        public static void main(String[] args) {

              SparkConf conf = new SparkConf().setAppName("SecondarySortApp" ).setMaster("local");

              JavaSparkContext sc = new JavaSparkContext(conf);// 其地产曾实际上就是Scala的SparkContext

              JavaRDD<String> lines = sc.textFile("F:/exepro/scala/HelloSort.txt" );

              

              JavaPairRDD<SecondarySortKey, String> pairs = lines.mapToPair( new PairFunction<String, SecondarySortKey, String>() {

                      @Override

                      public Tuple2<SecondarySortKey, String> call(String line) throws Exception {

                           String[] splited = line.split( " ");

                           SecondarySortKey key = new SecondarySortKey(Integer.valueOf(splited[0]),Integer. valueOf(splited[1]));

                            return new Tuple2<SecondarySortKey,String>(key,line);

                     }

              });

              

              JavaPairRDD<SecondarySortKey, String> sorted = pairs.sortByKey();

              

               //过滤掉排序后自定义的key,保留排序的结果

               JavaRDD<String> secondarySorted = sorted .map(new Function<Tuple2<SecondarySortKey,String>, String>() {


                      @Override

                      public String call(Tuple2<SecondarySortKey, String> sortedContent) throws Exception {

                           

                            return sortedContent ._2 ;

                     }

              });

              

               secondarySorted.foreach(new VoidFunction<String>() {


                      @Override

                      public void call(String str) throws Exception {

                            // TODO Auto-generated method stub

                           System. out.println(str );

                     }

              });

              

              

       }

}

结果:

16/02/08 21:37:49 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms

2 1

2 3

3 2

4 1

4 3

9 7

16/02/08 21:37:49 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver

Scala版本代码:

package com.dt.spark.cores

/**
  * 自定义二次排序
  * Created by 威 on 2016/2/9.
  */
class SecondarySortKey(val first:Int,val second:Intextends Ordered[SecondarySortKey] with Serializable {
  def compare(other:SecondarySortKey): Int ={
    if(this.first-other.first!=0){
      this.first-other.first
    }else{
      this.second-other.second
    }
  }
}

package com.dt.spark.cores

import org.apache.spark.{SparkContextSparkConf}

/**
  * 二次排序
  *
  * @author 
  * 第一步:按照ordered和Serializable接口实现自定义排序的key
  * 第二步:将要进行二次排序的文件加载进来生成key,value类型的RDD
  * 第三步:基于自定义的使用sortByKey进行二次排序
  * 第四步:去除掉排序的key,只保留排序的结果
  */
object SecondarySortApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()//创建SparkConf对象
    conf.setAppName("SecondarySortApp")//设置应用程序的名称,在程序运行的监控界面可以看到名称
    conf.setMaster("local")//此时程序在本地运行,不需要安装Spark集群
    val sc = new SparkContext(conf)//通过创建SparkContext对象,通过传入SparkConf实例来定制Spark运行的具体参数和配置信息
    val lines = sc.textFile("F:/exepro/scala/HelloSort.txt")//通过HadoopRDD以及MapPartitionRDD获取文件中每一行的内容本身

    val pairWithSortKey = lines.map(line=>{
      val splited = line.split(" ")
      (new SecondarySortKey(splited(0).toInt,splited(1).toInt),line)
    })

    val sorted = pairWithSortKey.sortByKey(false)

    val secondarySorted = sorted.map(sortedLine=>sortedLine._2)

    secondarySorted.collect().foreach(println)

  }
}

Scala太简洁了。

作业:

1、用Scala实现二次排序(三次排序),key用object的apply来实现;

2、自己去阅读RangePartitioner的源代码;


本文出自 “一枝花傲寒” 博客,谢绝转载!

以上是关于Spark高级排序彻底解密(DT大数据梦工厂)的主要内容,如果未能解决你的问题,请参考以下文章

Spark Executor内幕彻底解密(DT大数据梦工厂)

Spark Sort-Based Shuffle内幕彻底解密(DT大数据梦工厂)

彻底解密WordCount运行原理(DT大数据梦工厂)

Master HA彻底解密(DT大数据梦工厂)

SparkRDD解密(DT大数据梦工厂)

CacheManager彻底解密:CacheManager运行原理流程图和源码详解(DT大数据梦工厂)