Spark Basic Programming Notes 01
Posted by Weikun Xing
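Introduction to Spark
Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. It was open-sourced by the UC Berkeley AMP Lab as a general parallel framework in the style of Hadoop MapReduce. Spark keeps the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output can be kept in memory, so it does not have to be written to and read back from HDFS between steps. This makes Spark a better fit for iterative algorithms such as those used in data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark perform better on certain workloads: Spark provides in-memory distributed datasets, which support interactive queries and also speed up iterative workloads.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as local collection objects.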
Startup (single-machine pseudo-distributed mode)
cd /usr/local/hadoop/sbin/
./start-all.sh
cd /usr/local/spark/sbin/
./start-all.sh
cd /usr/local/spark/bin/
./spark-shell
Spark context Web UI available at http://192.168.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1647598333367).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
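Data
Creating RDDs from student score data
Creating an RDD from data already in memory
parallelize
Create an RDD with parallelize. With no partition count given, this session produces 2 partitions (the default follows sc.defaultParallelism). Passing 3 as the second argument creates the RDD with 3 partitions, as the query confirms.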
scala> val data=Array(1,2,3,4,5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData=sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:26
scala> distData.partitions.size
res3: Int = 2
scala> val distData=sc.parallelize(data,3)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:26
scala> distData.partitions.size
res4: Int = 3
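The partition behaviour shown in the transcript above can be checked with a short sketch like the following. It assumes it is pasted into spark-shell (or run as a Scala script with Spark on the classpath); the application name is an arbitrary placeholder, and glom() is used here only to make the split visible.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the session spark-shell already started, if there is one.
val spark = SparkSession.builder().master("local[*]").appName("parallelize-sketch").getOrCreate()
val sc = spark.sparkContext

val data = Array(1, 2, 3, 4, 5)

// No second argument: the partition count falls back to sc.defaultParallelism.
val byDefault = sc.parallelize(data)
println(s"default partitions: ${byDefault.getNumPartitions} (defaultParallelism = ${sc.defaultParallelism})")

// Explicit partition count as the second argument.
val inThree = sc.parallelize(data, 3)
println(s"explicit partitions: ${inThree.getNumPartitions}")

// glom() turns each partition into one array, so the split can be printed.
val parts = inThree.glom().collect()
parts.foreach(part => println(part.mkString("[", ", ", "]")))
val sizes = parts.map(_.length)
```

The default count depends on the machine and master setting, so it will not necessarily be 2 elsewhere.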
makeRDD
scala> val seq=Seq((1,Seq("iteblog.com","sparkhost1.con")),(3,Seq("iteblog.com","sparkhost2.com")),(2,Seq("iteblog.com","sparkhost3.com")))
seq: Seq[(Int, Seq[String])] = List((1,List(iteblog.com, sparkhost1.con)), (3,List(iteblog.com, sparkhost2.com)), (2,List(iteblog.com, sparkhost3.com)))
scala> val iteblog=sc.makeRDD(seq)
iteblog: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at <console>:26
scala> iteblog.collect
res5: Array[Int] = Array(1, 3, 2)
scala> iteblog.partitions.size
res6: Int = 3
scala> iteblog.preferredLocations(iteblog.partitions(0))
res7: Seq[String] = List(iteblog.com, sparkhost1.con)
scala> iteblog.preferredLocations(iteblog.partitions(1))
res8: Seq[String] = List(iteblog.com, sparkhost2.com)
scala> iteblog.preferredLocations(iteblog.partitions(2))
res10: Seq[String] = List(iteblog.com, sparkhost3.com)
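The result above may look surprising: the input is a sequence of pairs, yet the RDD has type RDD[Int]. makeRDD has two overloads. Given a plain sequence it behaves like parallelize; given a sequence of (value, hosts) pairs it keeps only the values, creates one partition per value, and records the host list as that partition's preferred locations. A minimal sketch of the difference, assuming a spark-shell session; the host names are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("makeRDD-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical host names, used only as location hints.
val withHosts = Seq(
  (1, Seq("host-a")),
  (3, Seq("host-b")),
  (2, Seq("host-c"))
)

// Overload 1: plain sequence, same result as parallelize.
val plain = sc.makeRDD(Seq(1, 3, 2))

// Overload 2: (value, hosts) pairs. Elements are the values; hosts become location hints.
val located = sc.makeRDD(withHosts)
println(located.collect().mkString(", "))
println(located.partitions.length)
println(located.preferredLocations(located.partitions(0)))

// To keep the pairs themselves as elements, use parallelize instead.
val asPairs = sc.parallelize(withHosts)
println(asPairs.first())
```

Preferred locations are scheduling hints only; in local mode they have no visible effect on where tasks run.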
Creating an RDD from external storage
Creating an RDD from an HDFS file
Place a file at "/user/root/test.txt" on HDFS:
hadoop@master:~$ hdfs dfs -mkdir -p /user/root/
hadoop@master:~$ cd /home/hadoop/桌面/
hadoop@master:~/桌面$ touch test.txt
hadoop@master:~/桌面$ vim test.txt
hadoop@master:~/桌面$ hdfs dfs -put test.txt /user/root/test.txt
hadoop@master:~/桌面$ hdfs dfs -cat /user/root/test.txt
hello
Read the file to create an RDD
scala> val test=sc.textFile("/user/root/test.txt")
test: org.apache.spark.rdd.RDD[String] = /user/root/test.txt MapPartitionsRDD[9] at textFile at <console>:24
Creating an RDD from a local Linux file
scala> val test=sc.textFile("file:///usr/local/hadoop/etc/hadoop/core-site.xml")
test: org.apache.spark.rdd.RDD[String] = file:///usr/local/hadoop/etc/hadoop/core-site.xml MapPartitionsRDD[15] at textFile at <console>:24
scala> test.count
res13: Long = 30
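The local-file case can be reproduced without HDFS. The sketch below writes a small temporary file (hypothetical content, not from the article) and reads it back through an explicit file:// URI, so the read does not depend on the configured default file system. It assumes a spark-shell session in local mode.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("textFile-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample file in the local temp directory.
val tmp = Files.createTempFile("textfile-sketch", ".txt")
Files.write(tmp, "hello\nspark\nrdd\n".getBytes(StandardCharsets.UTF_8))

// Path.toUri produces a file:// URI, the same scheme the transcript uses.
val lines = sc.textFile(tmp.toUri.toString)
println(lines.count())
lines.collect().foreach(println)
```

On a real cluster a file:// path has to exist at the same location on every worker node, which is one reason shared storage such as HDFS is usually preferred.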
Task implementation
hadoop@master:/home/dblab$ hdfs dfs -put student.txt /user/root/
hadoop@master:/home/dblab$ hdfs dfs -put result_bigdata.txt /user/root/
hadoop@master:/home/dblab$ hdfs dfs -put result_math.txt /user/root/
bigdata is the RDD created from the Big Data Fundamentals score table
math is the RDD created from the Mathematics score table
scala> val bigdata=sc.textFile("/user/root/result_bigdata.txt")
bigdata: org.apache.spark.rdd.RDD[String] = /user/root/result_bigdata.txt MapPartitionsRDD[17] at textFile at <console>:24
scala> val math=sc.textFile("/user/root/result_math.txt")
math: org.apache.spark.rdd.RDD[String] = /user/root/result_math.txt MapPartitionsRDD[19] at textFile at <console>:24
Finding the top 5 in the student score tables
Transforming data with map
Use map to square each value
scala> val distData=sc.parallelize(List(1,3,45,3,76))
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24
scala> val sq_dist=distData.map(x=>x*x)
sq_dist: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at map at <console>:26
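Sorting with sortBy()
Create an RDD from a List of three 2-tuples, then sort it in descending order by the second element of each tuple, with the number of partitions set to 1.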
scala> val data=sc.parallelize(List((1,3),(45,3),(7,6)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:24
scala> val sort_data=data.sortBy(x=>x._2,false,1)
sort_data: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[25] at sortBy at <console>:26
Retrieving results with collect()
Retrieve the two RDDs computed above
scala> sq_dist.collect
res15: Array[Int] = Array(1, 9, 2025, 9, 5776)
scala> sort_data.collect
res16: Array[(Int, Int)] = Array((7,6), (1,3), (45,3))
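The positional arguments false and 1 in the sortBy call above are easier to read with their parameter names, ascending and numPartitions. The following sketch, assuming a spark-shell session, repeats the sort with named arguments and shows the default ascending order for comparison.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sortBy-sketch").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(List((1, 3), (45, 3), (7, 6)))

// Descending by the second tuple element, result in a single partition.
val desc = pairs.sortBy(_._2, ascending = false, numPartitions = 1)

// Defaults: ascending order, same number of partitions as the input.
val asc = pairs.sortBy(_._2)

println(desc.collect().mkString(", "))
println(asc.collect().mkString(", "))
```

sortBy is a transformation, so nothing is computed until an action such as collect is called.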
collect also has a less commonly used form that takes a partial function
scala> val one:PartialFunction[Int,String]={case 1 => "one";case _ => "other"}
one: PartialFunction[Int,String] = <function1>
scala> val data=sc.parallelize(List(2,3,1))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> data.collect(one).collect
res17: Array[String] = Array(other, other, one)
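The partial function above is defined for every input, so it hides what this form of collect is useful for: elements the partial function does not cover are dropped. In effect it is a filter followed by a map. A small sketch, assuming a spark-shell session; the evenLabel function and its labels are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collect-pf-sketch").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(List(2, 3, 1, 4))

// Hypothetical partial function: defined only for even numbers.
val evenLabel: PartialFunction[Int, String] = { case n if n % 2 == 0 => s"even:$n" }

// collect(pf) is a transformation returning an RDD; the second collect() is the action.
val viaCollect = nums.collect(evenLabel).collect()

// The same result written as an explicit filter plus map.
val viaFilterMap = nums.filter(evenLabel.isDefinedAt).map(evenLabel).collect()

println(viaCollect.mkString(", "))
```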
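Transforming data with flatMap
When the strings are split with map, each input element produces one array, so the result is an RDD of arrays. flatMap applies the same function and then flattens the three arrays, so every word is stored as an element at the same level in the new RDD.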
scala> val test=sc.parallelize(List("How are you","I am fine","What about you"))
test: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[29] at parallelize at <console>:24
scala> test.collect
res18: Array[String] = Array(How are you, I am fine, What about you)
scala> test.map(x=>x.split(" ")).collect
res19: Array[Array[String]] = Array(Array(How, are, you), Array(I, am, fine), Array(What, about, you))
scala> test.flatMap(x=>x.split(" ")).collect
res20: Array[String] = Array(How, are, you, I, am, fine, What, about, you)
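One way to see the difference is to compare element counts: map keeps one element per input line, while flatMap yields one element per word. The sketch below, assuming a spark-shell session, also shows that flattening the map result afterwards gives the same elements as flatMap.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("flatMap-sketch").getOrCreate()
val sc = spark.sparkContext

val sentences = sc.parallelize(List("How are you", "I am fine", "What about you"))

val nested = sentences.map(_.split(" "))     // RDD[Array[String]]: one array per sentence
val flat = sentences.flatMap(_.split(" "))   // RDD[String]: one element per word

println(s"map count: ${nested.count()}, flatMap count: ${flat.count()}")

// Flattening the nested result afterwards gives the same words in the same order.
val flattenedLater = nested.flatMap(_.toList)
println(flattenedLater.collect().mkString(", "))
```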
Retrieving the first few values with take()
scala> val data=sc.parallelize(1 to 10)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:24
scala> data.take(5)
res21: Array[Int] = Array(1, 2, 3, 4, 5)
scala> data.collect
res22: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
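Note that take(n) returns the first n elements in partition order and does not sort. For a "top N" query the data has to be sorted first, or one of the ordering-aware actions can be used. A sketch with made-up scores, assuming a spark-shell session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("take-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical unsorted scores, spread over 2 partitions.
val scores = sc.parallelize(List(94, 100, 87, 99, 100, 91), 2)

val firstThree = scores.take(3)            // first three as stored, not the largest
val topThree = scores.top(3)               // three largest, in descending order
val lowestThree = scores.takeOrdered(3)    // three smallest, in ascending order
val sortedThenTake = scores.sortBy(x => x, ascending = false).take(3)

println(firstThree.mkString(", "))
println(topThree.mkString(", "))
println(lowestThree.mkString(", "))
```

top and takeOrdered avoid a full sort of the RDD, which can matter on large data; sortBy followed by take is the form the article uses.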
Task implementation
scala> val bigdata=sc.textFile("/user/root/result_bigdata.txt")
bigdata: org.apache.spark.rdd.RDD[String] = /user/root/result_bigdata.txt MapPartitionsRDD[40] at textFile at <console>:24
scala> val math=sc.textFile("/user/root/result_math.txt")
math: org.apache.spark.rdd.RDD[String] = /user/root/result_math.txt MapPartitionsRDD[42] at textFile at <console>:24
scala> val m_bigdata=bigdata.map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
m_bigdata: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[43] at map at <console>:26
scala> val m_math=math.map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
m_math: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[44] at map at <console>:26
scala> val sort_bigdata=m_bigdata.sortBy(x=>x._3,false)
sort_bigdata: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[49] at sortBy at <console>:28
scala> val sort_math=m_math.sortBy(x=>x._3,false)
sort_math: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[54] at sortBy at <console>:28
scala> sort_bigdata.take(5)
res23: Array[(String, String, Int)] = Array((1003,大数据基础,100), (1007,大数据基础,100), (1004,大数据基础,99), (1002,大数据基础,94), (1006,大数据基础,94))
scala> sort_math.take(5)
res24: Array[(String, String, Int)] = Array((1003,应用数学,100), (1004,应用数学,100), (1001,应用数学,96), (1002,应用数学,94), (1005,应用数学,94))
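The whole task (parse, sort, take) can be tried without the HDFS files by feeding a few in-memory lines through the same steps. The rows below are hypothetical sample data in the tab-separated "id, course, score" layout that the parsing code above expects; they are not the contents of the real score files. The sketch assumes a spark-shell session.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("top5-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical rows: student id, course name, score, separated by tabs.
val lines = sc.parallelize(List(
  "1001\tBigData\t90",
  "1002\tBigData\t94",
  "1003\tBigData\t100",
  "1004\tBigData\t99",
  "1005\tBigData\t90",
  "1006\tBigData\t94",
  "1007\tBigData\t100"
))

// Split each line on tabs and convert the score to Int so it sorts numerically.
val parsed = lines.map { x =>
  val fields = x.split("\t")
  (fields(0), fields(1), fields(2).toInt)
}

// Sort by score, highest first, and keep the first five rows.
val top5 = parsed.sortBy(_._3, ascending = false).take(5)
top5.foreach(println)
```

Converting the score with toInt matters: sorting the raw strings would place "99" above "100".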