Spark Shell简单使用
Posted csguo
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Spark Shell简单使用相关的知识,希望对你有一定的参考价值。
基础
Spark的shell作为一个强大的交互式数据分析工具,提供了一个简单的方式学习API。它可以使用Scala(在Java虚拟机上运行现有的Java库的一个很好方式)或Python。在Spark目录里使用下面的方式开始运行:
- ./bin/spark-shell
- ./bin/spark-shell --master local[4]
- ./bin/spark-shell --master local[4] --jars code.jar
- scala> val textFile = sc.textFile("file:///home/hadoop/hadoop/spark/README.md")
- 16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(217040) called with curMem=321016, maxMem=280248975
- 16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 212.0 KB, free 266.8 MB)
- 16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(20024) called with curMem=538056, maxMem=280248975
- 16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.6 KB, free 266.7 MB)
- 16/07/24 03:30:53 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:43303 (size: 19.6 KB, free: 267.2 MB)
- 16/07/24 03:30:53 INFO spark.SparkContext: Created broadcast 2 from textFile at <console>:21
- textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at textFile at <console>:21
- cp log4j.properties.template log4j.properties
- vim log4j.properties
2. 另外,file:///home/hadoop/hadoop/spark/README.md,首部的file代表本地目录,注意file:后有三个斜杠(/);中间红色部分是我的spark安装目录,读者可根据自己的情况进行替换。
RDD的actions从RDD中返回值,transformations可以转换成一个新RDD并返回它的引用。下面展示几个action:
- scala> textFile.count()
- res0: Long = 98
- scala> textFile.first()
- res1: String = # Apache Spark
下面使用一个transformation,我们将使用filter函数对textFile这个RDD进行过滤,取出包含字符串"Spark"的行,并返回一个新的RDD:
- scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
- linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
- scala> textFile.filter(line => line.contains("Spark")).count()
- res2: Long = 19
更多RDD操作
RDD actions和transformations能被用在更多的复杂计算中。比如想要找到一行中最多的单词数量:
- scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
- res3: Int = 14
- scala> import java.lang.Math
- import java.lang.Math
- scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
- res4: Int = 14
- scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
- wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24
- scala> wordCounts.collect()
- res5: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (have,1), (Try,1), (computation,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (Data.,1), (>...
缓存
Spark支持把数据集缓存到内存之中,当要重复访问时,这是非常有用的。举一个简单的例子:
- scala> linesWithSpark.cache()
- res6: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23
- scala> linesWithSpark.count()
- res7: Long = 19
- scala> linesWithSpark.count()
- res8: Long = 19
- scala> linesWithSpark.count()
- res9: Long = 19