SPARK


Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs were replaced by Datasets, which are strongly typed like RDDs but come with richer optimizations under the hood. The RDD interface is still supported, and you can find a more complete reference in the RDD programming guide. However, we highly recommend that you switch to Datasets, which have better performance than RDDs. See the SQL programming guide for more information about Datasets.


scala> val text=spark.read.textFile("/tmp/20171024/tian.txt")
text: org.apache.spark.sql.Dataset[String] = [value: string]

scala> text.count
res0: Long = 6

scala> val text=sc.textFile("/tmp/20171024/tian.txt")
text: org.apache.spark.rdd.RDD[String] = /tmp/20171024/tian.txt MapPartitionsRDD[7] at textFile at <console>:24

scala> text.count
res1: Long = 6
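
The two objects above are views of the same file: a Dataset can expose its underlying RDD, and an RDD of a supported element type can be turned into a Dataset. A minimal sketch, assuming the spark-shell session above (where spark.implicits._ is already imported):

// Get the underlying RDD from a Dataset read with spark.read.textFile
val ds = spark.read.textFile("/tmp/20171024/tian.txt")
val asRdd: org.apache.spark.rdd.RDD[String] = ds.rdd

// Convert an RDD[String] back to a Dataset[String]
// (outside the shell this needs: import spark.implicits._)
val rdd = sc.textFile("/tmp/20171024/tian.txt")
val asDs: org.apache.spark.sql.Dataset[String] = rdd.toDS()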

You can get values from a Dataset directly by calling some actions, or transform the Dataset to get a new one. For more details, please read the API doc.
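
As a sketch of what that looks like on the Dataset above (the search term "spark" is just an assumed example; in spark-shell the needed encoders come from the pre-imported spark.implicits._):

// Re-read the file as a Dataset[String]
val ds = spark.read.textFile("/tmp/20171024/tian.txt")

// Transformations are lazy and return a new Dataset
val lineLengths = ds.map(_.length)                 // Dataset[Int]
val sparkLines  = ds.filter(_.contains("spark"))   // assumed search term

// Actions compute a result and return it to the driver
val totalChars = lineLengths.reduce(_ + _)         // sum of line lengths
val firstLine  = ds.first()                        // first line as a String
val nMatches   = sparkLines.count()                // number of matching lines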

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our text dataset to be cached:

scala> text.cache()
res2: text.type = /tmp/20171024/tian.txt MapPartitionsRDD[7] at textFile at <console>:24

scala> text.count
res3: Long = 6

It may seem silly to use Spark to explore and cache a small text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes.
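
As a sketch of where caching pays off, repeated queries over the same cached data only hit storage once (the word list below is just a hypothetical example):

// Cache once, then query repeatedly; without cache() each count()
// would re-read the file from storage.
val ds = spark.read.textFile("/tmp/20171024/tian.txt").cache()

val words = Seq("spark", "hadoop", "scala")   // hypothetical search terms
words.foreach { w =>
  val n = ds.filter(_.contains(w)).count()
  println(s"$w appears in $n lines")
}

ds.unpersist()   // drop the cached copy when it is no longer needed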
