Spark Streaming Checkpoint
Posted by 牵牛花
The Checkpoint Mechanism
From what we have seen of Spark Streaming so far, we know that a Spark Streaming application, unless stopped manually, keeps running indefinitely; in practice such applications typically run 24/7. Streaming must therefore be highly resilient to failures unrelated to the application logic, such as system errors or JVM crashes; in other words, it needs fault tolerance against errors that are not application bugs. Spark Streaming's checkpoint mechanism is designed for exactly this: it checkpoints enough information to a fault-tolerant storage system such as HDFS so that the application can recover quickly after a failure. Two kinds of data can be checkpointed:
(1) Metadata checkpointing
Saves the metadata of the streaming computation to fault-tolerant storage such as HDFS. Metadata checkpointing is used to recover when the node running the streaming application's driver fails. The metadata includes:
Configuration - the configuration used to create the streaming application
DStream operations - the set of DStream operations that define the streaming application
Incomplete batches - jobs that are queued but have not completed yet
(2) Data checkpointing
Saves generated RDDs to reliable external storage. This is essential for stateful transformations that combine data across multiple batches: each RDD produced by such a transformation depends on the RDDs of earlier batches, so the dependency chain grows longer and longer over time. The checkpoint mechanism cuts this chain by periodically checkpointing intermediate RDDs to reliable storage, so that recovery after a failure can resume from the checkpoint instead of recomputing the entire lineage.
In short, metadata checkpointing is mainly for recovering from driver failures, while data checkpointing is used to checkpoint stateful transformations.
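To make the second case concrete, here is a minimal sketch of data checkpointing, assuming a StreamingContext named ssc with a 1-second batch interval has already been created (the host, port, and checkpoint path are placeholders): a stateful word count with updateStateByKey requires a checkpoint directory, and the checkpoint interval of the stateful stream can also be tuned explicitly.

import org.apache.spark.streaming.Seconds

// Mandatory for stateful transformations: state RDDs are periodically
// written here, which cuts the ever-growing batch-to-batch lineage.
ssc.checkpoint("hdfs:///user/spark/checkpoint")

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Running count per word across all batches. Each batch's state RDD depends
// on the previous batch's state RDD, so without checkpointing both the
// dependency chain and the recovery time would grow without bound.
val runningCounts = pairs.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}

// Optionally override how often this stream's RDDs are checkpointed;
// the interval must be a multiple of the batch interval.
runningCounts.checkpoint(Seconds(10))
runningCounts.print()

By default Spark checkpoints a stateful stream at a multiple of the batch interval that is at least 10 seconds; the explicit checkpoint(Seconds(10)) call above simply makes that interval visible and tunable.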
References:
http://blog.csdn.net/wisgood/article/details/55667612
http://www.cnblogs.com/dt-zhw/p/5664663.html
The following complete example, adapted from Spark's official RecoverableNetworkWordCount, shows metadata checkpointing and driver recovery in action:

import java.io.File
import java.nio.charset.Charset

import com.google.common.io.Files

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}

/**
 * Counts words in text encoded with UTF8 received from the network every second.
 *
 * Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory> <output-file>
 *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
 *   data. <checkpoint-directory> is a directory on an HDFS-compatible file system to which
 *   checkpoint data is written. <output-file> is the file to which the word counts will be
 *   appended.
 *
 * <checkpoint-directory> and <output-file> must be absolute paths.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *
 *      `$ nc -lk 9999`
 *
 * and run the example as
 *
 *      `$ ./bin/run-example org.apache.spark.examples.streaming.RecoverableNetworkWordCount \
 *              localhost 9999 ~/checkpoint/ ~/out`
 *
 * If the directory ~/checkpoint/ does not exist (e.g. running for the first time), it will create
 * a new StreamingContext (and print "Creating new context" to the console). Otherwise, if
 * checkpoint data exists in ~/checkpoint/, the StreamingContext will be created from
 * the checkpoint data.
 *
 * Refer to the online documentation for more details.
 */
object RecoverableNetworkWordCount {

  def createContext(ip: String, port: Int, outputPath: String, checkpointDirectory: String)
    : StreamingContext = {

    // This runs only when the context is created for the first time. If the application
    // fails and restarts, the context is recovered from the checkpoint, createContext is
    // not invoked, and this line is not printed again.
    println("Creating new context")
    val outputFile = new File(outputPath)
    if (outputFile.exists()) outputFile.delete()
    val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount").setMaster("local[4]")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(checkpointDirectory)

    // Use a socket as the data source
    val lines = ssc.socketTextStream(ip, port)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
      val counts = "Counts at time " + time + " " + rdd.collect().mkString("[", ", ", "]")
      println(counts)
      println("Appending to " + outputFile.getAbsolutePath)
      Files.append(counts + "\n", outputFile, Charset.defaultCharset())
    })
    ssc
  }

  // Extractor that parses a String into an Int (returns None on invalid input)
  private object IntParam {
    def unapply(str: String): Option[Int] = {
      try {
        Some(str.toInt)
      } catch {
        case e: NumberFormatException => None
      }
    }
  }

  def main(args: Array[String]) {
    if (args.length != 4) {
      System.err.println("Your arguments were " + args.mkString("[", ", ", "]"))
      System.err.println(
        """
          |Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory>
          |     <output-file>. <hostname> and <port> describe the TCP server that Spark
          |     Streaming would connect to receive data. <checkpoint-directory> is a directory on
          |     an HDFS-compatible file system to which checkpoint data is written. <output-file>
          |     is the file to which the word counts will be appended.
          |
          |In local mode, <master> should be 'local[n]' with n > 1
          |Both <checkpoint-directory> and <output-file> must be absolute paths
        """.stripMargin
      )
      System.exit(1)
    }
    val Array(ip, IntParam(port), checkpointDirectory, outputPath) = args
    // getOrCreate either recreates the StreamingContext from the checkpoint data,
    // or, if no checkpoint exists, creates a brand new one via createContext
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => {
      createContext(ip, port, outputPath, checkpointDirectory)
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
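A practical caveat with checkpoint-based recovery (general Spark Streaming behavior, not specific to this example): getOrCreate can only restore a context from checkpoint data written by the same application code. If the application is upgraded with modified classes, deserializing the old checkpoint typically fails, so the checkpoint directory must be cleared before restarting, and any accumulated state is lost.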