Spark WordCount: Document Word Frequency Counting
Posted by yszd
1. Input Data
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
2. Implementation Code
```scala
package big.data.analyse.wordcount

import org.apache.spark.sql.SparkSession

/**
  * Created by zhen on 2019/3/9.
  */
object WordCount {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[2]")
      .getOrCreate()

    // Load the data
    val textRDD = spark.sparkContext.textFile("src/big/data/analyse/wordcount/wordcount.txt")

    val result = textRDD
      .map(row => row.replace(",", "")) // strip commas so "Java," and "Java" are not counted as different words
      .flatMap(row => row.split(" "))   // split each line into words
      .map(row => (row, 1))             // pair each word with an initial count of 1
      .reduceByKey(_ + _)               // sum the counts for each word

    // Print the result
    result.foreach(println)
  }
}
```
3. Results
```
(Spark,3)
(GraphX,1)
(graphs.,1)
(learning,1)
(general-purpose,1)
(Python,1)
(APIs,1)
(provides,1)
(that,1)
(is,1)
(a,2)
(R,1)
(high-level,1)
(general,1)
(processing,2)
(fast,1)
(including,1)
(higher-level,1)
(optimized,1)
(Apache,1)
(in,1)
(SQL,2)
(system.,1)
(Java,1)
(of,1)
(data,1)
(tools,1)
(cluster,1)
(also,1)
(graph,1)
(structured,1)
(execution,1)
(It,2)
(MLlib,1)
(for,3)
(Scala,1)
(an,1)
(computing,1)
(machine,1)
(supports,2)
(and,5)
(engine,1)
(set,1)
(rich,1)
(Streaming.,1)
```
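Note that the output still contains tokens with trailing periods, such as (system.,1) and (Streaming.,1), because the code only strips commas; mixed case would likewise split counts across keys like "It" and "it". A minimal normalization sketch (the `tokenize` helper below is hypothetical, not part of the original code) that lowercases each line and strips punctuation before counting:

```scala
// Hypothetical helper: lowercase a line, split on whitespace, and strip
// any character that is not a letter, digit, or hyphen, so "system." and
// "system" collapse into the same key.
def tokenize(line: String): Array[String] =
  line.toLowerCase
    .split("\\s+")
    .map(_.replaceAll("[^a-z0-9-]", ""))
    .filter(_.nonEmpty)

// With this helper, the RDD pipeline from section 2 would become:
//   textRDD.flatMap(tokenize).map((_, 1)).reduceByKey(_ + _)
```

The regex keeps hyphens so compound words like "high-level" and "general-purpose" stay intact as single tokens.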