Big Data Basics: Word Count (Word Frequency Statistics)

Posted 2021-01-26 Thinking in BigData

Counting word frequencies in a file is the "hello world" of the big data field. Here is how simple the implementation can be:

1 Single-machine processing on Linux

grep -oE "[[:alpha:]]+" test_word.log | sort | uniq -c | sort -rn | head -10
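
To make the stages of the pipeline above easier to follow, here is a minimal sketch that wraps the same steps in a shell function and runs it on the two-line sample file listed at the end of this post. The function name top_words and the temporary file are hypothetical, introduced only for illustration. The sketch also differs from the one-liner in two small ways: it sorts ties by word so the result is deterministic, and it strips the left padding that uniq -c adds.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post):
# print the N most frequent alphabetic words in a file.
top_words() {
    file=$1
    n=${2:-10}
    # 1. grep -oE prints every run of letters on its own line
    # 2. sort places identical words next to each other
    # 3. uniq -c collapses each run of identical lines into "count word"
    # 4. sort orders by count (descending), then by word to break ties
    # 5. head keeps the top N; awk drops uniq's left padding
    LC_ALL=C grep -oE '[[:alpha:]]+' "$file" \
        | LC_ALL=C sort \
        | uniq -c \
        | LC_ALL=C sort -k1,1nr -k2,2 \
        | head -n "$n" \
        | awk '{print $1, $2}'
}

# Recreate the sample file from this post and run the helper on it.
sample=$(mktemp)
printf 'hello world\nhello www\n' > "$sample"
top_words "$sample" 10
rm -f "$sample"
```

The first sort is required because uniq -c only merges adjacent identical lines; without it, repeated words that are not neighbours would be counted separately.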

2 Distributed processing with Spark (Scala, elegant and concise)

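import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
// Split each line on whitespace, count every word, sort by count descending, print the top 10
sc.textFile("test_word.log").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).take(10).foreach(println)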

3 Hadoop example

hadoop jar /path/hadoop-2.6.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /tmp/wordcount/input /tmp/wordcount/output

Appendix: the test file test_word.log contains the following:

hello world
hello www

The output is as follows (words with the same count may appear in a different order):

2 hello
1 world
1 www
