Spark Growth Path (10): CountVectorizer

Posted by Q博士

CountVectorizer

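Introduction

CountVectorizer converts each document into a vector whose entries are the number of times each vocabulary word occurs in that document.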
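To make the idea concrete without Spark, the following is a minimal plain-Scala sketch of the two steps involved: building a vocabulary, then counting each vocabulary term per document. The names `CountVectorSketch`, `buildVocab`, and `vectorize` are hypothetical helpers for illustration, not part of the Spark API, and the alphabetical tie-breaking is an assumption of this sketch. The `vocabSize` and `minDF` arguments are meant to mirror the `setVocabSize` and `setMinDF` settings used in the Spark code in this post.

```scala
object CountVectorSketch {
  // Keep terms that appear in at least minDF documents, order them by
  // total corpus count (descending), and keep the top vocabSize terms.
  // Ties are broken alphabetically (an assumption of this sketch).
  def buildVocab(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Array[String] = {
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
    val termFreq = docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }
    termFreq.toSeq
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) }
      .take(vocabSize)
      .map(_._1)
      .toArray
  }

  // Count how often each vocabulary term occurs in one document.
  // Terms outside the vocabulary are ignored.
  def vectorize(vocab: Array[String], doc: Seq[String]): Array[Double] = {
    val index = vocab.zipWithIndex.toMap
    val counts = Array.fill(vocab.length)(0.0)
    doc.foreach(t => index.get(t).foreach(i => counts(i) += 1.0))
    counts
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "c"), Seq("a", "b", "b", "c", "a", "a"))
    val vocab = buildVocab(docs, vocabSize = 3, minDF = 2)
    docs.foreach(d => println(vectorize(vocab, d).mkString("[", ", ", "]")))
  }
}
```

For the two sample documents, the vocabulary comes out as a, b, c, and the second document maps to counts 3, 2, 1.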

Code

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.SparkSession

object CountVectorizerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // Two sample documents, each already tokenized into words
    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a", "a"))
    )).toDF("id", "words")

    // Fit a CountVectorizerModel from the corpus:
    // keep at most 3 terms, each appearing in at least 2 documents
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

    // Alternatively, define a CountVectorizerModel with an a-priori vocabulary
    // (shown for comparison; not used in the transform below)
    val cvm = new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")

    cvModel.transform(df).show(false)

    spark.stop()
  }
}


Output

+---+------------------+-------------------------+
|id |words             |features                 |
+---+------------------+-------------------------+
|0  |[a, b, c]         |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a, a]|(3,[0,1,2],[3.0,2.0,1.0])|
+---+------------------+-------------------------+
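The `features` column is printed in Spark's sparse vector notation: `(size, [indices], [values])`. As a minimal sketch of how to read it, the plain-Scala helper below expands that triple into a dense array. `SparseVectorSketch` and `toDense` are hypothetical names used only for illustration. With Spark on the classpath, `org.apache.spark.ml.linalg.Vectors.sparse(size, indices, values).toArray` should give the same result.

```scala
object SparseVectorSketch {
  // Expand a (size, indices, values) triple into a dense array;
  // positions not listed in indices stay at 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // Second row of the output table: (3,[0,1,2],[3.0,2.0,1.0])
    val dense = toDense(3, Array(0, 1, 2), Array(3.0, 2.0, 1.0))
    // Pair each count with its vocabulary term (index 0 = "a", 1 = "b", 2 = "c")
    Array("a", "b", "c").zip(dense).foreach { case (term, n) => println(s"$term -> $n") }
  }
}
```

Read this way, the second row says that document 1 contains "a" three times, "b" twice, and "c" once, which matches its `words` column.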
