Spark Learning Path (10) - CountVectorizer

Posted 2022-12-12 Q博士

CountVectorizer

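Introduction

CountVectorizer converts each document into a vector whose entries are the number of times each vocabulary word occurs in that document.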
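
To make the idea concrete, here is a minimal sketch in plain Scala (no Spark) of the counting step. It is a simplification under stated assumptions, not Spark's implementation: the names `CountSketch`, `buildVocab`, and `countVector` are hypothetical, the vocabulary is ordered by total corpus frequency with ties broken alphabetically, and options such as minDF and vocabSize are ignored.

```scala
object CountSketch {
  // Hypothetical helper: vocabulary ordered by total corpus frequency,
  // most frequent first; ties are broken alphabetically (an assumption).
  def buildVocab(docs: Seq[Seq[String]]): Array[String] =
    docs.flatten
      .groupBy(identity)
      .map { case (term, occurrences) => (term, occurrences.size) }
      .toSeq
      .sortBy { case (term, count) => (-count, term) }
      .map(_._1)
      .toArray

  // Hypothetical helper: one count per vocabulary term, 0.0 if the term is absent.
  def countVector(doc: Seq[String], vocab: Array[String]): Array[Double] = {
    val counts = doc.groupBy(identity).map { case (t, occ) => (t, occ.size.toDouble) }
    vocab.map(t => counts.getOrElse(t, 0.0))
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "c"), Seq("a", "b", "b", "c", "a", "a"))
    val vocab = buildVocab(docs)
    // Print the count vector of each document
    docs.foreach(d => println(countVector(d, vocab).mkString("[", ", ", "]")))
  }
}
```

For the two documents used in this post, the sketch yields the vocabulary a, b, c and the count vectors [1.0, 1.0, 1.0] and [3.0, 2.0, 1.0].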

Code

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.SparkSession

object CountVectorizerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a", "a"))
    )).toDF("id", "words")

    // fit a CountVectorizerModel from the corpus
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3) // keep at most 3 terms in the vocabulary
      .setMinDF(2)     // a term must appear in at least 2 documents
      .fit(df)

    // alternatively, define CountVectorizerModel with a-priori vocabulary
    val cvm = new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")

    cvModel.transform(df).show(false)

    spark.stop()
  }
}

Output

+---+------------------+-------------------------+
|id |words             |features                 |
+---+------------------+-------------------------+
|0  |[a, b, c]         |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a, a]|(3,[0,1,2],[3.0,2.0,1.0])|
+---+------------------+-------------------------+
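
The features column is printed in Spark's sparse vector notation: (size, [indices], [values]). Each index refers to a position in the model's vocabulary (available as cvModel.vocabulary), and each value is the count of that term. The plain-Scala sketch below expands such a triple into a dense array; `SparseVectorSketch` and `toDense` are hypothetical names used only for illustration, not part of Spark.

```scala
object SparseVectorSketch {
  // Hypothetical helper: expand (size, indices, values) into a dense array.
  // Positions that are not listed in `indices` stay at 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // Row with id 1 from the output: (3,[0,1,2],[3.0,2.0,1.0])
    val dense = toDense(3, Array(0, 1, 2), Array(3.0, 2.0, 1.0))
    println(dense.mkString("[", ", ", "]"))
  }
}
```

Read this way, the row with id 1 contains the term at index 0 three times, the term at index 1 twice, and the term at index 2 once, which matches the words [a, b, b, c, a, a].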
