Growing with Spark (10): CountVectorizer
Posted by Dr. Q (Q博士)
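Introduction
CountVectorizer converts each document into a vector of term counts: every position in the vector corresponds to one word of the vocabulary, and its value is the number of times that word occurs in the document.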
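The counting idea can be sketched in plain Scala without Spark. This is only an illustration, not Spark's implementation: the names `CountVectorSketch`, `buildVocabulary`, and `vectorize` are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch. It keeps terms that occur in at least `minDF` documents, orders them by corpus-wide frequency, and then counts occurrences per document.

```scala
object CountVectorSketch {
  // Build a vocabulary: keep terms found in at least minDF documents,
  // order them by total corpus frequency (descending), keep the top vocabSize.
  def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Array[String] = {
    // Number of documents each term appears in.
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
    // Number of occurrences of each term across the whole corpus.
    val termFreq = docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }
    termFreq.toSeq
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) } // ties broken alphabetically (assumption of this sketch)
      .take(vocabSize)
      .map(_._1)
      .toArray
  }

  // Count how often each vocabulary term occurs in a single document.
  def vectorize(doc: Seq[String], vocabulary: Array[String]): Array[Double] = {
    val counts = doc.groupBy(identity).map { case (t, xs) => t -> xs.size.toDouble }
    vocabulary.map(t => counts.getOrElse(t, 0.0))
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "c"), Seq("a", "b", "b", "c", "a", "a"))
    val vocab = buildVocabulary(docs, vocabSize = 3, minDF = 2)
    docs.foreach(d => println(vectorize(d, vocab).mkString("[", ", ", "]")))
  }
}
```

For the two sample documents this prints `[1.0, 1.0, 1.0]` and `[3.0, 2.0, 1.0]`.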
Code
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.SparkSession

object CountVectorizerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a", "a"))
    )).toDF("id", "words")

    // Fit a CountVectorizerModel from the corpus: keep at most 3 terms,
    // each of which must appear in at least 2 documents.
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

    // Alternatively, define a CountVectorizerModel with an a-priori vocabulary
    // (shown for illustration; it is not used below).
    val cvm = new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")

    cvModel.transform(df).show(false)

    spark.stop()
  }
}
Output
+---+------------------+-------------------------+
|id |words             |features                 |
+---+------------------+-------------------------+
|0  |[a, b, c]         |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a, a]|(3,[0,1,2],[3.0,2.0,1.0])|
+---+------------------+-------------------------+
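Each `features` value is printed as a sparse vector in the form `(size, [indices], [values])`. The following plain-Scala sketch shows one way to read such a triple. The `Sparse` case class is a hypothetical stand-in rather than Spark's own vector type, and the vocabulary order `a, b, c` is assumed from the term frequencies in the sample data.

```scala
object SparseVectorSketch {
  // Hypothetical stand-in for the (size, indices, values) triple shown in the output.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) {
    // Expand to a dense array; positions not listed in `indices` stay 0.0.
    def toDense: Array[Double] = {
      val dense = Array.fill(size)(0.0)
      indices.zip(values).foreach { case (i, v) => dense(i) = v }
      dense
    }
  }

  def main(args: Array[String]): Unit = {
    // Row with id 1 from the output: (3,[0,1,2],[3.0,2.0,1.0])
    val row1 = Sparse(3, Array(0, 1, 2), Array(3.0, 2.0, 1.0))
    val vocabulary = Array("a", "b", "c") // assumed vocabulary order
    vocabulary.zip(row1.toDense).foreach { case (term, count) => println(s"$term -> $count") }
  }
}
```

Read this way, the second row says that "a" occurs three times, "b" twice, and "c" once.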