Spark Growth Path - Word2Vec

Posted by Dr. Q (Q博士)

word2vec

Introduction

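Word2Vec maps each word in a text to a vector in a K-dimensional space. In Spark ML it is an Estimator: fitting it on sequences of words produces a Word2VecModel, which maps every word to a fixed-size vector and represents each document as the average of the vectors of its words.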
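
To make the averaging step concrete, here is a minimal sketch in plain Scala with no Spark dependency. The word table, its 3-dimensional values, and the names AverageWordVectors and documentVector are hypothetical and exist only for illustration; a real model learns the word vectors during training. How out-of-vocabulary words are treated here (skipped in the sum, still counted in the divisor) is an assumption of this sketch.

```scala
object AverageWordVectors {
  // Hypothetical 3-dimensional word vectors; a trained model would learn these values.
  val wordVectors: Map[String, Array[Double]] = Map(
    "hi"    -> Array(1.0, 0.0, 2.0),
    "spark" -> Array(3.0, 2.0, 0.0)
  )

  /** Averages the vectors of the words in a document, element by element.
    * Assumption of this sketch: words missing from the table add nothing to
    * the sum but still count towards the number of words we divide by.
    */
  def documentVector(words: Seq[String], size: Int): Array[Double] = {
    val sum = Array.fill(size)(0.0)
    words.foreach { w =>
      wordVectors.get(w).foreach { v =>
        (0 until size).foreach(i => sum(i) += v(i))
      }
    }
    // An empty document keeps the all-zero vector.
    if (words.isEmpty) sum else sum.map(_ / words.size)
  }

  def main(args: Array[String]): Unit = {
    // prints [2.0, 1.0, 1.0]
    println(documentVector(Seq("hi", "spark"), 3).mkString("[", ", ", "]"))
  }
}
```

The document vector always has the same length as the word vectors, so documents of different lengths become comparable fixed-size features.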

Code

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Row, SparkSession}

object Word2VecExample {

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("Word2VecExample").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    // Input data: Each row is a bag of words from a sentence or document.
    val documentDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    // Learn a mapping from words to Vectors.
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(6) // dimension K of each vector
      .setMinCount(0)   // keep every word, however rare
    val model = word2Vec.fit(documentDF)
    val result = model.transform(documentDF)
    result.show()
    // Print each document next to its vector.
    result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
      println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
    }
    spark.stop()
  }
}

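The example above only prints per-document vectors. As a hedged sketch, the fitted model can also be inspected word by word with getVectors and findSynonyms, both of which are methods of Spark ML's Word2VecModel. The object name Word2VecInspect, the helper fitModel, the local[*] master, and the seed value 42 are assumptions made here so the snippet runs on its own; they are not part of the original example.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

object Word2VecInspect {

  /** Trains on the same three sentences as the example above (hypothetical helper). */
  def fitModel(spark: SparkSession): Word2VecModel = {
    val documentDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(6)
      .setMinCount(0)
      .setSeed(42L) // assumption: a fixed seed, to reduce run-to-run variation
      .fit(documentDF)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local[*] so the sketch can run without spark-submit.
    val spark = SparkSession.builder()
      .appName("Word2VecInspect")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val model = fitModel(spark)

    // One row per vocabulary word, with columns "word" and "vector".
    model.getVectors.show(false)

    // The two words closest to "Spark" by cosine similarity,
    // with columns "word" and "similarity".
    model.findSynonyms("Spark", 2).show(false)

    spark.stop()
  }
}
```

With a corpus of only three sentences the nearest neighbours carry little meaning; the calls are shown for their shape, not for the quality of the synonyms.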

Output

Text: [Hi, I, heard, about, Spark] => 
Vector: [0.0068203588947653776,0.017414073273539544,0.008097704406827689,-0.034566799923777584,-0.004852301999926568,0.022082760557532312]

Text: [I, wish, Java, could, use, case, classes] => 
Vector: [0.045732982855822356,-2.3274788899081092E-4,0.032252547198108265,0.0015899876930883952,-0.020712170167826116,0.016202476141708236]

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.02979586571455002,0.029230652749538424,-0.03639255976304412,-3.955196589231491E-4,-0.00870799645781517,-0.03496376480907202]
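
One common use of such document vectors is comparing documents. The sketch below, in plain Scala, computes the cosine similarity between the three vectors printed above. The object name DocumentSimilarity and the helper cosine are hypothetical, and choosing cosine similarity as the measure is an assumption of this sketch, not something the original example does.

```scala
object DocumentSimilarity {
  // The three document vectors, copied from the output above.
  val doc1 = Array(0.0068203588947653776, 0.017414073273539544, 0.008097704406827689,
    -0.034566799923777584, -0.004852301999926568, 0.022082760557532312)
  val doc2 = Array(0.045732982855822356, -2.3274788899081092E-4, 0.032252547198108265,
    0.0015899876930883952, -0.020712170167826116, 0.016202476141708236)
  val doc3 = Array(-0.02979586571455002, 0.029230652749538424, -0.03639255976304412,
    -3.955196589231491E-4, -0.00870799645781517, -0.03496376480907202)

  /** Cosine of the angle between two vectors of equal length; 0.0 if either is all zeros. */
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    println(f"doc1 vs doc2: ${cosine(doc1, doc2)}%.4f")
    println(f"doc1 vs doc3: ${cosine(doc1, doc3)}%.4f")
    println(f"doc2 vs doc3: ${cosine(doc2, doc3)}%.4f")
  }
}
```

Because the model was trained on three short sentences, these similarity values say little about meaning; on a realistic corpus the same computation gives a usable measure of how close two documents are.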
