The Road to Spark: TF-IDF

Posted by Q博士

TF-IDF

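Introduction

TF-IDF (term frequency-inverse document frequency) is a text feature extraction algorithm. It is especially useful when assigning an article to a category, that is, for text classification.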
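
For reference, the weighting can be written out as below. This follows the definition given in the Spark MLlib documentation, where TF(t, d) is the number of times term t occurs in document d, DF(t, D) is the number of documents in corpus D that contain t, and |D| is the number of documents. The logarithm is the natural log, and the +1 terms are smoothing so that a term absent from the corpus does not cause a division by zero.

```latex
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
\qquad
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
```

For a corpus of three documents (|D| = 3), a term that occurs in only one document gets IDF = ln(4/2) ≈ 0.693, while a term that occurs in two documents gets ln(4/3) ≈ 0.288. Rarer terms therefore receive larger weights.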

Source code

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfExample {

  def main(args: Array[String]): Unit = {
    // The master URL is taken from spark-submit; for a local run, add .master("local[*]")
    val spark = SparkSession.builder().appName("TfIdfExample").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into lowercase words
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    wordsData.show()

    // Hash the words of each sentence into a 200-dimensional term-frequency vector
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)

    val featurizedData = hashingTF.transform(wordsData)
    featurizedData.show()
    // alternatively, CountVectorizer can also be used to get term frequency vectors

    // Fit IDF on the whole corpus, then rescale the raw term frequencies
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)

    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.select("features", "label").show()
    rescaledData.show()

    spark.stop()
  }
}

Output

+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+

+-----+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(200,[105,129,157...|  0.0|
|(200,[9,13,89,95,...|  0.0|
|(200,[4,86,95,138...|  1.0|
+--------------------+-----+

+-----+--------------------+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+--------------------+
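
The truncated output above does not show the actual weights, so the following plain-Scala sketch recomputes the IDF values for the same three sentences without Spark. It is an illustration only: the names TfIdfByHand, df, idf and tfIdf are hypothetical, it works on the words directly rather than on hashed indices, and it assumes the smoothed formula ln((|D| + 1) / (DF + 1)) from the Spark MLlib documentation. In the real pipeline, HashingTF with only 200 buckets may map two different words to the same index, in which case their counts would be merged.

```scala
// Plain-Scala sketch (no Spark): IDF weights for the three example sentences.
object TfIdfByHand {
  // Tokenize the way Tokenizer does: lowercase, then split on whitespace.
  val docs: Seq[Seq[String]] = Seq(
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic regression models are neat"
  ).map(_.toLowerCase.split("\\s+").toSeq)

  // Document frequency: number of documents containing the term at least once.
  def df(term: String): Int = docs.count(_.contains(term))

  // Smoothed IDF (assumed formula): ln((|D| + 1) / (DF + 1)).
  def idf(term: String): Double =
    math.log((docs.size + 1.0) / (df(term) + 1.0))

  // TF-IDF of a term within one document: raw count times IDF.
  def tfIdf(term: String, doc: Seq[String]): Double =
    doc.count(_ == term) * idf(term)

  def main(args: Array[String]): Unit = {
    val idfSpark = idf("spark") // "spark" occurs in 1 of 3 sentences
    val idfI = idf("i")         // "i" occurs in 2 of 3 sentences
    println(f"idf(spark) = $idfSpark%.4f")
    println(f"idf(i)     = $idfI%.4f")
  }
}
```

A word that occurs in two of the three sentences, such as "i", ends up with a smaller weight than a word that occurs in only one, which is the down-weighting of common terms that the IDF stage is meant to provide.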
