Spark Growth Path: TF-IDF
Posted by Q博士
Introduction
TF-IDF is a text feature extraction algorithm. It is especially useful when assigning an article to a category.
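For reference, the weighting described in the Spark ML feature-extraction documentation can be written as below. The symbols are taken from that documentation, not from the original post: |D| is the number of documents in the corpus, DF(t, D) is the number of documents that contain term t, and TF(t, d) is the number of times t occurs in document d.

```latex
\begin{align}
\mathrm{IDF}(t, D) &= \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1} \\
\mathrm{TFIDF}(t, d, D) &= \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
\end{align}
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the feature vector, while rarer terms are weighted more heavily.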
Source code
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfExample {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to be supplied externally, e.g. by spark-submit
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into words
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    wordsData.show()

    // Convert each word list into a term-frequency feature vector (200 hash buckets)
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)
    val featurizedData = hashingTF.transform(wordsData)
    featurizedData.show()
    // alternatively, CountVectorizer can also be used to get term frequency vectors

    // Rescale the term frequencies by inverse document frequency
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.select("features", "label").show()
    rescaledData.show()

    spark.stop()
  }
}
Output
+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+

+-----+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(200,[105,129,157...|  0.0|
|(200,[9,13,89,95,...|  0.0|
|(200,[4,86,95,138...|  1.0|
+--------------------+-----+

+-----+--------------------+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+--------------------+
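
Because `show()` truncates the vectors, the actual weights are not visible in the output above. The following is a minimal sketch, not Spark's implementation, that recomputes TF-IDF for the same three sentences with plain Scala collections so the numbers can be inspected by hand. The object `ManualTfIdf` and its helper functions are hypothetical names introduced for illustration. It assumes the smoothed IDF formula log((|D| + 1) / (DF + 1)) from the Spark ML documentation, and it keys the weights by word rather than by hashed index, so it has none of the hash collisions that `HashingTF` can produce.

```scala
// Hypothetical standalone sketch: TF-IDF with plain Scala collections, no Spark required.
object ManualTfIdf {
  // Similar in spirit to Tokenizer: lowercase, then split on whitespace.
  def tokenize(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").toSeq

  // Raw term counts for one document.
  def termFrequencies(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  // Smoothed IDF (assumed formula): log((numDocs + 1) / (docFreq + 1)).
  def inverseDocumentFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
    val numDocs = docs.size
    docs.flatMap(_.distinct)          // count each word at most once per document
      .groupBy(identity)
      .map { case (word, hits) => word -> math.log((numDocs + 1.0) / (hits.size + 1.0)) }
  }

  // One map of word -> TF-IDF weight per document.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idf = inverseDocumentFrequencies(docs)
    docs.map(doc => termFrequencies(doc).map { case (word, tf) => word -> tf * idf(word) })
  }

  def main(args: Array[String]): Unit = {
    val sentences = Seq(
      "Hi I heard about Spark",
      "I wish Java could use case classes",
      "Logistic regression models are neat"
    )
    val weights = tfIdf(sentences.map(tokenize))
    sentences.zip(weights).foreach { case (sentence, scores) =>
      println(sentence)
      scores.toSeq.sortBy(_._1).foreach { case (term, score) =>
        println(f"  $term%-12s $score%.4f")
      }
    }
  }
}
```

In this sketch the word "i" occurs in two of the three sentences, so its weight is log(4/3), roughly 0.2877. Every other word occurs in exactly one sentence and gets log(4/2), roughly 0.6931.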