What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?

【Posted】: 2019-04-01 08:15:14 【Question】:

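In Tfidf.fit_transform we fit on the dataset using only the argument X and never use y. Is that correct? We generate the tfidf matrix from the training-set features alone, and y_train plays no part in fitting. So how do we make predictions on the test dataset?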

【Comments】:

datascience.stackexchange.com/a/12346/122 =) TfidfVectorizer is not used for prediction, which is why we don't use y_train, neither during fitting nor during transforming.

【Answer 1】:

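https://datascience.stackexchange.com/a/12346/122 explains well why you would call fit(), transform(), or fit_transform().

In short:

fit(): fits the vectorizer/model to the training data and keeps the learned vocabulary and IDF weights on the vectorizer object (returns the fitted sklearn.feature_extraction.text.TfidfVectorizer).

transform(): uses what fit() learned to convert data, typically the validation/test data, into a tf-idf matrix (returns scipy.sparse.csr.csr_matrix).

fit_transform(): often you want to transform the training data right away, so you run fit() and transform() together in one call, hence fit_transform() (returns scipy.sparse.csr.csr_matrix).


For example: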

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix  # the scipy.sparse.csr submodule path is deprecated in newer SciPy


# The *TfidfVectorizer* from sklearn expects list of strings as input.
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower()
sent1 = "Mr brown jumps over the lazy fox .".lower()
sent2 = "Roses are red , the chocolates are brown .".lower()
sent3 = "The frank dog jumps through the red roses .".lower()

dataset = [sent0, sent1, sent2, sent3]

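# Initialize the parameters of the vectorizer.
# `input` describes the kind of input ('content', 'file' or 'filename'), not the data itself,
# and an integer `min_df` must be at least 1 in recent scikit-learn releases.
vectorizer = TfidfVectorizer(input='content', analyzer='word', ngram_range=(1,1),
                     min_df=1, stop_words=None)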

[out]:

# Learns the vocabulary of vectorizer based on the initialized parameter.
>>> vectorizer =  vectorizer.fit(dataset)

# Apply the vectorizer to new sentence.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0.        , 0.31342551, 0.        , 0.38714286, 0.        ,
        0.        , 0.31342551, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.38714286, 0.51249178, 0.49104163]])

# Fit and transform the training data in one call (the vectorizer still keeps the fitted state).
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
    with 28 stored elements in Compressed Sparse Row format>

>>> vectorizer.fit_transform(dataset).toarray()
array([[0.        , 0.49642852, 0.        , 0.30659399, 0.30659399,
        0.        , 0.24821426, 0.30659399, 0.        , 0.30659399,
        0.38887561, 0.        , 0.        , 0.40586285, 0.        ],
       [0.        , 0.32107915, 0.        , 0.        , 0.39659663,
        0.        , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
        0.        , 0.        , 0.        , 0.26250325, 0.        ],
       [0.76012588, 0.24258925, 0.38006294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29964599, 0.29964599, 0.19833261, 0.        ],
       [0.        , 0.        , 0.        , 0.34049544, 0.        ,
        0.4318753 , 0.27566041, 0.        , 0.        , 0.        ,
        0.        , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])


>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>

>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
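As for the original question of how to predict on the test set: the vectorizer only builds features, and the labels are used by a separate estimator. Below is a minimal sketch of that workflow. The labels in `y_train`, the test sentence, and the choice of `LogisticRegression` are assumptions made purely for illustration; they are not part of the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training texts (same sentences as above) with made-up labels.
X_train = [
    "the quick brown fox jumps over the lazy brown dog .",
    "mr brown jumps over the lazy fox .",
    "roses are red , the chocolates are brown .",
    "the frank dog jumps through the red roses .",
]
y_train = [0, 0, 1, 1]  # hypothetical labels, for illustration only
X_test = ["the brown roses jumps through the chocolate dog ."]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training
# texts only; no labels are involved at this stage.
X_train_tfidf = vectorizer.fit_transform(X_train)

# transform reuses the learned vocabulary and IDF weights; nothing is
# re-fitted on the test texts, so both matrices share the same columns.
X_test_tfidf = vectorizer.transform(X_test)

# The classifier is the step that actually consumes y_train.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

print(X_train_tfidf.shape, X_test_tfidf.shape)
print(y_pred)
```

Calling fit_transform on the test texts instead would learn a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.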

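【Comments】:

Thanks for the explanation. You said fit_transform does not store the model, but the link you posted shows that it does.

Ah yes, sorry, I missed that detail. It does not return the model, but the vectorizer still stores it =)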
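The point made in these comments can be checked directly. The sketch below (variable names are only illustrative) compares fit() followed by transform() against a single fit_transform() call, and then confirms that the vectorizer used for fit_transform() keeps its fitted state and can transform new text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = [
    "the quick brown fox jumps over the lazy brown dog .",
    "mr brown jumps over the lazy fox .",
    "roses are red , the chocolates are brown .",
    "the frank dog jumps through the red roses .",
]

# Two steps: fit() returns the fitted vectorizer, transform() returns the matrix.
two_step = TfidfVectorizer().fit(dataset).transform(dataset)

# One step: fit_transform() returns only the matrix...
one_step_vectorizer = TfidfVectorizer()
one_step = one_step_vectorizer.fit_transform(dataset)

# ...but the result matches the two-step version.
same_result = np.allclose(two_step.toarray(), one_step.toarray())

# The fitted state still lives on the vectorizer object.
has_vocabulary = hasattr(one_step_vectorizer, "vocabulary_")
has_idf = hasattr(one_step_vectorizer, "idf_")

# So it can be reused on unseen text without fitting again.
new_doc = one_step_vectorizer.transform(["the lazy dog"])

print(same_result, has_vocabulary, has_idf, new_doc.shape)
```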
