Differences between CountVectorizer and TfidfVectorizer in sklearn

Posted by bitcarmanlee


1. CountVectorizer

First, let's look at part of the CountVectorizer source code.

class CountVectorizer(_VectorizerMixin, BaseEstimator):
    """Convert a collection of text documents to a matrix of token counts

    This implementation produces a sparse representation of the counts using
    scipy.sparse.csr_matrix.

    If you do not provide an a-priori dictionary and you do not use an analyzer
    that does some kind of feature selection then the number of features will
    be equal to the vocabulary size found by analyzing the data.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

The first two lines of the docstring state the two core points of CountVectorizer:
"Convert a collection of text documents to a matrix of token counts" — CountVectorizer turns a collection of documents into a matrix of token counts.
"This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix." — the resulting count matrix is stored in the csr_matrix sparse format.

Let's test it with a simple demo:

from sklearn.feature_extraction.text import CountVectorizer

def t1():
    cv = CountVectorizer()
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    cv_fit = cv.fit_transform(train)
    print(cv.get_feature_names())  # in scikit-learn >= 1.0, use get_feature_names_out()
    print(cv_fit)
    print(cv_fit.toarray())


t1()

The output:

['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
  (0, 1)	2
  (0, 0)	1
  (1, 1)	2
  (1, 4)	1
  (2, 1)	1
  (2, 3)	1
  (3, 1)	1
  (3, 5)	1
  (3, 2)	1
[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]

All the documents together contain six distinct tokens, so get_feature_names returns a six-element list.
cv_fit is clearly stored in csr_matrix form: (0, 1) refers to row 0 (the first document) and column 1 (the token chinese), and the trailing 2 means chinese appears twice in that document.
Calling toarray converts the sparse representation into a dense matrix; since the vocabulary has six tokens, each document row has six dimensions.
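
To map a token to its column index without scanning the feature-name list, you can also inspect the fitted vocabulary_ attribute. A minimal, self-contained sketch:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["Chinese Beijing Chinese", "Chinese Chinese Shanghai"])

# vocabulary_ maps each token to its column index in the count matrix
print(cv.vocabulary_)  # e.g. {'chinese': 1, 'beijing': 0, 'shanghai': 2}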

2. TfidfVectorizer

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to :class:`CountVectorizer` followed by
    :class:`TfidfTransformer`.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

The difference between TfidfVectorizer and CountVectorizer:
CountVectorizer returns raw term counts, while TfidfVectorizer returns tf-idf values.

from sklearn.feature_extraction.text import TfidfVectorizer

def t2():
    tf = TfidfVectorizer(use_idf=True, smooth_idf=True, norm=None)
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    tf_fit = tf.fit_transform(train)
    print(tf.get_feature_names())
    print(tf_fit)
    print(tf_fit.toarray())


t2()

The output:

['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
  (0, 0)	1.916290731874155
  (0, 1)	2.0
  (1, 4)	1.916290731874155
  (1, 1)	2.0
  (2, 3)	1.916290731874155
  (2, 1)	1.0
  (3, 2)	1.916290731874155
  (3, 5)	1.916290731874155
  (3, 1)	1.0
[[1.91629073 2.         0.         0.         0.         0.        ]
 [0.         2.         0.         0.         1.91629073 0.        ]
 [0.         1.         0.         1.91629073 0.         0.        ]
 [0.         1.         1.91629073 0.         0.         1.91629073]]

3. How sklearn computes idf

The core call inside TfidfVectorizer that computes the tf-idf values:

self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                               smooth_idf=smooth_idf,
                               sublinear_tf=sublinear_tf)

Stepping into TfidfTransformer, we can read the source to see the actual computation:

    def __init__(self, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)

        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        X = check_array(X, accept_sparse=('csr', 'csc'))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = X.dtype if X.dtype in FLOAT_DTYPES else np.float64

        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)
            df = df.astype(dtype, **_astype_copy_false(df))

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            idf = np.log(n_samples / df) + 1
            self._idf_diag = sp.diags(idf, offsets=0,
                                      shape=(n_features, n_features),
                                      format='csr',
                                      dtype=dtype)

        return self
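
Note that fit only stores the idf values on the diagonal of _idf_diag; in this version of the source, the corresponding transform (not shown here) multiplies the count matrix by that diagonal, which scales each column by its term's idf. A minimal sketch of that effect, with hypothetical idf values:

import numpy as np
from scipy import sparse

# hypothetical per-term idf values for a 3-term vocabulary
idf = np.array([1.9, 1.0, 1.4])
idf_diag = sparse.diags(idf, offsets=0, format='csr')

counts = sparse.csr_matrix(np.array([[1, 2, 0],
                                     [0, 1, 3]]))

# right-multiplying by the diagonal scales column j by idf[j]: tf * idf
print((counts * idf_diag).toarray())
# [[1.9 2.  0. ]
#  [0.  1.  4.2]]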

From the code above, the idf is computed as follows.
When smooth_idf is True:

$$idf = \log\frac{1 + n_d}{1 + df} + 1$$

where $n_d$ is the total number of documents and $df$ is the number of documents containing the term.
When smooth_idf is False:

$$idf = \log\frac{n_d}{df} + 1$$
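
As a sanity check, these formulas reproduce the numbers from the TfidfVectorizer demo above (norm=None there, so the tf-idf is just tf * idf with no normalization). With n_d = 4 documents:

import numpy as np

n_d = 4  # total number of documents in the demo

# 'beijing': tf = 1 in document 0, appears in df = 1 document
idf_beijing = np.log((1 + n_d) / (1 + 1)) + 1
print(1 * idf_beijing)  # 1.916290731874155, matches (0, 0) above

# 'chinese': tf = 2 in document 0, appears in df = 4 documents
idf_chinese = np.log((1 + n_d) / (1 + 4)) + 1
print(2 * idf_chinese)  # 2.0, matches (0, 1) above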

4. csr_matrix explained

Since csr_matrix came up above, let's review how it works.
csr_matrix (Compressed Sparse Row matrix) is one way of representing a sparse matrix; its column-oriented counterpart is csc_matrix (Compressed Sparse Column matrix).

CSR compresses the matrix row by row, representing the original matrix with three arrays:

def csr_data():
    from scipy import sparse
    import numpy as np
    data = np.array([1, 2, 3, 4, 5, 6])
    indices = np.array([0, 2, 2, 0, 1, 2])
    indptr = np.array([0, 2, 3, 6])
    matrix = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
    print(matrix)
    print()
    print(matrix.todense())

csr_data()

The output:

  (0, 0)	1
  (0, 2)	2
  (1, 2)	3
  (2, 0)	4
  (2, 1)	5
  (2, 2)	6

[[1 0 2]
 [0 0 3]
 [4 5 6]]

Here, data holds all the non-zero values,
indices holds the column index of each non-zero value,
and indptr marks where each row's non-zeros start and end within data and indices.
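
In other words, row i's non-zero values are data[indptr[i]:indptr[i+1]], sitting at the columns given by the same slice of indices. A minimal sketch decoding the matrix above by hand:

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])

# indptr[i]:indptr[i+1] delimits row i's non-zeros in data and indices
for i in range(len(indptr) - 1):
    start, end = indptr[i], indptr[i + 1]
    for col, val in zip(indices[start:end], data[start:end]):
        print(f"row {i}, col {col}: {val}")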

def csc_data():
    from scipy import sparse
    import numpy as np
    data = np.array([1, 2, 3, 4, 5, 6])
    indices = np.array([0, 2, 2, 0, 1, 2])
    indptr = np.array([0, 2, 3, 6])
    matrix = sparse.csc_matrix((data, indices, indptr), shape=(3, 3))
    print(matrix)
    print()
    print(matrix.todense())

csc_data()

The output:

  (0, 0)	1
  (2, 0)	2
  (2, 1)	3
  (0, 2)	4
  (1, 2)	5
  (2, 2)	6

[[1 0 4]
 [0 0 5]
 [2 3 6]]

The only difference between csc_matrix and csr_matrix is that csr's indptr indexes rows while csc's indexes columns; feeding the same three arrays to csc_matrix therefore yields the transpose of the csr result above.
