Confused with the return result of TfidfVectorizer.fit_transform

Posted: 2018-11-27 02:34:39

【Question】

I want to learn more about NLP and came across this piece of code. But when the result is printed, I am confused by what TfidfVectorizer.fit_transform returns. I am familiar with what tf-idf is, but I do not understand what these numbers mean.

import tensorflow as tf
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import io
import string
import requests
import csv
import nltk
from zipfile import ZipFile

sess = tf.Session()

batch_size = 100
max_features = 1000

save_file_name = os.path.join('smsspamcollection', 'SMSSpamCollection.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)

else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')

    # Format data 
    text_data = file.decode()
    text_data = text_data.encode('ascii', errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x) >= 1]

    # And write to csv 
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)

texts = [x[1] for x in text_data]   # message text
target = [x[0] for x in text_data]  # label string: 'ham' or 'spam'
target = [1 if x == 'spam' else 0 for x in target]  # encode spam as 1, ham as 0

# Normalize the text
texts = [x.lower() for x in texts]  # lower
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]  # remove punctuation
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]  # remove numbers
texts = [' '.join(x.split()) for x in texts]  # trim extra whitespace


def tokenizer(text):
    words = nltk.word_tokenize(text)
    return words


tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)
print(sparse_tfidf_texts)

The output is:

  (0, 630)    0.37172623140154337
  (0, 160)    0.36805562944957004
  (0, 38)     0.3613966215413548
  (0, 545)    0.2561101665717327
  (0, 326)    0.2645280991765623
  (0, 967)    0.3277447602873963
  (0, 421)    0.3896274380321477
  (0, 227)    0.28102915589024796
  (0, 323)    0.22032541100275282
  (0, 922)    0.2709848154866997
  (1, 577)    0.4007895093299793
  (1, 425)    0.5970064521899725
  (1, 943)    0.6310763941180291
  (1, 878)    0.29102173465492637
  (2, 282)    0.1771481430848552
  (2, 243)    0.5517018054305785
  (2, 955)    0.2920174942032025
  (2, 138)    0.30143666813167863
  (2, 946)    0.2269933441326121
  (2, 165)    0.3051095293405041
  (2, 268)    0.2820392223588522
  (2, 780)    0.24119626642264894
  (2, 823)    0.1890454397278538
  (2, 674)    0.256251970757827
  (2, 874)    0.19343834015314287
  :           :
  (5569, 648) 0.24171652492226922
  (5569, 123) 0.23011909339432202
  (5569, 957) 0.24817919217662862
  (5569, 549) 0.28583789844730134
  (5569, 863) 0.3026729783085827
  (5569, 844) 0.20228305447951195
  (5569, 146) 0.2514415602877767
  (5569, 595) 0.2463259875380789
  (5569, 511) 0.3091904754885042
  (5569, 230) 0.2872728684768659
  (5569, 638) 0.34151390143548765
  (5569, 83)  0.3464271621701711
  (5570, 370) 0.4199910200421362
  (5570, 46)  0.48234172093857797
  (5570, 317) 0.4171646676697801
  (5570, 281) 0.6456993475093024
  (5572, 282) 0.25540827228532487
  (5572, 385) 0.36945842040023935
  (5572, 448) 0.25540827228532487
  (5572, 931) 0.3031800542518209
  (5572, 192) 0.29866989620926737
  (5572, 303) 0.43990016711221736
  (5572, 87)  0.45211284173737176
  (5572, 332) 0.3924202767503492
  (5573, 866) 1.0

I would really appreciate it if someone could explain this output.

【Question comments】:

【Answer 1】:

Note that you are printing a sparse matrix, so the output looks different from what you would see when printing a standard dense matrix. The main components are:

- Each tuple is (document_id, token_id).
- The value following the tuple is the tf-idf score of that token in that document.
- Tuples that are not listed have a tf-idf score of 0.
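A quick way to check the last point is to densify one row; positions that do not appear as tuples come back as zeros. A minimal sketch, reusing sparse_tfidf_texts from the code above:

import numpy as np

row = sparse_tfidf_texts[0]            # first document, still a sparse row
dense_row = row.toarray().ravel()      # one entry per vocabulary feature (at most max_features = 1000)

print(dense_row[630])                  # ~0.3717, the (0, 630) entry shown above
print(np.count_nonzero(dense_row))     # only tokens that occur in document 0 are non-zero (10 of them in the output above)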

If you want to find the token that corresponds to a token_id, have a look at the get_feature_names method.
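For example, a minimal sketch that maps the token ids of document 0 back to their tokens (on recent scikit-learn releases the method is named get_feature_names_out instead):

feature_names = tfidf.get_feature_names()   # index -> token (use tfidf.get_feature_names_out() on newer versions)

# look up a few of the (0, token_id) entries from the output above
for token_id in [630, 160, 38]:
    print(token_id, feature_names[token_id], sparse_tfidf_texts[0, token_id])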

【Comments】:

Now everything is crystal clear. It is intuitive and makes sense now. Thanks!

It is strange that this is not properly documented. Thanks.
