Compute tf-idf with a corpus

【Title】: compute tf-idf with corpus 【Posted】: 2014-04-21 11:04:37 【Question】:

So, I copied some source code that shows how to build a system that can run tf-idf. The code is as follows:

    #module import
    from __future__ import division, unicode_literals
    import math
    import string
    import re
    import os

    from text.blob import TextBlob as tb
    # create a new (empty) list
    words = []
    def tf(word, blob):
       return blob.words.count(word) / len(blob.words)

    def n_containing(word, bloblist):
       return sum(1 for blob in bloblist if word in blob)

    def idf(word, bloblist):
       return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

    def tfidf(word, blob, bloblist):
       return tf(word, blob) * idf(word, bloblist)

    regex = re.compile('[%s]' % re.escape(string.punctuation))

    f = open('D:/article/sport/a.txt','r')
    var = f.read()
    var = regex.sub(' ', var)
    var = var.lower()

    document1 = tb(var)

    f = open('D:/article/food/b.txt','r')
    var = f.read()
    var = var.lower()
    document2 = tb(var)


    bloblist = [document1, document2]
    for i, blob in enumerate(bloblist):
        print("Top words in document {}".format(i + 1))
        # score every word of this document against the whole document list
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words[:50]:
            print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

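However, my problem is that I want to put all the files in the sport folder into one corpus and the food articles in the food folder into another corpus, so that the system gives one result per corpus. Right now I can only compare individual files, but I want to compare corpora. Sorry for asking this question; any help would be appreciated.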

Thanks

【Comments】:

I accidentally pressed the button :p 【Answer 1】:

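As I understand it, you want to compute the word frequencies of two files and store them in separate files so that you can compare them. You can use the terminal for that. Below is simple code that counts word frequencies: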

import string
import collections
import operator
keywords = []
i=0
def removePunctuation(sentence):
    sentence = sentence.lower()
    new_sentence = ""
    for char in sentence:
        if char not in string.punctuation:
                new_sentence = new_sentence + char
    return new_sentence
def wordFrequences(sentence):
    # count how often each word occurs in the punctuation-free text
    wordFreq = {}
    split_sentence = sentence.split()
    for word in split_sentence:
        wordFreq[word] = wordFreq.get(word,0) + 1
  # od = collections.OrderedDict(sorted(wordFreq.items(),reverse=True))
  # print od
    sorted_x= sorted(wordFreq.iteritems(), key=operator.itemgetter(1),reverse = True)
    print sorted_x
    for key, value in sorted_x:
        keywords.append(key)
    print keywords
f = open('D:/article/sport/a.txt','r')
sentence = f.read()
# sentence = "The first test of the function some some some some"
new_sentence = removePunctuation(sentence)
wordFrequences(new_sentence)

You have to run this code twice, changing the path of the text file each time, and each time run it from the console with a command like this:

python abovecode.py > destinationfile.txt

In your case, for example:

python abovecode.py > sportfolder/file1.txt
python abovecode.py > foodfolder/file2.txt
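Rather than editing the path inside the script before each run, the counting step could take the file path as a parameter. The following is a minimal Python 3 sketch of that idea, not the code from the answer: the names `word_counts` and `print_frequencies` are hypothetical, and punctuation is deleted the same way `removePunctuation` does it.

```python
import string
import sys
from collections import Counter


def word_counts(text):
    """Return a Counter of lowercase words with ASCII punctuation deleted."""
    table = str.maketrans('', '', string.punctuation)
    return Counter(text.lower().translate(table).split())


def print_frequencies(path, out=sys.stdout):
    """Write one 'word count' line per word, most frequent first."""
    with open(path, 'r') as handle:
        counts = word_counts(handle.read())
    # most_common() yields (word, count) pairs sorted by descending count
    for word, count in counts.most_common():
        out.write('{} {}\n'.format(word, count))
```

If a script ends with `print_frequencies(sys.argv[1])`, the path can be given on the command line and the output redirected per file exactly as in the commands above, without touching the source between runs.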

Important: if you want the words together with their frequencies, omit the line

print keywords

Important: if you only need the words, ordered by their frequency, omit the line

print sorted_x
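To compare whole folders rather than single files, as the question asks, one possible approach is to merge every file of a folder into one large "document" and then apply the question's tf-idf formulas across those merged documents. The sketch below assumes Python 3, plain `.txt` files encoded as UTF-8, and no TextBlob; the function names `tokenize`, `load_corpus`, `tfidf_per_corpus` and `top_words` are hypothetical.

```python
import math
import os
import re
import string
from collections import Counter

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def tokenize(text):
    """Replace punctuation with spaces, lowercase, and split into words."""
    return PUNCTUATION.sub(' ', text).lower().split()


def load_corpus(folder):
    """Merge the words of every .txt file in `folder` into one Counter."""
    counts = Counter()
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            # Assumption: the files are UTF-8 encoded
            with open(os.path.join(folder, name), encoding='utf-8') as handle:
                counts.update(tokenize(handle.read()))
    return counts


def tfidf_per_corpus(corpora):
    """Map each corpus name to {word: tf-idf}; `corpora` maps name -> Counter."""
    scores = {}
    for name, counts in corpora.items():
        total = sum(counts.values())
        scores[name] = {}
        for word, count in counts.items():
            n_containing = sum(1 for other in corpora.values() if word in other)
            # Same smoothing as the question's idf(): log(N / (1 + n))
            idf = math.log(len(corpora) / (1 + n_containing))
            scores[name][word] = (count / total) * idf
    return scores


def top_words(word_scores, n=50):
    """Return the n highest-scoring (word, score) pairs."""
    return sorted(word_scores.items(), key=lambda item: item[1], reverse=True)[:n]
```

With the folders from the question this could be called as `tfidf_per_corpus({'sport': load_corpus('D:/article/sport'), 'food': load_corpus('D:/article/food')})`. Note that with only two corpora this idf gives log(2/2) = 0 to every word that occurs in just one of them and a negative value to words that occur in both, so the top of each ranking is a tie; adding more corpora or choosing a different idf smoothing would separate the words.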
