How to print the LDA topics models from gensim? Python


【Title】: How to print the LDA topics models from gensim? Python 【Posted】: 2013-02-07 14:08:55 【Question】:

Using gensim, I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA model?

When iterating over lda.print_topics(10), the code raised the following error, because print_topics() returns None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# I can print out the topics for LSA
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus]

for l,t in izip(corpus_lsi,corpus):
  print l,"#",t
print
for top in lsi.print_topics(2):
  print top

# I can print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]

for l,t in izip(corpus_lda,corpus):
  print l,"#",t
print

# But I am unable to print out the topics, how should i do it?
for top in lda.print_topics(10):
  print top

【Question comments】:

Something is missing in your code, namely the corpus_tfidf computation. Could you please add the remaining part?

【Answer 1】:

After some messing around, it seems that print_topics(numoftopics) on the ldamodel has a bug. So my workaround is to use print_topic(topicid):

>>> print lda.print_topics()
None
>>> for i in range(lda.num_topics):
...     print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...
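
On reading these numbers: each value is the weight (probability) that the topic assigns to the word next to it. Below is a minimal sketch that parses the line above and checks one observation. It assumes the question's dictionary holds 12 unique tokens after filtering (counted by hand from the nine sample documents); a topic that spreads its mass evenly over 12 words gives each one 1/12, which rounds to 0.083. `parse_topic` is a hypothetical helper, not a gensim function.

```python
# The topic string printed above, copied verbatim.
topic = ("0.083*response + 0.083*interface + 0.083*time + 0.083*human + "
         "0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + "
         "0.083*trees + 0.083*system")


def parse_topic(topic_string):
    """Split 'weight*word + weight*word + ...' into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        # Newer output wraps words in double quotes; strip them if present.
        pairs.append((word.strip('"'), float(weight)))
    return pairs


pairs = parse_topic(topic)

# Assumption: 12 unique tokens remain in the question's dictionary.
vocab_size = 12
uniform = 1.0 / vocab_size

print(pairs[0])           # ('response', 0.083)
print(round(uniform, 3))  # 0.083
```

If every word in a topic shows the same weight, that suggests the topic has learned nothing specific, which is plausible when 50 topics are requested for nine short documents.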

【Comments】:

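print_topics is an alias of show_topics for the first five topics. Just write lda.show_topics(); no print is needed.

@alvas Can you explain what the values in the output are, e.g. 0.083 for response? What does that mean in plain English? Thanks.

【Answer 2】: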

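I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

For num_topics number of topics, return the num_words most significant words (10 words per topic, by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if False.

If log is True, also output this result to the log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.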
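
A minimal sketch of how the `formatted` flag changes what you have to handle. It assumes the return shapes seen elsewhere in this thread: `(topic_id, string)` pairs for `formatted=True` and `(topic_id, [(word, probability), ...])` pairs for `formatted=False`. The docstring quoted above describes a different, older shape, so check your gensim version. The sample values are copied from another answer's output, and `topic_words` is a hypothetical helper, not a gensim function.

```python
def topic_words(topics):
    """Return {topic_id: [word, ...]} from show_topics()-style output.

    Hypothetical helper. Assumes each item is (topic_id, payload), where
    payload is either a formatted string or a list of (word, prob) pairs.
    """
    result = {}
    for topic_id, payload in topics:
        if isinstance(payload, str):
            # formatted=True: '0.020*"business" + 0.018*"data"'
            terms = payload.split(" + ")
            words = [term.split("*", 1)[1].strip('"') for term in terms]
        else:
            # formatted=False: [("business", 0.020), ("data", 0.018)]
            words = [word for word, _prob in payload]
        result[topic_id] = words
    return result


formatted = [(0, '0.020*"business" + 0.018*"data" + 0.012*"experience"')]
unformatted = [(0, [("business", 0.020), ("data", 0.018), ("experience", 0.012)])]

print(topic_words(formatted))    # {0: ['business', 'data', 'experience']}
print(topic_words(unformatted))  # {0: ['business', 'data', 'experience']}
```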

【Comments】:

【Answer 3】:

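I think it is always more helpful to see the topics as a list of words. The following code snippet helps achieve that. I assume you already have an LDA model called lda_model.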

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

In the code above, I chose to show the top 30 words belonging to each topic. For simplicity, I show only the first two topics I get.

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

I don't really like the way the topics above look, so I usually change my code to the following:

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))

... and the output (first 2 topics shown) will look like this.

Topic: 0 
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1 
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head

【Comments】:

@alvas, sorry for the late answer, but I would love to hear what you think of it.

【Answer 4】:

Are you using any logging? print_topics prints to the log file, as stated in the docs.

As @mac389 says, lda.show_topics() is the way to print to the screen.
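
If you do want to see what print_topics() writes, logging has to be switched on first. A minimal sketch, assuming gensim logs under loggers whose names start with `gensim` and emits the topic lines at INFO level; `enable_gensim_logging` is a hypothetical helper name.

```python
import logging


def enable_gensim_logging(level=logging.INFO):
    """Route log records (including gensim's) to stderr."""
    # basicConfig attaches a stream handler to the root logger; it does
    # nothing if the root logger already has handlers configured.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=level,
    )
    # Make sure the "gensim" logger hierarchy is not filtered out.
    logging.getLogger("gensim").setLevel(level)


enable_gensim_logging()
# With a trained model named `lda` (as in the question), calls that log
# topics, such as lda.print_topics(10), should now be visible.
```

This only makes the log output visible. To use the topics in your own code, you still need the return value of show_topics().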

【Comments】:

I'm not using any logging because I need to use the topics immediately. You're right, lda.show_topics() or lda.print_topic(i) is the way to go.

【Answer 5】:

Use Gensim to clean up its own topic format.

from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, strip_numeric

lda_topics = lda.show_topics(num_words=5)

topics = []
filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

for topic in lda_topics:
    print(topic)
    topics.append(preprocess_string(topic[1], filters))

print(topics)

Output:

(0, '0.020*"business" + 0.018*"data" + 0.012*"experience" + 0.010*"learning" + 0.008*"analytics"')
(1, '0.027*"data" + 0.020*"experience" + 0.013*"business" + 0.010*"role" + 0.009*"science"')
(2, '0.026*"data" + 0.016*"experience" + 0.012*"learning" + 0.011*"machine" + 0.009*"business"')
(3, '0.028*"data" + 0.015*"analytics" + 0.015*"experience" + 0.008*"business" + 0.008*"skills"')
(4, '0.014*"data" + 0.009*"learning" + 0.009*"machine" + 0.009*"business" + 0.008*"experience"')


[
  ['business', 'data', 'experience', 'learning', 'analytics'], 
  ['data', 'experience', 'business', 'role', 'science'], 
  ['data', 'experience', 'learning', 'machine', 'business'], 
  ['data', 'analytics', 'experience', 'business', 'skills'], 
  ['data', 'learning', 'machine', 'business', 'experience']
]

【Comments】:

【Answer 6】:

Here is sample code to print topics:

import pickle
from gensim import corpora, models

def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file where I have lists of lists containing bag of words
    texts = pickle.load(open(filename, "rb"))

    # generate dictionary (named `dictionary` so the built-in dict is not shadowed)
    dictionary = corpora.Dictionary(texts)

    # remove words with low freq.  3 is an arbitrary number I have picked here
    low_occurrence_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq < 3]
    dictionary.filter_tokens(low_occurrence_ids)
    dictionary.compactify()
    corpus = [dictionary.doc2bow(t) for t in texts]
    # Generate LDA Model
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)
    # We print the topics, numbered from 1
    for i, topic in enumerate(lda.show_topics(num_topics=numTopics, formatted=False, topn=20), 1):
        print "Topic #" + str(i) + ":",
        for p, token_id in topic:
            print dictionary[int(token_id)],

        print ""

【Comments】:

I tried to run your code, passing a list containing the BOW as texts. I get the following error: TypeError: show_topics() got an unexpected keyword argument 'topics'

【Answer 7】:

You can use:

for i in  lda_model.show_topics():
    print i[0], i[1]

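【Comments】:

【Answer 8】:

Recently, I ran into a similar issue while working with Python 3 and Gensim 2.3.0. print_topics() and show_topics() did not raise any error, but did not print anything either. It turns out that show_topics() returns a list. So one can simply do: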

topic_list = lda.show_topics()
print(topic_list)

【Comments】:

【Answer 9】:

You can also export the top words of each topic to a CSV file. topn controls how many words to export for each topic.

import pandas as pd

top_words_per_topic = []
for t in range(lda_model.num_topics):
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn = 5)])

pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")

The CSV file has the following format:

Topic Word  P  
0     w1    0.004437  
0     w2    0.003553  
0     w3    0.002953  
0     w4    0.002866  
0     w5    0.008813  
1     w6    0.003393  
1     w7    0.003289  
1     w8    0.003197 
... 

【Comments】:

【Answer 10】:
This code works fine, but I want to know the topic name instead of Topic: 0 and Topic: 1. How do I know which topic a given word belongs to?



for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

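【Comments】:

The LDA model does not name topics by itself; you have to name each topic yourself based on the words in it. It just finds the most frequent words according to the "bag_of_words".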
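
A minimal sketch of that manual naming step, plus a reverse lookup from a word to the topics that list it. The topic words are shortened from the output above; the labels in `topic_labels` are made up for illustration, and `words_to_topics` is a hypothetical helper, not part of gensim.

```python
from collections import defaultdict

# Top words per topic, shortened from the output above.
topics = {
    0: ["associate", "incident", "time", "task", "pain"],
    1: ["work", "associate", "cage", "aid", "shift"],
}

# Hypothetical labels, chosen by a human after reading the words.
topic_labels = {0: "incident reports", 1: "shift work"}


def words_to_topics(topics):
    """Map each word to the sorted list of topic ids that contain it."""
    index = defaultdict(list)
    for topic_id in sorted(topics):
        for word in topics[topic_id]:
            index[word].append(topic_id)
    return dict(index)


index = words_to_topics(topics)

# A word can appear in more than one topic.
for topic_id in index["associate"]:
    print(topic_id, topic_labels[topic_id])
```

Because LDA topics overlap, a word generally maps to several topics rather than exactly one, which is why the lookup returns a list.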
