List all the words in a corpus that reject the null hypothesis with a chi-squared test

Posted 2023-03-12


【Title】: List all the words in corpus that reject null hypothesis with chi-squared test 【Published】: 2019-07-26 16:23:52 【Question】:

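I have a script that lists the top n words (the words with the highest chi-squared values). However, instead of extracting a fixed number of n words, I want to extract every word whose p-value is below 0.05, i.e. every word that rejects the null hypothesis.

Here is my code: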

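from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label

# chi2 returns (statistics, p-values); only the statistics are kept here
chi2score = chi2(X_tfidf, y)[0]
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
scores = list(zip(tfidf.get_feature_names_out(), chi2score))
# sort ascending by score (renamed so the imported chi2 function is not shadowed)
sorted_scores = sorted(scores, key=lambda x: x[1])
allchi2 = list(zip(*sorted_scores))

# list the top 20 words
allchi2 = allchi2[0][-20:]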

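So in this case I do not want the top 20 words. I want all the words that reject the null hypothesis, i.e. all the words in the reviews that depend on the sentiment class (positive or negative).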
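To make "rejects the null hypothesis" concrete for a single word, here is a minimal standard-library sketch of the kind of per-feature statistic that `sklearn.feature_selection.chi2` computes, restricted to two classes. The function name `chi2_pvalue_two_classes` and the toy counts are assumptions made for illustration and do not come from the original post. With two classes there is one degree of freedom, so the p-value can be written as `erfc(sqrt(stat / 2))`.

```python
import math


def chi2_pvalue_two_classes(feature_values, labels):
    """Chi-squared statistic and p-value for one non-negative feature
    against a binary label (1 degree of freedom). Illustrative helper only."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch only handles exactly two classes")
    n = len(labels)
    total = sum(feature_values)
    if total == 0:
        # The word never occurs, so the statistic is undefined.
        return float("nan"), float("nan")
    stat = 0.0
    for c in classes:
        # Observed: how much of the word's total count falls in class c.
        observed = sum(v for v, y in zip(feature_values, labels) if y == c)
        # Expected under independence: total count times the class share.
        expected = total * sum(1 for y in labels if y == c) / n
        stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 dof.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data (assumed): 10 positive reviews followed by 10 negative reviews.
labels = [1] * 10 + [0] * 10
great = [1] * 9 + [0] + [1] + [0] * 9   # mostly in positive reviews
the = [1] * 20                          # equally common in both classes

stat_great, p_great = chi2_pvalue_two_classes(great, labels)
stat_the, p_the = chi2_pvalue_two_classes(the, labels)
print("great:", stat_great, p_great, "rejects H0" if p_great < 0.05 else "keeps H0")
print("the:  ", stat_the, p_the, "rejects H0" if p_the < 0.05 else "keeps H0")
```

In this toy setup the word concentrated in one class falls below the 0.05 threshold, while the word spread evenly over both classes does not.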

【Comments】:

The question has nothing to do with keras - please do not spam irrelevant tags (removed).

【Answer 1】:

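from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label

# chi2 returns two arrays: the chi-squared statistics and their p-values
chi2_score, pval_score = chi2(X_tfidf, y)
# keep every (feature, p-value) pair with p < 0.05, i.e. the features that reject the null hypothesis
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
feature_pval_items = filter(lambda x: x[1] < 0.05, zip(tfidf.get_feature_names_out(), pval_score))
# sort by p-value, most significant first
you_want_feature_pval_items = sorted(feature_pval_items, key=lambda x: x[1])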

【Comments】:
