How to draw a bar plot of the 30 most common words found in fake news in Jupyter Notebook

Posted: 2020-04-07 22:44:03

Question:

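I have Python code that classifies a news item as fake or real. TfidfVectorizer is used to clean the data, and a Passive Aggressive Classifier is used to model the fake news detector. Can someone tell me which line of code I should use to show the 30 most frequently used words in fake news and in real news? And how do I draw a bar plot showing the frequencies of these words?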

%matplotlib inline
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import itertools
import json
import csv
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier  
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

df = pd.read_csv(r".\fake_news(1).csv", sep=',', header=0, engine='python', escapechar='\\')
#print(df)
#df.shape
df.head()
#df.head().to_dict()

headline1 = df.headline
headline1.head()

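# NOTE: is_sarcastic_1 (the label series) is not defined in the code as posted;
# it has to be assigned before this line.
trainx, testx, trainy, testy = train_test_split(df['headline'], is_sarcastic_1, test_size=0.2, random_state=7)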

tvector = TfidfVectorizer(strip_accents='ascii', stop_words='english', max_df=0.5)
ttrain = tvector.fit_transform(trainx)
ttest = tvector.transform(testx)

pac = PassiveAggressiveClassifier(max_iter=100)
pac.fit(ttrain, trainy)

y_pred = pac.predict(ttest)
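score = accuracy_score(testy, y_pred)  # the test labels are named testy above, not y_test
print(f'Accuracy: {round(score*100, 2)}%')  # braces are needed for f-string interpolation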

corpus = ['dem rep. totally nails why congress is falling short on gender, racial equality',
  'eat your veggies: 9 deliciously different recipes',
'inclement weather prevents liar from getting to work',
"mother comes pretty close to using word 'streaming' correctly"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

Comments:

This should help: ***.com/questions/34232190/…
Please show what you have done so far, otherwise we cannot help you. Post how you extract the tfidf scores, etc.
Tiago, I have now posted the whole code.

Answer 1:

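You need to understand what .fit_transform(corpus) returns. It is a matrix whose rows are the sentences of the corpus and whose columns are the words, i.e. the features. The values are the Tfidf scores of each word/feature; note that these are not word counts (read https://en.wikipedia.org/wiki/Tf%E2%80%93idf). So, to get the Tfidf of each word/feature over the whole corpus, you just sum each column, that is, sum over the rows with axis=0.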

import numpy as np
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()

corpus = ['dem rep. totally nails why congress is falling short on gender, racial equality',
  'eat your veggies: 9 deliciously different recipes',
'inclement weather prevents liar from getting to work',
"mother comes pretty close to using word 'streaming' correctly"]

X = vect.fit_transform(corpus)

# zipping actual words and sum of their Tfidf for corpus
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
features_rank = list(zip(vect.get_feature_names_out(), [x[0] for x in X.sum(axis=0).T.tolist()]))

# sorting
features_rank = np.array(sorted(features_rank, key=lambda x:x[1], reverse=True))

n = 10
plt.figure(figsize=(5, 10))
plt.barh(-np.arange(n), features_rank[:n, 1].astype(float), height=.8)
plt.yticks(ticks=-np.arange(n), labels=features_rank[:n, 0])

Comments:

MjH, thanks, much appreciated! Yes, you are right, I need to understand what the functions (methods) actually do.
