Clustering text data with Python's Scikit-Learn lib and plotting

Posted 2023-03-12

Asked: 2019-12-25 16:12:14

Question:

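I am new to clustering and am learning text clustering. I have found a way to build the clusters, and now I am trying to find a way to plot them. This is the error I get when I try to plot the clusters: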

ValueError: setting an array element with a sequence.

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing'
     'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
x = cv.fit_transform(x)    

my_list = []

for i in range(1,8):

    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)

plt.plot(range(1,8),my_list)
plt.show()


kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()

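What am I doing wrong? I would like to see which sentences are grouped in each cluster. Is it even possible to plot them this way? And how can I test the significance of the clusters that were found?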

Comments:

On your second point: ***.com/questions/43784903/…

Answer 1:

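Originally, your observations were sentences. After you apply CountVectorizer to them, your observations are 62-dimensional vectors. The ValueError comes from pyplot (it is not clear to me what you want to plot, since your vectors have such a high dimension and a scatter plot can show only two coordinates at a time).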
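One common way around the dimensionality problem is to project the vectors down to two components before plotting. The sketch below is only an illustration, not the original poster's method: it assumes `TruncatedSVD` as the projection (it accepts the SciPy sparse matrix that `CountVectorizer.fit_transform` returns, which is likely what `plt.scatter` trips over), uses a shortened sentence list, picks `n_clusters=3` arbitrarily, and selects the `Agg` backend and the file name `clusters.png` purely so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative subset of the sentences from the question.
sentences = [
    "this is very good show",
    "i had a great time on my school trip",
    "such a boring movie",
    "this food is delicious",
    "this is my favourite restaurant",
    "i love this food, its so good",
    "this is the best movie ever",
    "hawaii is the best place for trip",
]

# Sparse count matrix of shape (n_sentences, n_terms).
X = CountVectorizer(stop_words="english").fit_transform(sentences)

# Cluster in the full feature space (3 clusters is an arbitrary choice here).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project sentences and centroids onto the same two components for plotting.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)                     # dense array, shape (n_sentences, 2)
centers = svd.transform(kmeans.cluster_centers_)  # dense array, shape (3, 2)

fig, ax = plt.subplots()
for k in range(kmeans.n_clusters):
    mask = labels == k
    ax.scatter(coords[mask, 0], coords[mask, 1], s=15, label=f"Cluster_{k + 1}")
ax.scatter(centers[:, 0], centers[:, 1], s=100, c="black", label="Centroids")
ax.legend()
fig.savefig("clusters.png")  # hypothetical output file name
```

Keep in mind that the clustering happens in the original high-dimensional space, so points that look close in the 2D projection are not guaranteed to share a cluster.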

As far as I can tell, your model will be overly sensitive to pronouns ("this", "that", etc.). Many models remove these and other stop words.
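To see what stop-word removal actually does, it can help to compare the vocabulary with and without it. This is a minimal sketch using scikit-learn's built-in English list (`stop_words="english"`); the three sample sentences are taken from the question, and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this is an amazing item",
    "I cant believe that you did that",
    "such a boring movie",
]

# Vocabulary learned without any stop-word filtering.
vocab_all = set(CountVectorizer().fit(docs).vocabulary_)

# Vocabulary learned with scikit-learn's built-in English stop-word list.
vocab_filtered = set(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

print("removed:", sorted(vocab_all - vocab_filtered))
print("kept:   ", sorted(vocab_filtered))
```

The built-in list is fixed and generic, so it removes common function words such as "this" and "that" but makes no judgment about which of the remaining words matter for a particular corpus.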

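Discussion:

Thank you for the answer about stop words. I would like to know whether it is possible to draw a plot like this to show the clusters of sentences/words on a chart.

Your vector y_kmeans holds the cluster number of each of your sentences. You can use it to see which sentences are being grouped in each cluster.

How can I see that? So if I add stop_words = 'english', will it automatically remove the words that have no "value"/"meaning"? I would like to plot the groups of sentences from my clusters.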
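The suggestion about `y_kmeans` can be sketched as follows. Note that the code in the question overwrites the sentence list `x` with the count matrix, so the sentences have to be kept under their own name; `sentences`, `labels`, and `clusters` below are illustrative names, the sentence list is shortened, and `n_clusters=3` is an arbitrary choice.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Keep the raw sentences in their own variable so they stay available.
sentences = [
    "this is very good show",
    "such a boring movie",
    "this food is delicious",
    "i love this food, its so good",
    "this is the best movie ever",
    "hawaii is the best place for trip",
    "its a shame that you missed the trip",
]

X = CountVectorizer(stop_words="english").fit_transform(sentences)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# labels[i] is the cluster number assigned to sentences[i].
clusters = defaultdict(list)
for sentence, label in zip(sentences, labels):
    clusters[int(label)].append(sentence)

for label in sorted(clusters):
    print(f"Cluster {label}:")
    for sentence in clusters[label]:
        print("   ", sentence)
```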
