How to plot text clusters?
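【Title】: How to plot text clusters? 【Posted】: 2019-12-28 18:41:06
【Question】:
I have started learning clustering with Python and the sklearn library. I wrote some simple code for clustering text data.
My goal is to find groups/clusters of similar sentences.
I tried to plot them, but failed.
The problem is the text data; I always get this error:
ValueError: setting an array element with a sequence.
The same approach works for numeric data, but not for text data.
Is there a way to plot groups/clusters of similar sentences?
Also, is there a way to see what these groups are, what they represent, and how I can identify them?
I printed labels = kmeans.predict(x), but that is just a list of numbers. What do they represent?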
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']
cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
#x_test = cv.transform(x_test)
my_list = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
labels = kmeans.predict(x) #this prints the array of numbers
print(labels)
plt.plot(range(1,11),my_list)
plt.show()
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)
plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()
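【Answer 1】:
This question has several moving parts:
- how to vectorize text into data that kmeans clustering can understand
- how to plot clusters in two-dimensional space
- how to label the plot with the source sentences
My solution follows a very common approach: use the kmeans labels as the colors of a scatter plot. (The values from the fitted kmeans are just 0, 1, 2, 3 and 4, indicating which arbitrary group each sentence was assigned to. The output is in the same order as the original samples.) To reduce the points to two dimensions, I use principal component analysis (PCA). Note that I run kmeans clustering on the full data, not on the dimension-reduced output. I then use matplotlib's ax.annotate() to decorate the plot with the original sentences. (I also made the figure larger so that there is space between the points.) I can comment further on request.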
import pandas as pd
import re
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']
cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')
vectors = cv.fit_transform(x)
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
kmean_indices = kmeans.fit_predict(vectors)
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(vectors.toarray())
colors = ["r", "b", "c", "y", "m" ]
x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])
for i, txt in enumerate(x):
    ax.annotate(txt, (x_axis[i], y_axis[i]))
plt.show()
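To make the point about the label values concrete, here is a minimal sketch of how the numeric labels can be mapped back to sentences. It relies only on the fact stated above, that the labels come back in the same order as the input samples. The shortened `sentences` list, the choice of three clusters, and the `groups` dictionary are illustrative assumptions, not part of the original answer.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative sample (assumption); any list of sentences works.
sentences = [
    'this food is so tasty',
    'this food is delicious',
    'game last night was amazing',
    'party last night was so boring',
    'this song is amazing',
    'such a nice song',
]

vectors = CountVectorizer(stop_words='english').fit_transform(sentences)

# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# labels[i] is the cluster id assigned to sentences[i].
groups = defaultdict(list)
for sentence, label in zip(sentences, labels):
    groups[int(label)].append(sentence)

for label in sorted(groups):
    print(f'Cluster {label}:')
    for sentence in groups[label]:
        print(f'  {sentence}')
```

The cluster ids themselves are arbitrary, so only the grouping carries meaning, not the particular number attached to a group.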
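【Comments】:
What about keyword groups?
That is a separate question. If you post it, I will answer.
OK, thanks Matthew! If you like, you can also help me with this one: ***.com/questions/57669503/…
Also, here is the question about keywords: ***.com/questions/57675486/…
@taga The link is broken.
【Answer 2】:
According to its documentation, matplotlib.pyplot.scatter takes arrays as input, but in your case x[y_kmeans == a,b] is a sparse matrix, so you need to convert it to a numpy array with the .toarray() method. I have modified your code below: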
Modification
plt.scatter(x[y_kmeans == 0,0].toarray(), x[y_kmeans==0,1].toarray(), s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0].toarray(), x[y_kmeans==1,1].toarray(), s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0].toarray(), x[y_kmeans==2,1].toarray(), s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0].toarray(), x[y_kmeans==3,1].toarray(), s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0].toarray(), x[y_kmeans==4,1].toarray(), s = 15, c= 'magenta', label = 'Cluster_5')
Hope this helps!
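As a rough sketch of the same idea, the sparse matrix can be converted once and the per-cluster scatter calls written as a loop. The shortened `sentences` list, the two-cluster setting, and the names `dense` and `mask` are assumptions made for illustration. Note that columns 0 and 1 of the count matrix are just the counts of the first two vocabulary words, so this plot shows only a thin slice of the data.

```python
import matplotlib.pyplot as plt
from scipy.sparse import issparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative sample (assumption).
sentences = [
    'this food is so tasty',
    'this food is delicious',
    'game last night was amazing',
    'party last night was so boring',
    'this song is amazing',
    'such a nice song',
]

x = CountVectorizer(stop_words='english').fit_transform(sentences)
print(issparse(x))  # CountVectorizer returns a SciPy sparse matrix

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x)

# Convert once to a plain 2-D NumPy array, which is safe to index and plot.
dense = x.toarray()

fig, ax = plt.subplots()
colors = ['red', 'blue']
for cluster_id, color in enumerate(colors):
    mask = y_kmeans == cluster_id  # boolean row selector for this cluster
    ax.scatter(dense[mask, 0], dense[mask, 1], s=15, c=color,
               label=f'Cluster_{cluster_id + 1}')

# cluster_centers_ is already a dense array, so it needs no conversion.
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           s=100, c='black', label='Centroids')
ax.legend()
```

Calling plt.show() afterwards displays the figure. Converting with .toarray() is fine for a small vocabulary like this one, but it may use a lot of memory on a large corpus.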
【Comments】:
OK, but what do your groups represent, and how can I identify them?
If you want to cluster text, I suggest looking at more advanced techniques such as topic modeling; you can read about them here: towardsdatascience.com/…