Equivalent of Matlab's cluster quality function?
【Posted】: 2011-10-02 10:24:37
【Question】: MATLAB has a nice silhouette function to help evaluate the number of clusters for k-means. Is there an equivalent for Python's Numpy/Scipy as well?
【Answer 1】: Below is a sample silhouette implementation in both MATLAB and Python/Numpy (keep in mind that I am more fluent in MATLAB):
1) MATLAB
function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise (squared Euclidean) distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster (cell array, one cell per cluster)
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end
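For readers who want to check the definitions of a(i), b(i) and s(i) by hand, here is a minimal NumPy sketch on a made-up four-point dataset. It uses squared Euclidean distances to mirror the MATLAB function above; the function name `silhouette_by_hand` and the data are assumptions for illustration only, not part of the original answer.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Per-point silhouette values using squared Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # squared Euclidean distance between every pair of rows
    diff = X[:, None, :] - X[None, :, :]
    D = (diff ** 2).sum(axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False                       # exclude the point itself
        a = D[i, same].mean()                 # mean distance within own cluster
        b = min(D[i, labels == k].mean()      # lowest mean distance to another cluster
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# hypothetical 1-D data: two points per cluster
X = [[0.0], [1.0], [4.0], [6.0]]
labels = [0, 0, 1, 1]
s = silhouette_by_hand(X, labels)
print(s)
```

For the first point, a = 1 (squared distance to the point at 1) and b = (16 + 36) / 2 = 26, so s = 25/26. Note that this sketch assumes every cluster has at least two members; a singleton cluster would make a(i) undefined.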
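To emulate the plot produced by MATLAB's silhouette function, we group the silhouette values by cluster, sort them within each cluster, and then plot the bars horizontally. MATLAB adds NaNs to separate the bars of different clusters; I found it easier to simply color-code the bars: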
%# sample data
load fisheriris
X = meas;
N = size(X,1);
%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);
%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d'); %#'
h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')
%# compare against SILHOUETTE
figure, silhouette(X,IDX)
2) Python
Here is what I came up with in Python:
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """
    N = X.shape[0]              # number of instances
    K = len(np.unique(cIDX))    # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind])
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)
    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot: sort by cluster, then by descending silhouette value
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0,
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()
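As a side note on the pairwise distance matrix used above: `pdist` returns the distances in condensed form (one value per unordered pair of rows), and `squareform` expands that into a symmetric square matrix. The small sketch below uses made-up random data with 5 rows and 3 columns to show why the result is 5-by-5. Note that the MATLAB version squares the distances (`.^2`) while the Python listing uses plain Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((5, 3))          # 5 observations, 3 features (made-up data)

condensed = pdist(X)            # 5*4/2 = 10 distances, one per unordered pair
D = squareform(condensed)       # expanded into a symmetric 5-by-5 matrix

# D[i, j] is the Euclidean distance between rows i and j of X
manual = np.sqrt(((X[1] - X[3]) ** 2).sum())

print(condensed.shape, D.shape)
print(np.isclose(D[1, 3], manual))
```

Row i of D therefore holds the distances from observation i to every observation, which is what the loop in the silhouette function indexes with `D[i][ind]`.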
Update:
As others have pointed out, scikit-learn has since added its own silhouette metric implementation. To use it in the code above, replace the call to the custom silhouette function with:
from sklearn.metrics import silhouette_samples
...
#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX) # <-- scikit-learn function
...
The rest of the code can still be used as-is to generate the exact same plot.
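Since the question is about evaluating the number of clusters, one possible way to use the scikit-learn metric for that is to compare the mean silhouette value over several candidate values of K. The following is only a sketch under assumptions: the helper name `best_k_by_silhouette` is hypothetical, the data is synthetic, and it uses scikit-learn's `KMeans` and `silhouette_score` rather than the `kmeans2` call from the code above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, candidate_ks, random_state=0):
    """Return (best_k, scores), where scores maps k -> mean silhouette value."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # pick the k with the highest mean silhouette value
    best_k = max(scores, key=scores.get)
    return best_k, scores

# synthetic data: three well-separated 2-D blobs (made up for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k, scores)
```

A higher mean silhouette value suggests tighter, better-separated clusters, but it is a heuristic; inspecting the per-sample plot from the code above is still worthwhile.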
【Comments】:
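Sorry I didn't get back to this sooner. Thank you very much for spending so much time on this; I really appreciate it. From my initial run on my data, the results look very good!
Hi Amro. I was wondering what D = squareform( pdist(X,'euclidean').^2 ) means. I have 5 rows and 3 columns, and D gives me 5 rows and 5 columns. What formula is this, and how does it work? Or could you point me to a source that explains this computation? Thank you. :)
【Answer 2】: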
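I looked, but I couldn't find a numpy/scipy silhouette function; I even looked in pylab and matplotlib. I think you will have to implement it yourself.
I can point you to http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462. It has a few functions that implement the silhouette computation.
Hope this helps.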
【Answer 3】: This is a bit late, but for what it's worth, scikits-learn now appears to implement a silhouette function. See their documentation page, or look at the source code directly.