Equivalent of Matlab's cluster quality function?

【Posted】: 2011-10-02 10:24:37

【Question】:

MATLAB has a nice silhouette function to help evaluate the number of clusters for k-means. Is there an equivalent for Python's Numpy/Scipy?

【Comments】:

【Answer 1】:

Below is a sample silhouette implementation in both MATLAB and Python/Numpy (keep in mind that I am more fluent in MATLAB):

1) MATLAB

function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end
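
The distance matrix D above is N-by-N: entry D(i,j) holds the squared Euclidean distance between rows i and j of X, so the diagonal is zero and the matrix is symmetric. As a rough illustration only, the same computation can be sketched with SciPy on a made-up 5-by-3 array (the data values are assumptions chosen for easy arithmetic):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data: 5 instances (rows) with 3 features (columns)
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [1.0, 1.0, 1.0]])

# pdist returns the N*(N-1)/2 distances between distinct pairs of rows
condensed = pdist(X, 'euclidean')

# squareform expands them into a symmetric N-by-N matrix with a zero diagonal;
# squaring mirrors the .^2 in the MATLAB code above
D = squareform(condensed ** 2)

print(condensed.shape)   # (10,)
print(D.shape)           # (5, 5)
print(D[0, 1], D[0, 2])  # 1.0 4.0
```

Row i of D is what the loop reads when it averages the distances from instance i to the members of a cluster.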

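To emulate the plot produced by MATLAB's silhouette function, we group the silhouette values by cluster, sort them within each cluster, then draw the bars horizontally. MATLAB adds NaNs to separate the bars of different clusters; I found it easier to simply color-code the bars: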

%# sample data
load fisheriris
X = meas;
N = size(X,1);

%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);

%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d');           %#'

h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')

%# compare against SILHOUETTE
figure, silhouette(X,IDX)


2) Python

Here is what I came up with in Python:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """

    N = X.shape[0]              # number of instances
    K = len(np.unique(cIDX))    # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind]) 
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)

    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0, 
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()
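
The per-instance loop in silhouette above can also be written with array operations. The following is only one possible sketch, not part of the original answer: the name silhouette_vectorized and the demo data are hypothetical, it uses plain (unsquared) Euclidean distances like the Python code above, and it assumes every cluster has at least two members.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def silhouette_vectorized(X, labels):
    """Per-instance silhouette values (assumes each cluster has >= 2 members)."""
    labels = np.asarray(labels)
    # own_idx[i] is the position of instance i's cluster in `clusters`
    clusters, own_idx = np.unique(labels, return_inverse=True)
    N = X.shape[0]
    rows = np.arange(N)

    D = squareform(pdist(X))                       # N-by-N Euclidean distances
    # membership[j, k] is 1.0 when instance j belongs to cluster k
    membership = (own_idx[:, None] == np.arange(len(clusters))[None, :]).astype(float)
    sizes = membership.sum(axis=0)                 # number of instances per cluster
    sums = D @ membership                          # sums[i, k]: total distance from i to cluster k

    # a(i): mean distance to the other members of i's own cluster
    # (D[i, i] is zero, so only the divisor needs to exclude i itself)
    a = sums[rows, own_idx] / (sizes[own_idx] - 1)

    # b(i): smallest mean distance from i to any other cluster
    means = sums / sizes
    means[rows, own_idx] = np.inf                  # never pick i's own cluster
    b = means.min(axis=1)

    return (b - a) / np.maximum(a, b)

# Made-up demo data: two small, well-separated groups
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_demo = np.array([0, 0, 1, 1, 1])
s_demo = silhouette_vectorized(X_demo, labels_demo)
print(s_demo)
```

Replacing the loop with one matrix product mainly helps on larger datasets; the N-by-N distance matrix is still built in full, so memory use is unchanged.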


Update:

As others have pointed out, scikit-learn has since added its own silhouette metric implementation. To use it in the code above, replace the call to the custom silhouette function with:

from sklearn.metrics import silhouette_samples

...

#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX)    # <-- scikit-learn function

...

The rest of the code can still be used as-is to generate the exact same plot.
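
Since the question is about judging the number of clusters, the per-instance values can be averaged and compared across several candidate values of K. The sketch below is an assumption-laden illustration rather than part of the original answer: it swaps in scikit-learn's KMeans in place of kmeans2, uses silhouette_score (the mean of silhouette_samples), and the range of K, n_init and random_state are arbitrary choices.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# same Iris data as in the script above
X = datasets.load_iris()['data']

# mean silhouette value for several candidate cluster counts
scores = {}
for k in range(2, 7):
    # n_init and random_state are arbitrary, fixed only for repeatability
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # average over all instances

for k, score in sorted(scores.items()):
    print("K=%d  mean silhouette=%.3f" % (k, score))

# candidate with the highest mean silhouette value
best_k = max(scores, key=scores.get)
print("highest mean silhouette at K=%d" % best_k)
```

A higher mean suggests tighter, better-separated clusters, but it is only one criterion; looking at the per-cluster bars in the plot above is still worthwhile.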

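【Comments】:

Sorry I didn't get back to this sooner. Thank you very much for spending so much time on this; I really appreciate it. From my initial run on my data, the results look very good!

Hi Amro. I was wondering what D = squareform( pdist(X,'euclidean').^2 ) means. I have 5 rows and 3 columns, and D gives me 5 rows and 5 columns. What formula is it? How does it work? Or could you point me to a source link about this calculation? Thank you. :)

【Answer 2】: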

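I looked, but I couldn't find a numpy/scipy silhouette function; I even checked pylab and matplotlib. I think you will have to implement it yourself.

I can point you to http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462. It has a few functions that implement a silhouette function.

Hope this helps.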

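【Comments】:

【Answer 3】:

This is a bit late, but for what it's worth, scikits-learn now appears to implement a silhouette function. See their documentation page or look directly at the source code.

【Comments】: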
