我对 Davies-Bouldin Index 的 python 实现是不是正确？

Posted 2023-03-12

技术标签:

【中文标题】我对 Davies-Bouldin Index 的 python 实现是不是正确？【英文标题】：Is my python implementation of the Davies-Bouldin Index correct?我对 Davies-Bouldin Index 的 python 实现是否正确？ 【发布时间】：2018-06-10 17:17:12 【问题描述】：

我正在尝试在 Python 中计算 Davies-Bouldin Index。

以下是下面的代码尝试重现的步骤。

5 个步骤：

对于每个集群，计算每个点到质心之间的欧几里得距离对于每个集群，计算这些距离的平均值对于每对聚类，计算其质心之间的欧几里得距离

那么，

对于每对聚类，求其到各自质心的平均距离之和（在步骤 2 中计算），然后除以它们之间的距离（在步骤 3 中计算）。

最后，

计算所有这些划分的平均值（= 所有索引）以获得整个聚类的 Davies-Bouldin 索引

代码

def daviesbouldin(X, labels, centroids):

    import numpy as np
    from scipy.spatial.distance import pdist, euclidean

    nbre_of_clusters = len(centroids) #Get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
    distances_means = [] #Store the mean of these distances
    DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
    second_cluster_idx = [] #Store index of the second cluster of each pair
    first_cluster_idx = 0 #Set index of first cluster of each pair to 0

    # Step 1: Compute euclidean distances between each point of a cluster to their centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))

    # Step 2: Compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))

    # Step 3: Compute euclidean distances between each pair of centroid
    ctrds_distance = pdist(centroids) 

    # Tricky step 4: Compute Davies-Bouldin index of each pair of cluster   
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i-1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])

    # Step 5: Compute the mean of all DB_indexes   
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))

在参数中：

X是数据 labels，是通过聚类算法计算的标签（即：kmeans） centroids 是每个集群质心的坐标（即：cluster_centers_）

另外，请注意我使用的是 Python 3

问题1：每对质心之间的欧式距离计算是否正确（步骤3）？

QUESTION2：我对第 4 步的实施是否正确？

QUESTION3：我是否需要标准化集群内和集群间距离？

第 4 步的进一步说明

假设我们有 10 个集群。循环应该计算每对集群的数据库索引。

第一次迭代：

将簇 1 的距离内平均值（distances_means 的索引 0）和簇 2 的距离内平均值（distances_means 的索引 1）相加将此总和除以 2 个集群之间的距离（ctrds_distance 的索引 0）

在第二次迭代中：

将簇 1 的距离内平均值（distances_means 的索引 0）和簇 3 的距离内平均值（distances_means 的索引 2）相加将此总和除以 2 个集群之间的距离（ctrds_distance 的索引 1）

等等……

以 10 个集群为例，完整的迭代过程应如下所示：

intra-cluster distance intra-cluster distance       distance between their
      of cluster:             of cluster:           centroids(storage num):
         0           +             1            /             0
         0           +             2            /             1
         0           +             3            /             2
         0           +             4            /             3
         0           +             5            /             4
         0           +             6            /             5
         0           +             7            /             6
         0           +             8            /             7
         0           +             9            /             8
         1           +             2            /             9
         1           +             3            /             10
         1           +             4            /             11
         1           +             5            /             12
         1           +             6            /             13
         1           +             7            /             14
         1           +             8            /             15
         1           +             9            /             16
         2           +             3            /             17
         2           +             4            /             18
         2           +             5            /             19
         2           +             6            /             20
         2           +             7            /             21
         2           +             8            /             22
         2           +             9            /             23
         3           +             4            /             24
         3           +             5            /             25
         3           +             6            /             26
         3           +             7            /             27
         3           +             8            /             28
         3           +             9            /             29
         4           +             5            /             30
         4           +             6            /             31
         4           +             7            /             32
         4           +             8            /             33
         4           +             9            /             34
         5           +             6            /             35
         5           +             7            /             36
         5           +             8            /             37
         5           +             9            /             38
         6           +             7            /             39
         6           +             8            /             40
         6           +             9            /             41
         7           +             8            /             42
         7           +             9            /             43
         8           +             9            /             44

这里的问题是我不太确定distances_means 的索引是否与ctrds_distance 的索引匹配。

换句话说，我不确定计算的第一个集群间距离对应于集群 1 和集群 2 之间的距离。计算的第二个集群间距离对应于集群 3 和集群 1 之间的距离...依此类推，遵循上述模式。

简而言之：恐怕我将成对的集群内距离除以一个不对应的集群间距离。

欢迎任何帮助！

【问题讨论】：

属于codereview.SE而不是***。 @Anony-Mousse 感谢您的建议。我会马上询问这个实现。 @Anony-Mousse 不，它没有。要成为 Code Review 的主题，代码应该在 OP 的知识范围内工作。由于 OP 显然不知道，因此不适合审查。它是cross-posted there now，可能会在不久的将来关闭。如果您有噪声标签 (-1) 会怎样？ 【参考方案1】：

这个比下面的代码快 20 倍，所有的计算都是在 numpy 中完成的。

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n,1))
    dbi_arr = np.empty((n,n))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
    # divide 
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi

这是一个使用 numpy 1.14、scipy 1.1.0 和 python 3 的实现。计算速度提升不大，但内存占用应该略小。

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
    Vector for each example.

    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
    True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # loop over non-duplicate cluster pairs
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j], 
                                euclidean(centroid_arr[i,...], centroid_arr[j,...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi/n

【讨论】：

【参考方案2】：

感谢代码和修订 - 真的帮助我开始。更短、更快的版本并不完全正确。我对其进行了修改，以正确平均每个集群最相似集群的分散分数。

原始算法和解释见https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure：

DBI 是每个聚类的相似性度量的平均值其最相似的集群。

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]

    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []

    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
         # select Ri value of most similar cluster
         Ri.append(max(Rij)) 

    # get mean of all Ri values    
    dbi = np.mean(Ri)

    return dbi

【讨论】：

【参考方案3】：

这是上述 Davies-Bouldin 索引 naive 实现的更短、更快更正的版本。

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

    return(np.max(db) / n_cluster)

回答我自己的问题：

初稿（第 4 步）中的计数器是正确的，但无关紧要无需标准化簇内和簇间距离计算欧几里得距离时出错

请注意，您可以找到尝试改进此索引的创新方法，尤其是“New Version of Davies-Bouldin Index”，它将欧几里得距离替换为圆柱距离。

【讨论】：

这里，欧几里得只不过是 sklearn.metric.paiwise 的 euclidean_distances，对吧？回头看我似乎在使用 scipy 的代码。 from scipy.spatial.distance import pdist, euclidean cluster_k = [X[labels == k] for k in range(n_cluster)] TypeError: only integer arrays with one element can be convert to an index 我收到此错误我试过这个实现，但是当你有噪声标签（例如-1）时会发生什么？... 这似乎在获取整个质心距离矩阵的最大值而不是每个簇的最大距离时犯了一个错误。【参考方案4】：

感谢您的实施。我只有一个问题：最后一行是否缺少分区。在最后一步中，max(db) 的值应除以实现的集群数。

def DaviesBouldin(Daten, DatenLabels):
n_cluster = len(np.bincount(DatenLabels)) 
cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)] 
centroids = [np.mean(k, axis = 0) for k in cluster_k] 
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)] # mittlere Entfernung zum jeweiligen Clusterzentrum
db = []

for i in range(n_cluster):
    for j in range(n_cluster):
        if j != i:
            db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
return(np.max(db))

也许我监督那个部门，因为我是 Python 新手。但是在我的图形中（我正在迭代一系列集群） DB.max 的值在开始时非常低，然后会增加。在按集群数量缩放后，图表看起来更好（开始时 DB.max 值较高，并且随着集群数量的增加而不断下降）。

最好的问候

【讨论】：

我认为你是对的，随着集群内失真的减少，索引确实应该减少。感谢您发现该疏忽，代码现已更新。（顺便说一句，如果您要进行聚类评估，您可能会看看 Gap Statistic 方法，我发现它总体上非常有效。） cluster_k = [X[labels == k] for k in range(n_cluster)] TypeError: only integer arrays with one element can be convert to an index 我收到此错误我收到此错误

以上是关于我对 Davies-Bouldin Index 的 python 实现是不是正确？的主要内容，如果未能解决你的问题，请参考以下文章