A Comprehensive Survey of Clustering Algorithms

Posted by datazero



Abstract

Data analysis is used as a common method in modern science research, spanning communication science, computer science and biology. Clustering, as a basic component of data analysis, plays a significant role in it. On one hand, many tools for cluster analysis have been created along with the increase of information and the intersection of subjects. On the other hand, each clustering algorithm has its own strengths and weaknesses, due to the complexity of information. In this review paper, we begin with the definition of clustering, take the basic elements involved in the clustering process, such as the distance or similarity measurement and evaluation indicators, into consideration, and analyze the clustering algorithms from two perspectives, the traditional ones and the modern ones. All the discussed clustering algorithms are compared in detail and comprehensively shown in Appendix Table 22.

Keywords Clustering · Clustering algorithm · Clustering analysis · Survey · Unsupervised learning


1 Introduction

Clustering, considered the most important question of unsupervised learning, deals with the data structure partition in unknown areas and is the basis for further learning. No agreement has been reached, however, on a complete definition of clustering, and a classic one is described as follows [1]:


  • (1) Instances in the same cluster must be as similar as possible;

  • (2) Instances in different clusters must be as different as possible;

  • (3) The measurement for similarity and dissimilarity must be clear and have practical meaning;


The standard process of clustering can be divided into the following several steps [2]:

  • (1) Feature extraction and selection: extract and select the most representative features from the original data set;

  • (2) Clustering algorithm design: design the clustering algorithm according to the characteristics of the problem;

  • (3) Result evaluation: evaluate the clustering result and judge the validity of the algorithm;

  • (4) Result explanation: give a practical explanation for the clustering result;


In the rest of this paper, the common similarity and distance measurements are introduced in Sect. 2, the evaluation indicators for clustering results are listed in Sect. 3, the traditional and the modern clustering algorithms are analyzed systematically in Sects. 4 and 5 respectively, and the final conclusion is drawn in Sect. 6.


2 Distance and Similarity

Distance (dissimilarity) and similarity are the basis for constructing clustering algorithms. For quantitative data features, distance is preferred for recognizing the relationship among data, while similarity is preferred when dealing with qualitative data features [2].

  • The commonly used distance functions for quantitative data features are summarized in Table 1.

  • The commonly used similarity functions for qualitative data features are summarized in Table 2.

[Tables 1 and 2 not reproduced]
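As a concrete illustration of the two families, here is a minimal sketch of several widely used distance and similarity functions (Euclidean, Manhattan and Chebyshev distances and cosine similarity for quantitative features, Jaccard similarity for qualitative ones); whether each of these appears in the unreproduced Tables 1 and 2 is an assumption:

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # L-infinity distance: max_i |x_i - y_i|
    return np.max(np.abs(x - y))

def cosine_similarity(x, y):
    # Angle-based similarity in [-1, 1] for quantitative features.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_similarity(a, b):
    # Overlap of two qualitative (set-valued or binary) feature sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y), cosine_similarity(x, y))
print(jaccard_similarity(["red", "round"], ["red", "square"]))
```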

3 Evaluation Indicator

The main purpose of an evaluation indicator is to test the validity of an algorithm. Evaluation indicators can be divided into two categories, internal and external, according to whether the test data is used in the process of constructing the clustering algorithm.


The internal evaluation uses the internal data to test the validity of the algorithm. However, it cannot absolutely judge which algorithm is better when the scores of two algorithms are not equal, based on the internal evaluation indicators [5]. There are three commonly used internal indicators, summarized in Table 3.

[Table 3 not reproduced]

The external evaluation, which is regarded as the gold standard for testing methods, uses external data to test the validity of the algorithm. However, it has recently been shown that the external evaluation is not completely correct [6]. There are six commonly used external evaluation indicators, summarized in Table 4.

[Table 4 not reproduced]
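As a concrete illustration, the sketch below computes two widely used internal indicators (the silhouette coefficient and the Davies-Bouldin index) and two external ones (the adjusted Rand index and normalized mutual information) with scikit-learn; these are common choices and may differ from the exact indicators listed in the unreproduced Tables 3 and 4:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal indicators: need only the data and the predicted labels.
print("silhouette:", silhouette_score(X, y_pred))
print("Davies-Bouldin:", davies_bouldin_score(X, y_pred))

# External indicators: compare predicted labels against external ground truth.
print("adjusted Rand:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```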

In the following sections, especially in the analysis of time complexity, n stands for the number of total objects/data points, k stands for the number of clusters, s stands for the number of sample objects/data points, and t stands for the number of iterations.


4 Traditional Clustering Algorithms

The traditional clustering algorithms can be divided into 9 categories which mainly contain 26 commonly used ones, summarized in Table 5.

[Table 5 not reproduced]

4.1 Clustering Algorithm Based on Partition

The basic idea of this kind of clustering algorithms is to regard the center of the data points as the center of the corresponding cluster. K-means [7] and K-medoids [8] are the two most famous algorithms of this kind. The core idea of K-means is to update the center of each cluster, represented by the center of the data points assigned to it, by iterative computation, and the iterative process continues until some criterion for convergence is met. K-medoids is an improvement of K-means for dealing with discrete data, which takes the data point nearest to the center of the data points as the representative of the corresponding cluster. The typical clustering algorithms based on partition also include PAM [9], CLARA [10] and CLARANS [11].
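To make the iterative center update concrete, here is a minimal NumPy sketch of plain K-means; the random-sampling initialization, the iteration cap and the convergence test are simplifying assumptions, not the reference formulation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and center update until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points;
        # keep the old center if a cluster happens to become empty.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence criterion
            break
        centers = new_centers
    return labels, centers
```

K-medoids differs only in the update step: instead of the mean, the member of the cluster minimizing the total distance to the other members is chosen as the new representative.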


For more information about this kind of clustering algorithms, you can refer to [12–14].

Analysis:

  • (1) Time complexity (Table 6):

  • (2) Advantages: relatively low time complexity and high computing efficiency in general;

  • (3) Disadvantages: not suitable for non-convex data, relatively sensitive to outliers, easily drawn into local optima, the number of clusters needed to be preset, and the clustering result sensitive to the number of clusters;

  • (4) The AP algorithm [15], which will be discussed in the section Clustering algorithm based on affinity propagation, can also be considered as one of this kind of clustering algorithms.

[Table 6 not reproduced]

4.2 Clustering Algorithm Based on Hierarchy

The basic idea of this kind of clustering algorithms is to construct the hierarchical relationship among data in order to cluster [16]. Suppose that each data point stands for an individual cluster in the beginning; then the two most neighboring clusters are merged into a new cluster repeatedly until there is only one cluster left, or the reverse process is carried out. Typical algorithms of this kind of clustering include BIRCH [17], CURE [18], ROCK [19] and Chameleon [20]. BIRCH realizes the clustering result by constructing a clustering feature tree, the CF tree, in which each node stands for a subcluster. The CF tree grows dynamically when a new data point arrives. CURE, suitable for large-scale clustering, uses a random sampling technique to cluster the samples separately and integrates the results at the end. ROCK is an improvement of CURE for dealing with data of enumeration type, which takes the effect of the data around a cluster on the similarity into consideration. Chameleon first divides the original data into clusters of smaller size based on the nearest-neighbor graph, and then the small clusters are merged into bigger ones by an agglomerative algorithm, until satisfied.
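The agglomerative (bottom-up) variant described above can be sketched with SciPy's hierarchical-clustering routines; the Ward merge criterion and the two-cluster cut below are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Start from singleton clusters and repeatedly merge the two closest
# ones; Z records the full merge hierarchy (the dendrogram).
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```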


For more information about this kind of clustering algorithms, you can refer to [21,22].

Analysis:

  • (1) Time complexity (Table 7):

  • (2) Advantages: suitable for data sets with arbitrary shape and attributes of arbitrary type, the hierarchical relationship among clusters easily detected, and relatively high scalability in general;

  • (3) Disadvantages: relatively high time complexity in general, and the number of clusters needs to be preset.

[Table 7 not reproduced]

4.3 Clustering Algorithm Based on Fuzzy Theory

The basic idea of this kind of clustering algorithms is that the discrete belonging label, taking values in {0, 1}, is relaxed to the continuous interval [0, 1], in order to describe the belonging relationship among objects more reasonably. Typical algorithms of this kind of clustering include FCM [23–25], FCS [26] and MM [27]. The core idea of FCM is to get the membership of each data point to every cluster by optimizing the objective function. FCS, different from the traditional fuzzy clustering algorithms, takes the multidimensional hypersphere as the prototype of each cluster, so as to cluster with a distance function based on the hypersphere. MM, based on the Mountain Function, is used to find the centers of clusters.
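A minimal NumPy sketch of the standard FCM update loop, alternating between the weighted-mean center update and the membership update u_ij = 1 / Σ_k (d_ij/d_ik)^(2/(m-1)); the fuzzifier m = 2 and the random initialization are conventional assumptions:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Minimal fuzzy c-means: alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)       # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]        # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < eps:   # memberships stabilized
            U = U_new
            break
        U = U_new
    return U, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
U, centers = fcm(X, c=2)
print(U[:3].round(3), centers.round(2))
```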


For more information about this kind of clustering algorithms, you can refer to [28–30].

Analysis:

  • (1) Time complexity (Table 8):

  • (2) The time complexity of FCS is high due to the kernel involved in the algorithm;

  • (3) Advantages: more realistic in giving the probability of belonging, and relatively high accuracy of clustering;

  • (4) Disadvantages: relatively low scalability in general, easily drawn into local optima, the clustering result sensitive to the initial parameter values, and the number of clusters needed to be preset.

[Table 8 not reproduced]

4.4 Clustering Algorithm Based on Distribution

The basic idea is that data generated from the same distribution belongs to the same cluster, if there exist several distributions in the original data. The typical algorithms are DBCLASD [31] and GMM [32]. The core idea of DBCLASD, a dynamic incremental algorithm, is that if the distance between a cluster and its nearest data point satisfies the distribution of expected distances generated from the existing data points of that cluster, the nearest data point should belong to this cluster. The core idea of GMM is that the original data is generated from several Gaussian distributions, and data obeying the same independent Gaussian distribution is considered to belong to the same cluster.
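A minimal sketch with scikit-learn's GaussianMixture illustrates the idea on synthetic data drawn from two Gaussians; the component count and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two Gaussian components generate the data, matching the GMM premise.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment: the most likely component
probs = gmm.predict_proba(X)   # soft assignment: the probability of belonging
print(probs[:3].round(3))
```

The soft assignment is exactly the "probability of belonging" listed among the advantages below.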


For more information about this kind of clustering algorithms, you can refer to [33,34].

Analysis:

  • (1) Time complexity (Table 9):

  • (2) Advantages: more realistic in giving the probability of belonging, relatively high scalability by changing the distribution, the number of clusters and so on, and supported by well-developed statistical science;

  • (3) Disadvantages: the premise not completely correct, many parameters involved that have a strong influence on the clustering result, and relatively high time complexity.

[Table 9 not reproduced]

4.5 Clustering Algorithm Based on Density

The basic idea of this kind of clustering algorithms is that data in a region of the data space with high density is considered to belong to the same cluster [35]. The typical ones include DBSCAN [36], OPTICS [37] and Mean-shift [38]. DBSCAN is the best-known density-based clustering algorithm, generated directly from the basic idea of this kind of clustering algorithms. OPTICS is an improvement of DBSCAN that overcomes DBSCAN's shortcoming of being sensitive to its two parameters, the radius of the neighborhood and the minimum number of points in a neighborhood. In the process of Mean-shift, the mean offset of the current data point is calculated first, the next data point is then determined based on the current data point and the offset, and the iteration continues until some criterion is met.
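A short scikit-learn sketch shows DBSCAN on a non-convex data set; the eps and min_samples values are assumed for this toy example, and they are precisely the two parameters OPTICS was designed to relax:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-convex data where density-based clustering succeeds and K-means fails.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius, min_samples the minimum number of
# neighbors required for a core point.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # label -1 marks noise points
```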


For more information about this kind of clustering algorithms, you can refer to [39–42].

Analysis:

  • (1) Time complexity (Table 10):

  • (2) The time complexity of Mean-shift is high due to the kernel involved in the algorithm;

  • (3) Advantages: clustering with high efficiency and suitable for data with arbitrary shape;

  • (4) Disadvantages: a clustering result of low quality when the density of the data space is uneven, a large amount of memory needed when the data volume is big, and the clustering result highly sensitive to the parameters;

  • (5) The DENCLUE algorithm [43], which will be discussed in the section Clustering algorithm for large-scale data, can also be considered as one of this kind of clustering algorithms.

[Table 10 not reproduced]

4.6 Clustering Algorithm Based on Graph Theory

According to this kind of clustering algorithms, clustering is realized on a graph where each node is regarded as a data point and each edge is regarded as the relationship among data points. Typical algorithms of this kind of clustering are CLICK [44] and MST-based clustering [45]. The core idea of CLICK is to iteratively carry out minimum-weight divisions of the graph in order to generate the clusters. Generating the minimum spanning tree from the data graph is the key step of the cluster analysis in the MST-based clustering algorithm.
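A minimal sketch of MST-based clustering with SciPy: build the minimum spanning tree of the complete distance graph, delete the k-1 heaviest edges, and read the clusters off as connected components. The complete-graph construction and the edge-cutting rule are common choices assumed here, not prescribed by [45]:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_clustering(X, k):
    """Cut the k-1 heaviest MST edges; components become the clusters."""
    D = squareform(pdist(X))                  # dense pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()  # n-1 edges of the spanning tree
    heaviest = np.sort(mst[mst > 0])[-(k - 1):]
    mst[np.isin(mst, heaviest)] = 0           # delete the k-1 heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
print(mst_clustering(X, 2))
```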


For more detailed information about this kind of clustering algorithms, you can refer to [1,20,46–49].

Analysis:

  • (1) Time complexity (Table 11): where v stands for the number of vertices, e stands for the number of edges, and f(v, e) stands for the time complexity of computing a minimum cut;

  • (2) Advantages: clustering with high efficiency, and a clustering result with high accuracy;

  • (3) Disadvantages: the time complexity increasing dramatically with the increase of graph complexity;

  • (4) The SM algorithm [50] and the NJW algorithm [51], which will be discussed in the section Clustering algorithm based on spectral graph theory, can also be considered as members of this kind of clustering algorithms.

[Table 11 not reproduced]

4.7 Clustering Algorithm Based on Grid

The basic idea of this kind of clustering algorithms is that the original data space is changed into a grid structure of definite size for clustering. The typical algorithms of this kind of clustering are STING [52] and CLIQUE [53]. The core idea of STING, which can be used for parallel processing, is that the data space is divided into many rectangular units by constructing a hierarchical structure, and the data within different levels of the structure is clustered respectively. CLIQUE combines the advantages of the grid-based and the density-based clustering algorithms.


For more detailed information about this kind of clustering algorithms, you can refer to [41,54–57].

Analysis:

  • (1) Time complexity (Table 12):

  • (2) Advantages: low time complexity, high scalability, and suitable for parallel processing and incremental updating;

  • (3) Disadvantages: the clustering result sensitive to the granularity (the mesh size), and the high calculation efficiency coming at the cost of reduced cluster quality and clustering accuracy;

  • (4) The Wavecluster algorithm [54], which will be discussed in the section Clustering algorithm for spatial data, can also be considered as one of this kind of clustering algorithms.

[Table 12 not reproduced]

4.8 Clustering Algorithm Based on Fractal Theory

A fractal stands for a geometric object that can be divided into several parts, each of which shares some common characteristics with the whole [58]. The typical algorithm of this kind of clustering is FC [59], of which the core idea is that any change of the inner data of a cluster does not have any influence on the intrinsic quality of the fractal dimension.


For more detailed information about this kind of clustering algorithms, you can refer to [60–63].

Analysis:

  • (1) The time complexity of FC is O(n);

  • (2) Advantages: clustering with high efficiency, high scalability, dealing with outliers effectively, and suitable for data with arbitrary shape and high dimension;

  • (3) Disadvantages: the premise not completely correct, and the clustering result sensitive to the parameters.


4.9 Clustering Algorithm Based on Model

The basic idea is to select a particular model for each cluster and find the best fit of the data to that model. There are mainly two kinds of model-based clustering algorithms, one based on the statistical learning method and the other based on the neural network learning method.


The typical algorithms based on the statistical learning method are COBWEB [64] and GMM [32]. The core idea of COBWEB is to build a classification tree, based on some heuristic criteria, in order to realize hierarchical clustering under the assumption that the probability distributions of the attributes are independent of each other. The typical algorithms based on the neural network learning method are SOM [65] and ART [66–69]. The core idea of SOM is to build a dimension-reducing mapping from the high-dimensional input space to the low-dimensional output space, under the assumption that a topology exists in the input data. The core idea of ART, an incremental algorithm, is to dynamically generate a new neuron to match a new pattern and thereby create a new cluster when the current neurons are not enough. GMM has been discussed in the section Clustering algorithm based on distribution.


For more detailed information about this kind of clustering algorithms, you can refer to [70–75].

Analysis:

  • (1) Time complexity (Table 13):

  • (2) The time complexity of COBWEB is generally low, which depends on the distribution involved in the algorithm;

  • (3) The time complexity of SOM is generally high, which depends on the layer construction involved in the algorithm;

  • (4) The time complexity of ART is generally middle, which depends on the type of ART and the layer construction involved in the algorithm;

  • (5) Advantages: diverse and well-developed models providing means to describe data adequately, and each model having its own special characteristics that may bring significant advantages in some specific areas;

  • (6) Disadvantages: relatively high time complexity in general, the premise not completely correct, and the clustering result sensitive to the parameters of selected models.

[Table 13 not reproduced]

5 Modern Clustering Algorithms

The modern clustering algorithms can be divided into 10 categories which mainly contain 45 commonly used ones, summarized in Table 14.

[Table 14 not reproduced]

5.1 Clustering Algorithm Based on Kernel

The basic idea of this kind of clustering algorithms is that data in the input space is transformed into a high-dimensional feature space by a nonlinear mapping for the cluster analysis. The typical algorithms of this kind of clustering include kernel K-means [76], kernel SOM [77], kernel FCM [78], SVC [79], MMC [80] and MKC [81]. The basic idea of kernel K-means, kernel SOM and kernel FCM is to take advantage of the kernel method and the original clustering algorithm, transforming the original data into a high-dimensional feature space by a nonlinear kernel function in order to carry out the original clustering algorithm there. The core idea of SVC is to find the sphere with the minimum radius that can cover all the data points in the high-dimensional feature space, then map the sphere back into the original data space to form isolines, namely the borders of clusters, covering the data; the data enclosed by the same closed isoline should belong to the same cluster. MMC tries to find the hyperplane with the maximum margin to cluster, and it can be extended to the multi-label clustering problem. MKC, an improvement of MMC, tries to find the best hyperplane based on several kernels to cluster. MMC and MKC both suffer from computational limitations to a degree.
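A minimal NumPy sketch of kernel K-means shows the mechanism shared by this family: all feature-space distances are expressed through the kernel (Gram) matrix, so the high-dimensional space is never materialized. The RBF kernel and the random initialization are assumptions for illustration:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel K-means on a precomputed kernel matrix K (n x n)."""
    rng = np.random.default_rng(seed)
    n = len(K)
    labels = rng.integers(0, k, n)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:
                continue
            # ||phi(x_i) - mu_c||^2 expanded purely in kernel terms:
            # K_ii - (2/|c|) sum_{j in c} K_ij + (1/|c|^2) sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K) - 2 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

X = np.random.randn(100, 2)
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)        # RBF kernel as the assumed nonlinear mapping
print(kernel_kmeans(K, 3)[:10])
```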


For more detailed information about this kind of clustering algorithms, you can refer to [82–84].


Analysis:

  • (1) Time complexity (Table 15):

  • (2) The time complexity of this kind of clustering algorithms is generally high due to the kernel involved in the algorithm;

  • (3) Advantages: easier to cluster in the high-dimensional feature space, suitable for data with arbitrary shape, able to analyze noise and separate overlapping clusters, and no preliminary knowledge about the topology of the data needed;

  • (4) Disadvantages: the clustering result sensitive to the type of kernel and its parameters, high time complexity, and not suitable for large-scale data.

[Table 15 not reproduced]

5.2 Clustering Algorithm Based on Ensemble

The clustering algorithm based on ensemble is also called ensemble clustering, of which the core idea is to generate a set of initial clustering results by a particular method, with the final clustering result obtained by integrating the initial ones. There are mainly 4 kinds of methods to get the set of initial clustering results, as follows:


  • (1) For the same data set, employ the same algorithm with the different parameters or the different initial conditions [85];

  • (2) For the same data set, employ the different algorithms [86];

  • (3) For subsets of the data, carry out the clustering separately [86];

  • (4) For the same data set, carry out the clustering in different feature spaces based on different kernels [87].


The initial clustering results are integrated by means of the consensus function. The consensus functions can be divided into the following 9 categories, summarized in Table 16:

[Table 16 not reproduced]
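As a concrete illustration, here is a sketch of one popular consensus scheme, evidence accumulation over a co-association matrix, combining initial results produced by method (1) above; the base algorithm, the number of runs and the final hierarchical cut are all assumptions, and the references describe many alternatives:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Method (1): the same algorithm with different initial conditions.
runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit_predict(X)
        for s in range(10)]

# Co-association consensus: C[i, j] = fraction of runs grouping i with j.
C = np.mean([(r[:, None] == r[None, :]).astype(float) for r in runs], axis=0)
np.fill_diagonal(C, 1.0)

# Treat 1 - C as a distance and extract the final partition hierarchically.
Z = linkage(squareform(1.0 - C), method="average")
final = fcluster(Z, t=3, criterion="maxclust")
print(final[:20])
```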

For more detailed information about this kind of clustering algorithms, you can refer to [95].

Analysis:

  • (1) The time complexity of this kind of algorithm depends on the specific consensus method and the base algorithms involved;

  • (2) Advantages: robust, scalable, able to be parallelized, and taking advantage of the strengths of the involved algorithms;

  • (3) Disadvantages: inadequate understanding of the differences among the initial clustering results, and existing deficiencies in the design of the consensus function.


5.3 Clustering Algorithm Based on Swarm Intelligence

The basic idea of this kind of clustering algorithms is to simulate the changing process of a biological population. Typical algorithms fall into 4 main categories: ACO-based [96,97], PSO-based [97,98], SFLA-based [99] and ABC-based [100]. The core idea of LF [101], the typical ACO-based algorithm, is that data is first distributed randomly on a two-dimensional grid, then each data item is selected or not for further operation based on the decision of an ant, and this process is iterated until a satisfactory clustering result is obtained. The PSO-based algorithms regard each data point as a particle. The initial clusters of particles are obtained by another clustering algorithm first; then the clusters of particles are updated continuously based on the centers of the clusters and the location and speed of each particle, until a satisfactory clustering result is obtained. The core idea of the SFLA-based algorithms is to simulate the information interaction of frogs, taking advantage of local search and global information interaction. The core idea of the ABC-based algorithms is to simulate the foraging behavior of three types of bees, whose duty is to determine the food source, in a bee population, and to make use of the exchange of local and global information for clustering.


For more detailed information about this kind of clustering algorithms, you can refer to [102–104].

Analysis:

  • (1) Time complexity (Table 17):

  • (2) The time complexity of this kind of algorithm is high, mainly because of the large number of iterations;

  • (3) Advantages: the ability to avoid being easily drawn into local optima and to reach the global optimum, and algorithms that are easy to understand;

  • (4) Disadvantages: low scalability, low operating efficiency, and not suitable for high-dimensional or large-scale data.

[Table 17 not reproduced]

5.4 Clustering Algorithm Based on Quantum Theory

The clustering algorithm based on quantum theory is called quantum clustering, of which the basic idea is to study the distribution law of sample data in the scale space by studying the distribution law of particles in an energy field. The typical algorithms of this kind include QC [105,106] and DQC [107]. The core idea of QC (quantum clustering), suitable for high-dimensional data, is to get the potential energy of each object from the Schrödinger equation using the iterative gradient descent algorithm, regard an object with low potential energy as a center of a cluster, and put the objects into different clusters by the defined distance function. DQC, an improvement of QC, adopts the time-based Schrödinger equation in order to study the change of the original data set and the structure of the quantum potential energy function dynamically.


For more detailed information about this kind of clustering algorithms, you can refer to [108–110].

Analysis:

  • (1) Time complexity (Table 18):

  • (2) The time complexity of QC is high, due to the process of solving the Schrödinger equation and the large number of iterations;

  • (3) The time complexity of DQC, which is more practical than QC, is middle, due to the process of solving the Schrödinger equation;

  • (4) Advantages: a small number of parameters involved in this kind of algorithm, and the determination of the center of a cluster based on the potential information of the sample data;

  • (5) Disadvantages: the clustering result sensitive to the parameters of the algorithm, and the algorithm model not able to describe the law of change of the data completely.

[Table 18 not reproduced]

5.5 Clustering Algorithm Based on Spectral Graph Theory

The basic idea of this kind of clustering algorithms is to regard the objects as vertices and the similarity among objects as weighted edges, in order to transform the clustering problem into a graph partition problem. The key is to find a method of graph partition that makes the weight of the connections between different groups as small as possible and the total weight of the edges within the same group as high as possible [111]. The typical algorithms of this kind of clustering can mainly be divided into two categories, recursive spectral and multiway spectral; the typical algorithms of these two categories are SM [50] and NJW [51] respectively. The core idea of SM, which is usually used for image segmentation, is to minimize the Normalized Cut by a heuristic method, based on the eigenvectors. NJW carries out the clustering analysis in the feature space constructed by the eigenvectors corresponding to the k largest eigenvalues of the Laplacian matrix.
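A compact NumPy sketch of the NJW pipeline: normalize the affinity matrix, embed the data with the eigenvectors of the k largest eigenvalues, row-normalize, and run K-means in the embedding. The Gaussian affinity and the value of the scaling parameter sigma are assumptions of this toy example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
sigma, k = 0.1, 2

# Gaussian affinity; sigma is the scaling parameter the result is sensitive to.
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-sq / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)

# Normalized matrix D^{-1/2} W D^{-1/2}; NJW embeds the data with the
# eigenvectors of its k largest eigenvalues.
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
vals, vecs = np.linalg.eigh(L)
U = vecs[:, -k:]
U /= np.linalg.norm(U, axis=1, keepdims=True)   # row normalization

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```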


For more detailed information about this kind of clustering algorithms, you can refer to [51,84,112–114].

Analysis:

  • (1) Time complexity (Table 19):

  • (2) The time complexity of SM is high, due to the process of computing the eigenvectors and the heuristic method involved in the algorithm;

  • (3) The time complexity of NJW is high, due to the process of computing the eigenvectors;

  • (4) Advantages: suitable for data sets with arbitrary shape and high dimension, convergence to the global optimum, only the similarity matrix needed as the input, and not sensitive to outliers;

  • (5) Disadvantages: the clustering result sensitive to the scaling parameter, relatively high time complexity, no clear guidance on the construction of the similarity matrix, the selection of eigenvectors not optimized, and the number of clusters needed to be preset.

[Table 19 not reproduced]

5.6 Clustering Algorithm Based on Affinity Propagation

AP (affinity propagation clustering) is a significant algorithm, which was proposed in Science in 2007. The core idea of AP is to regard all the data points as potential cluster centers and the negative of the Euclidean distance between two data points as the affinity. Thus, the bigger the sum of the affinities of one data point with the other data points, the higher the probability of this data point being a cluster center. The AP algorithm takes a greedy strategy that maximizes the value of the global function of the clustering network during every iteration [15].
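A short scikit-learn sketch; the preference parameter, which governs how many exemplars emerge, is left at its default (the median similarity), which is an assumption of this example:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# AP exchanges "responsibility" and "availability" messages between points;
# the number of clusters emerges instead of being preset.
ap = AffinityPropagation(random_state=0).fit(X)
print("exemplars found:", len(ap.cluster_centers_indices_))
```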


For more detailed information about this kind of clustering algorithms, you can refer to [115–117].

Analysis:

  • (1) The time complexity of AP is O(n² log n);

  • (2) Advantages: a simple and clear algorithmic idea, insensitivity to outliers, and no need to preset the number of clusters;

  • (3) Disadvantages: high time complexity, not suitable for very large data sets, and the clustering result sensitive to the parameters involved in the AP algorithm.


5.7 Clustering Algorithm Based on Density and Distance

DD (density and distance-based clustering) is another significant clustering algorithm, proposed in Science in 2014 [118], of which the core idea is novel. The main characteristic of DD is its description of the cluster center, which is as follows:


  • (1) With high local density: the number of data points within a certain scope around the cluster center must be big enough;

  • (2) Away from other data points with high local density: the cluster center must be far away from other data points that could be the center of a cluster.


The core idea of DD is to compute, based on the distance function, the local density of each data point and the shortest distance between each data point and the other data points with higher local density, in order to construct the decision graph first, then select the cluster centers based on the decision graph, and finally put each remaining data point into the nearest cluster of higher local density.
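A minimal NumPy sketch of the (rho, delta) computation behind the decision graph; the cutoff-style density estimate and the cutoff value d_c are assumptions of this example (the original paper also discusses a Gaussian-kernel variant):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def decision_graph(X, d_c):
    """rho_i: neighbors within radius d_c; delta_i: distance to the
    nearest point of higher density (max distance for the densest point)."""
    D = squareform(pdist(X))
    rho = (D < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta

X = np.random.randn(200, 2)
rho, delta = decision_graph(X, d_c=0.5)
# Cluster centers are the points with both large rho and large delta.
print(sorted(zip(rho, delta), reverse=True)[:3])
```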


Analysis:

  • (1) The time complexity of DD is O(n²);

  • (2) Advantages: a simple and clear algorithmic idea, suitable for data sets with arbitrary shape, and insensitive to outliers;


5.8 Clustering Algorithm for Spatial Data

Spatial data refers to data with both a time and a space dimension at the same time, sharing the characteristics of being large in scale, high in speed and complex in information. The typical algorithms of this kind of clustering include DBSCAN [36], STING [52], Wavecluster [54] and CLARANS [11]. The core idea of Wavecluster, which can be used for parallel processing, is to carry out the clustering in a new feature space obtained by applying the wavelet transform to the original data. The core idea of CLARANS is to sample based on CLARA [10] and carry out the clustering by PAM [9]. DBSCAN has been discussed in the section Clustering algorithm based on density, and STING has been discussed in the section Clustering algorithm based on grid.


For more detailed information about this kind of clustering algorithms, you can refer to [119–122] and ST-DBSCAN [123].

Time complexity (Table 20):
[Table 20 not reproduced]

5.9 Clustering Algorithm for Data Stream

A data stream has the characteristics of arriving in sequence, being large in scale, and allowing only a limited number of reads. The typical algorithms of this kind of clustering include STREAM [124], CluStream [125], HPStream [126] and DenStream [127], where the latter three are incremental algorithms. STREAM, based on the idea of divide and conquer, deals with the data successively according to the sequence of arrival in order to construct the hierarchical clustering structure. CluStream, which mainly addresses the shortcoming of STREAM that it only describes the original data statically, regards data as a dynamically changing process. So CluStream can not only give a timely response to a request, but can also give the clustering result in terms of different time granularities by computing the micro-clusters online and offline. HPStream, an improvement of CluStream, takes the attenuation of the data's influence over time into consideration and is more suitable for clustering data of high dimension. DenStream, which takes the core idea of the density-based clustering algorithms, is suitable for nonconvex data sets and can deal with outliers efficiently, compared with the algorithms mentioned above in this section.


For more detailed information about this kind of clustering algorithms, you can refer to [128–131] and D-Stream [41,132].

Time complexity (Table 21): the time complexities of CluStream, HPStream and DenStream cover both the online and the offline processes.

[Table 21 not reproduced]

5.10 Clustering Algorithm for Large-Scale Data

Big data shares the characteristics of the 4 Vs: large in volume, rich in variety, high in velocity and doubtful in veracity [133]. The main basic ideas of clustering for big data can be summarized in the following 4 categories:


  • (1) sample clustering [10,18];

  • (2) data merged clustering [17,134];

  • (3) dimension-reducing clustering [135,136];

  • (4) parallel clustering [114,137–139];


Typical algorithms of this kind of clustering are K-means [7], BIRCH [17], CLARA [10], CURE [18], DBSCAN [36], DENCLUE [43], Wavecluster [54] and FC [59].


For more detailed information about this kind of clustering algorithms, you can refer to [2,13,140,141].

The time complexity of DENCLUE is O(n log n), and the complexities of K-means, BIRCH, CLARA, CURE, DBSCAN, Wavecluster and FC have been described before in other sections.


6 Conclusions

This paper starts at the basic definitions of clustering and the typical procedure, lists the commonly used distance (dissimilarity) functions, similarity functions, and evaluation indicators that lay the foundation of clustering, and analyzes the clustering algorithms from two perspectives: the traditional ones that contain 9 categories including 26 algorithms, and the modern ones that contain 10 categories including 45 algorithms. The detailed and comprehensive comparisons of all the discussed clustering algorithms are summarized in Appendix Table 22.


The main purpose of the paper is to introduce the basic and core idea of each commonly used clustering algorithm, specify the source of each one, and analyze its advantages and disadvantages. It is hard to present a complete list of all the clustering algorithms due to the diversity of information, the intersection of research fields and the development of modern computer technology. So 19 categories of the commonly used clustering algorithms, with high practical value and well studied, are selected, and one or several typical algorithms of each category are discussed in detail, so as to give readers a systematic and clear view of the important data analysis method of clustering.


Appendix

[Table 22 not reproduced]


