Hierarchical clustering, and an explanation of the scipy hierarchical-clustering Python code
Posted by Icy Hunter
Hierarchical clustering assumes that a hierarchical structure exists among the classes, and groups samples into a hierarchy of clusters. It comes in two flavors: bottom-up (agglomerative) and top-down (divisive). Because each sample belongs to exactly one cluster, hierarchical clustering is a hard-clustering method.
Basic idea (this article covers only the bottom-up, agglomerative approach):
- Before clustering, each sample forms its own cluster
- Compute the distances between clusters and merge the two closest ones into a new cluster (when clusters contain many samples, it is the pair of points with the smallest distance that decides which two clusters merge)
- Repeat step 2 until only one cluster remains
As can be seen, the complexity of agglomerative hierarchical clustering is O(n^3 * m), where n is the number of samples and m is the dimensionality of each sample.
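The steps above can be sketched directly in Python. This is a deliberately naive single-linkage implementation written for illustration (it is not how scipy implements it), and it makes the O(n^3 * m) cost visible: every merge step scans all cluster pairs and all point pairs.

```python
import numpy as np

def naive_single_linkage(data, n_clusters):
    """Naive bottom-up agglomerative clustering with single linkage.

    Each point starts as its own cluster; the two clusters whose
    closest points are nearest to each other are merged repeatedly
    until n_clusters clusters remain.
    """
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        # Scan every pair of clusters for the smallest inter-point distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(data[i] - data[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

# Two well-separated groups of 2D points
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(naive_single_linkage(pts, 2))  # → [[0, 1], [2, 3]]
```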
Two functions in scipy are enough for this task.
scipy.cluster.hierarchy.linkage(data, method='average', metric="euclidean") performs the hierarchical clustering, i.e. the three steps above.
The metric parameter:
metric : str or function, optional
The distance metric to use. The distance function can
be 'braycurtis', 'canberra', 'chebyshev', 'cityblock',
'correlation', 'cosine', 'dice', 'euclidean', 'hamming',
'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching',
'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.
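A quick way to see what these metrics actually compute is scipy.spatial.distance.pdist, which produces the condensed pairwise-distance vector that linkage consumes. A small sketch with made-up points:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three 2D observations chosen so the distances are easy to verify by hand
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed vector of pairwise distances: (0,1), (0,2), (1,2)
print(pdist(X, metric="euclidean"))  # → [ 5. 10.  5.]
print(pdist(X, metric="cityblock"))  # → [ 7. 14.  7.]

# squareform expands the condensed vector into the full symmetric matrix
print(squareform(pdist(X, metric="euclidean")))
```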
The method parameter:
* method='single' assigns
.. math::
   d(u,v) = \min(dist(u[i],v[j]))
for all points :math:`i` in cluster :math:`u` and
:math:`j` in cluster :math:`v`. This is also known as the
Nearest Point Algorithm.
* method='complete' assigns
.. math::
   d(u, v) = \max(dist(u[i],v[j]))
for all points :math:`i` in cluster u and :math:`j` in
cluster :math:`v`. This is also known by the Farthest Point
Algorithm or Voor Hees Algorithm.
* method='average' assigns
.. math::
   d(u,v) = \sum_{ij} \frac{d(u[i], v[j])}{|u|*|v|}
for all points :math:`i` and :math:`j` where :math:`|u|`
and :math:`|v|` are the cardinalities of clusters :math:`u`
and :math:`v`, respectively. This is also called the UPGMA
algorithm.
* method='weighted' assigns
.. math::
d(u,v) = (dist(s,v) + dist(t,v))/2
where cluster u was formed with cluster s and t and v
is a remaining cluster in the forest (also called WPGMA).
* method='centroid' assigns
.. math::
dist(s,t) = ||c_s-c_t||_2
where :math:`c_s` and :math:`c_t` are the centroids of
clusters :math:`s` and :math:`t`, respectively. When two
clusters :math:`s` and :math:`t` are combined into a new
cluster :math:`u`, the new centroid is computed over all the
original objects in clusters :math:`s` and :math:`t`. The
distance then becomes the Euclidean distance between the
centroid of :math:`u` and the centroid of a remaining cluster
:math:`v` in the forest. This is also known as the UPGMC
algorithm.
* method='median' assigns :math:`d(s,t)` like the ``centroid``
method. When two clusters :math:`s` and :math:`t` are combined
into a new cluster :math:`u`, the average of centroids s and t
give the new centroid :math:`u`. This is also known as the
WPGMC algorithm.
* method='ward' uses the Ward variance minimization algorithm.
The new entry :math:`d(u,v)` is computed as follows,
.. math::
   d(u,v) = \sqrt{\frac{|v|+|s|}{T}d(v,s)^2
                + \frac{|v|+|t|}{T}d(v,t)^2
                - \frac{|v|}{T}d(s,t)^2}
where :math:`u` is the newly joined cluster consisting of
clusters :math:`s` and :math:`t`, :math:`v` is an unused
cluster in the forest, :math:`T=|v|+|s|+|t|`, and
:math:`|*|` is the cardinality of its argument. This is also
known as the incremental algorithm.
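To make the methods above concrete, here is a small sketch: linkage returns an (n-1) x 4 matrix Z where each row records the two merged cluster ids, the merge distance, and the size of the new cluster, and hierarchy.fcluster cuts the tree into flat labels. The two synthetic blobs are an assumption chosen for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 10 points each
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(5.0, 0.3, (10, 2))])

# One row per merge step: (cluster_a, cluster_b, distance, new cluster size)
Z = hierarchy.linkage(X, method="ward", metric="euclidean")
print(Z.shape)  # → (19, 4)

# Cut the tree so that at most 2 flat clusters remain
labels = hierarchy.fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With blobs this far apart, every method listed above would recover the same two groups; the choice of method matters much more when clusters overlap or have unequal sizes.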
scipy.cluster.hierarchy.dendrogram(Z, labels=label, above_threshold_color='C0') is mainly used to draw the dendrogram.
The complete code is as follows:
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster import hierarchy  # hierarchical clustering
import matplotlib as mpl

mpl.rcParams['font.sans-serif'] = ['SimHei']  # a font that can render Chinese labels
mpl.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly

iris = load_iris()
data = iris.data
label = iris.target

fig, ax = plt.subplots(1, 1, figsize=(50, 8))  # figsize sets the canvas size (wide, to fit 150 leaves)
# Build the merge tree: clusters are merged by average distance, measured with the Euclidean metric
Z = hierarchy.linkage(data, method='average', metric="euclidean")
hierarchy.dendrogram(Z, labels=label, above_threshold_color='C0')  # draw the dendrogram
plt.xticks(fontsize=14, rotation=0)  # x-axis tick label size and orientation
plt.tight_layout()                   # auto-adjust subplot parameters to fill the figure area
plt.savefig("H_iris.png", dpi=100, bbox_inches='tight')  # save the figure
plt.show()
The data is the iris dataset. The dendrogram shows that class 0 is very clearly separated from classes 1 and 2, and the distinction between classes 1 and 2 is also visible.
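That visual impression can be checked numerically with hierarchy.fcluster: cut the same average-linkage tree into three flat clusters and cross-tabulate them against the true species. This is a sketch added here for illustration, not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from scipy.cluster import hierarchy

iris = load_iris()
Z = hierarchy.linkage(iris.data, method='average', metric="euclidean")
# Cut the tree into at most 3 flat clusters
pred = hierarchy.fcluster(Z, t=3, criterion="maxclust")

# For each true species, count which flat clusters its samples landed in
for species in range(3):
    ids, counts = np.unique(pred[iris.target == species], return_counts=True)
    print(species, dict(zip(ids.tolist(), counts.tolist())))
```

Species 0 (setosa) lands entirely in a single cluster, matching the clean split visible in the dendrogram, while species 1 and 2 share some overlap.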