KMeans and optimization
- random scheme, i.e. the naive approach
input: k, a set of n points
place k centroids at random locations
- repeat the following operations until convergence:
--for each point i:
- find the nearest of the k centroids, j (using the distance formula)
- assign point i to cluster j
--for each cluster j:
- compute the mean of every attribute over the points in cluster j and move centroid j there
(attributes must be numeric, not categorical or ordinal)
- stop when none of the cluster assignments change, i.e. no point changes its cluster membership
- O(iterations*k*n*dimensions) overall; per iteration: O(kn) distance computations, memory O(k+n)
- distances cannot be precached, because every iteration moves the centroids
- optimization
1. k-means++ (an adaptive sampling scheme): slower, but small error; purely random selection: extremely fast, but large error
https://blog.csdn.net/the_lastest/article/details/78288955
Main idea: improve the initialization of the cluster centers.
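As a minimal sketch of this idea (my illustration, not the code from the linked post), each new center is sampled with probability proportional to the squared distance to the nearest already-chosen center (the D^2 weighting):

    import numpy as np

    def kmeans_pp_init(dataset, k, rng=np.random.default_rng()):
        # first center: a uniformly random data point
        centers = [dataset[rng.integers(len(dataset))]]
        for _ in range(k - 1):
            # squared distance from every point to its nearest chosen center
            diffs = dataset[:, None, :] - np.array(centers)[None, :, :]
            d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
            # D^2 weighting: far-away points are more likely to be picked
            centers.append(dataset[rng.choice(len(dataset), p=d2 / d2.sum())])
        return np.array(centers)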
2. AFK-MC2: uses a Markov chain to improve on k-means++ seeding
- AFK-MC2 changes how the seeding is done
paper :https://las.inf.ethz.ch/files/bachem16fast.pdf
An initial data point is a state in the Markov chain;
a further data point is sampled to act as the candidate for the next state;
a randomized decision determines whether the Markov chain transitions to the candidate or remains in the old state;
repeat, and the last state is returned as the next initial cluster center.
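A simplified sketch of one such seeding step (this is closer to K-MC2 with a uniform proposal; the AFK-MC2 of the paper additionally builds a nonuniform proposal distribution in one preprocessing pass):

    import numpy as np

    def mcmc_seed_step(dataset, centers, chain_length, rng=np.random.default_rng()):
        def d2(x):  # squared distance to the nearest already-chosen center
            return min(np.sum((x - c) ** 2) for c in centers)

        state = dataset[rng.integers(len(dataset))]  # initial state of the chain
        for _ in range(chain_length):
            candidate = dataset[rng.integers(len(dataset))]  # sample a candidate
            # randomized decision: move to the candidate with probability
            # min(1, d2(candidate) / d2(state)), otherwise keep the old state
            if rng.random() * max(d2(state), 1e-12) < d2(candidate):
                state = candidate
        return state  # the last state becomes the next cluster center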
- code
- Euclidean distance: np.linalg.norm(a - b)
- load the data: np.loadtxt(fname)
- variables:
parameter: epsilon = 0 // threshold, the minimum change used in the stop condition
history_centroids = [] // snapshots of the centroids at each iteration
iteration counter: iteration = 0
record the configuration: num_instances, num_features = dataset.shape
initialization: prototypes = dataset[np.random.randint(0, num_instances, size=k)] // k random rows as initial centroids
previous centroids, an np.ndarray of the same shape (k rows of num_features elements each): prototypes_old = np.zeros(prototypes.shape)
cluster assignments: belongs_to = np.zeros((num_instances, 1))
4. iterate:
while norm > epsilon:
    iteration += 1
    norm = dist_method(prototypes, prototypes_old) // change between successive centroid sets, used as the stop test
    prototypes_old = prototypes
    for index_in, instance in enumerate(dataset):
        dist_vec = np.zeros((k, 1))
        for index_prototype, prototype in enumerate(prototypes):
            dist_vec[index_prototype] = dist_method(prototype, instance)
        belongs_to[index_in, 0] = np.argmin(dist_vec)
    tmp_prototypes = np.zeros((k, num_features))
    for ..... (each cluster: set its row of tmp_prototypes to the mean of its members, then prototypes = tmp_prototypes)
A complete runnable version is sketched below.
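One way to complete the loop above into a runnable whole (variable names follow these notes; the empty-cluster guard is my addition):

    import numpy as np

    def dist_method(a, b):
        return np.linalg.norm(a - b)  # Euclidean distance; also works on centroid matrices

    def kmeans(dataset, k, epsilon=0):
        num_instances, num_features = dataset.shape
        # initialization: k centroids drawn at random from the data
        prototypes = dataset[np.random.randint(0, num_instances, size=k)]
        prototypes_old = np.zeros(prototypes.shape)
        belongs_to = np.zeros((num_instances, 1))
        history_centroids = [prototypes]

        while dist_method(prototypes, prototypes_old) > epsilon:
            prototypes_old = prototypes
            # assignment step: each point joins the cluster of its nearest centroid
            for index_in, instance in enumerate(dataset):
                dist_vec = np.zeros((k, 1))
                for index_prototype, prototype in enumerate(prototypes):
                    dist_vec[index_prototype] = dist_method(prototype, instance)
                belongs_to[index_in, 0] = np.argmin(dist_vec)
            # update step: move each centroid to the mean of its members
            tmp_prototypes = np.zeros((k, num_features))
            for index in range(k):
                members = np.where(belongs_to[:, 0] == index)[0]
                # keep the old centroid if a cluster ends up empty
                tmp_prototypes[index] = (np.mean(dataset[members], axis=0)
                                         if len(members) else prototypes[index])
            prototypes = tmp_prototypes
            history_centroids.append(prototypes)
        return prototypes, history_centroids, belongs_to

Usage, e.g.: prototypes, history, assignments = kmeans(np.loadtxt("data.txt"), k=3) (the file name is a placeholder).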
- scaling in n and k
sampling and approximation approaches: work poorly; as k grows, the clustering gets worse
initial centroid selection (smarter seeding): e.g. the 'blacklist', 'Elkan's' and 'Hamerly's' algorithms
- blacklist algorithm
Build a tree over the data; while iterating over all the centroids at each node, rule some of them out (blacklist them) for the whole subtree.
setup cost O(nlgn) to build the tree; worst-case computation O(knlgn); memory O(k+nlgn)
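A rough illustration of such an exclusion test (my sketch, not the original algorithm's exact rule), assuming each tree node carries an axis-aligned bounding box [lo, hi]: centroid j can be blacklisted for the subtree when some other centroid's farthest distance to the box is smaller than centroid j's closest distance, since that other centroid is then closer to every possible point of the box:

    import numpy as np

    def min_dist_to_box(c, lo, hi):
        # smallest distance from centroid c to the box (0 if c lies inside)
        return np.linalg.norm(np.maximum(lo - c, 0) + np.maximum(c - hi, 0))

    def max_dist_to_box(c, lo, hi):
        # largest distance from centroid c to any point of the box
        return np.linalg.norm(np.maximum(np.abs(c - lo), np.abs(c - hi)))

    def blacklisted(j, centroids, lo, hi):
        others = [i for i in range(len(centroids)) if i != j]
        best_max = min(max_dist_to_box(centroids[i], lo, hi) for i in others)
        return best_max < min_dist_to_box(centroids[j], lo, hi)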
- Elkan's algorithm
Compute the distances between centroids and use triangle-inequality bounds on point-to-centroid distances to cut down the number of distance computations.
no setup cost; worst case O(k^2+kn); memory O(k^2+kn)
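The inequality behind the skip is standard: if d(c1, c2) >= 2*d(x, c1), then d(x, c2) >= d(x, c1), so c2 can never beat c1 for point x. A minimal sketch of an assignment step using just this one bound (Elkan's full algorithm also maintains per-point upper and lower bounds across iterations; centroids is assumed to be a (k, num_features) array):

    import numpy as np

    def assign_with_skip(dataset, centroids):
        k = len(centroids)
        # pairwise centroid-centroid distances, computed once per iteration
        cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.zeros(len(dataset), dtype=int)
        for i, x in enumerate(dataset):
            best, best_d = 0, np.linalg.norm(x - centroids[0])
            for j in range(1, k):
                if cc[best, j] >= 2 * best_d:
                    continue  # triangle inequality: centroid j cannot be closer
                d = np.linalg.norm(x - centroids[j])
                if d < best_d:
                    best, best_d = j, d
            labels[i] = best
        return labels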
- Dual-Tree k-means with bounded single-iteration runtime
paper: http://www.ratml.org/pub/pdf/2016dual.pdf
- build two trees: a query tree T and a reference tree Q. T: holds the points, with one nearest-neighbor task per instance; Q: the set the nearest neighbors are drawn from
- traverse both simultaneously; when visiting a pair (T.node, Q.node), check whether it can be pruned, and if so prune the whole subtree pair (this framework also covers nearest-neighbor search, kernel density estimation, kernel conditional density estimation, etc.)
- space tree: not a space-partitioning tree, nodes are allowed to overlap; an undirected acyclic rooted simple graph
- each node holds any number of points (possibly 0), connects to one parent node, and has any number of children (possibly 0)
- there is a single root node
- every point is contained in at least one tree node
- every node has a convex subset of the multidimensional space containing all points in the node as well as the convex subsets represented by its children, i.e. every node has a bounding shape that contains all of its descendant points
- traversal
visit each pair (a combination of a T node and a Q node) no more than once and compute a score for the combination
if the score exceeds the bound or is infinite, the combination is pruned; otherwise the score is computed between every point of the T node and every point of the Q node, instead of between every pair of descendant points
when the traversal is down to leaves only, call the base case
!!: dual-tree algorithm = space tree + pruning dual-tree traversal + BaseCase() and Score(). See the linked paper for a deeper treatment.
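A schematic of the traversal alone (my sketch; is_leaf(), children, points, score() and base_case() stand in for the paper's actual interfaces and bounds):

    def dual_tree_traverse(t_node, q_node, score, base_case, bound=float("inf")):
        s = score(t_node, q_node)
        if s == float("inf") or s > bound:
            return  # prune this combination and everything below it
        if t_node.is_leaf() and q_node.is_leaf():
            # base case: compare every point of the T node with every point of the Q node
            for p in t_node.points:
                for q in q_node.points:
                    base_case(p, q)
            return
        # otherwise recurse into child combinations (a leaf side stays fixed)
        t_children = t_node.children if not t_node.is_leaf() else [t_node]
        q_children = q_node.children if not q_node.is_leaf() else [q_node]
        for tc in t_children:
            for qc in q_children:
                dual_tree_traverse(tc, qc, score, base_case, bound)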
Summary:
k-means depends on the choice of initial centers: the error can be fairly large, it may converge to a local rather than the global optimum, and the number of iterations is affected as well. Optimizations either use an algorithm to improve the initial center selection, such as Canopy or hierarchical clustering, or pin down k, e.g. using cluster radius or diameter as the metric: while k is below the true value, the metric changes sharply. Finally there is the scalability problem, with methods that reduce the number of iterations or computations as shown above, e.g. introducing a Markov chain (as in AFK-MC2) or using tree structures to optimize the iterative process.