scipy.optimize + kmeans clustering
【Posted】2013-11-18 06:33:14 【Question】: The k-means clustering algorithm I implemented for a project is set up as follows:
import numpy as np
import scipy
import sys
import random
import matplotlib.pyplot as plt
import operator
class KMeansClass:
    # Takes in an array-like dataset of 2-D points and the number of clusters k.
    def __init__(self, dataset, k):
        self.dataset = np.array(dataset)
        # Initialize mins to the maximum possible value.
        self.min_x = sys.maxsize
        self.min_y = sys.maxsize
        # Initialize maxs to the minimum possible value.
        self.max_x = -sys.maxsize - 1
        self.max_y = -sys.maxsize - 1
        self.k = k
        # a is the coefficient matrix that is continually updated as the centroids of the clusters change.
        # It is an m x k matrix where each row corresponds to a training instance and each column to a cluster centroid.
        # Values are either 0 or 1. The value for a training instance (data point) is 1 only for the centroid
        # to which that instance has the least distance; otherwise the value is 0.
        self.a = np.zeros(shape=[self.dataset.shape[0], self.k])
        self.distanceMatrix = np.empty(shape=[self.dataset.shape[0], self.k])
        # Allocate mu (k x 2) uninitialized for now; initializeCentroids() fills it in.
        self.mu = np.empty(shape=[k, 2])
        self.findMinMaxdataPoints()
        self.initializeCentroids()
        self.createDistanceMatrix()
        self.scatterPlotOfInitializedPoints()

    # pointa and pointb are array-like vectors.
    def euclideanDistance(self, pointa, pointb):
        return np.sqrt(np.sum((pointa - pointb) ** 2))

    # Problem initialization and visualization helper methods
    ##############################################################################
    # self.dataset holds the points [(x1,y1),(x2,y2),...(xm,ym)]
    def findMinMaxdataPoints(self):
        for item in self.dataset:
            self.min_x = min(self.min_x, item[0])
            self.min_y = min(self.min_y, item[1])
            self.max_x = max(self.max_x, item[0])
            self.max_y = max(self.max_y, item[1])

    def initializeCentroids(self):
        for i in range(self.k):
            # Each row of mu is a random point with x in [min_x, max_x] and y in [min_y, max_y].
            # random.uniform is used because random.randint rejects non-integer bounds.
            self.mu[i] = (random.uniform(self.min_x, self.max_x), random.uniform(self.min_y, self.max_y))
        self.sortCentroids()
        print(self.mu)

    def sortCentroids(self):
        # The following 3 lines ensure that the mu values are always sorted in ascending order,
        # first with respect to the x values and then with respect to the y values.
        half_sorted = sorted(self.mu, key=operator.itemgetter(1))  # sort by y values
        full_sorted = sorted(half_sorted, key=operator.itemgetter(0))  # stable sort of the y-sorted rows by x values
        self.mu = np.array(full_sorted)

    def scatterPlotOfInitializedPoints(self):
        plt.scatter([item[0] for item in self.dataset], [item[1] for item in self.dataset], color='b')
        plt.scatter([item[0] for item in self.mu], [item[1] for item in self.mu], color='r')
        plt.show()

    ###############################################################################
    # Minimizing the Euclidean distance is the same as minimizing its square.
    def calcSquareEuclideanDistanceBetweenTwoPoints(self, point_a, point_b):
        return np.sum((point_a - point_b) ** 2)

    def createDistanceMatrix(self):
        for i in range(self.dataset.shape[0]):
            for j in range(self.k):
                self.distanceMatrix[i, j] = self.calcSquareEuclideanDistanceBetweenTwoPoints(self.dataset[i], self.mu[j])

    def createCoefficientMatrix(self):
        for i in range(self.dataset.shape[0]):
            self.a[i, self.distanceMatrix[i].argmin()] = 1

    # Update functions for the coefficient matrix and the centroid values:
    def updateCoefficientMatrix(self):
        # Recompute distances to the current centroids and clear old assignments,
        # so that each row of a ends up with exactly one 1.
        self.createDistanceMatrix()
        self.a[:] = 0
        for i in range(self.dataset.shape[0]):
            self.a[i, self.distanceMatrix[i].argmin()] = 1

    def updateCentroids(self):
        for j in range(self.k):
            non_zero_indices = np.nonzero(self.a[:, j])[0]
            # Leave the centroid unchanged if no point is assigned to it.
            if len(non_zero_indices) == 0:
                continue
            # The new centroid is the mean of the data points assigned to cluster j.
            self.mu[j] = self.dataset[non_zero_indices].mean(axis=0)

    ############################################################
    def lossFunction(self):
        loss = 0
        for j in range(self.k):
            # Vectorized: the dot product sums the distances of the points assigned to cluster j.
            loss += np.dot(self.a[:, j], self.distanceMatrix[:, j])
        return loss
My question concerns lossFunction and how to use it with the scipy.optimize package. I want to minimize the loss function iteratively by performing the following steps:
Repeat until convergence:
a> Optimize 'a' while keeping mu constant (I have an
updateCoefficientMatrix method for updating the 'a' matrix, which is an
m x k matrix where we have m training instances and k clusters.)
b> Optimize 'mu' while keeping 'a' constant (I have an updateCentroids
method to do this, where mu is an m x k matrix wherein m is the number of
training instances and k is the number of clusters and the number of
centroids)
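However, I am very new to the scipy.optimize package, so I am writing to ask for help on how to call scipy.optimize to achieve the optimization goal described above.
Basically I have two m x k matrices, and I want to minimize lossFunction() by first optimizing one of them while holding the other constant, and then, in the subsequent step, optimizing the second one while holding the first constant. This can be regarded as a special case of the expectation-maximization problem, but unfortunately I have not yet fully understood what the documentation is trying to say, so I am turning to SO for help.
Thanks in advance!
This is part of a class assignment, so please don't post code! Any guidance or explanation would be greatly appreciated.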
【Answer 1】: Call scipy.optimize.minimize twice, each time with a different objective function.
For the first run, optimize an objective function that takes a as its argument and returns the objective value.
As the second step, run scipy.optimize.minimize again on a second objective function that takes mu as its argument.
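As a rough illustration of the second step, the sketch below minimizes the k-means loss over mu while a stays fixed. It assumes that mu has shape (k, 2), as in the posted code, and that it is flattened because scipy.optimize.minimize works on a 1-D vector. The names optimize_centroids and objective and the four sample points are hypothetical, chosen only for this example.

```python
import numpy as np
from scipy.optimize import minimize


def optimize_centroids(data, a, mu0):
    """Minimize the k-means loss over mu while the assignment matrix a stays fixed."""
    k, dim = mu0.shape

    def objective(mu_flat):
        # minimize passes a 1-D vector, so reshape it back to (k, dim).
        mu = mu_flat.reshape(k, dim)
        # Squared distance from every point to every centroid, shape (m, k).
        sq_dist = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        # Only the distances selected by the 0/1 entries of a contribute.
        return np.sum(a * sq_dist)

    result = minimize(objective, mu0.ravel())
    return result.x.reshape(k, dim)


# Hypothetical toy data: two points per cluster.
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
a = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
mu = optimize_centroids(data, a, mu0=np.zeros((2, 2)))
print(mu)
```

For a fixed a the loss is quadratic in mu, so the numerical minimizer should land close to the mean of the points assigned to each cluster, which is what an updateCentroids-style method computes directly.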
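When writing the objective functions, remember that Python has nested functions, which removes the need to pass mu (in the first case) or a (in the second case) as an additional argument, although this can also be done with minimize(..., args=(mu,)) and minimize(..., args=(a,)).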
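The alternative to a nested function can be sketched as follows: the fixed quantities are passed through the args parameter, which the scipy.optimize.minimize documentation describes as a tuple of extra arguments handed to the objective after the optimized vector. The function name loss_for_mu and the sample data are assumptions made for this illustration.

```python
import numpy as np
from scipy.optimize import minimize


def loss_for_mu(mu_flat, data, a):
    # The extra positional arguments (data, a) arrive after the optimized vector.
    mu = mu_flat.reshape(a.shape[1], data.shape[1])
    sq_dist = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.sum(a * sq_dist)


# Hypothetical toy data: two points per cluster.
data = np.array([[1.0, 1.0], [3.0, 1.0], [8.0, 8.0], [8.0, 6.0]])
a = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# args is a tuple; a single extra argument would be written as args=(a,).
result = minimize(loss_for_mu, np.zeros(4), args=(data, a))
mu_from_args = result.x.reshape(2, 2)
print(mu_from_args)
```

The closure form keeps the call site shorter, while the args form keeps the objective a plain module-level function that is easier to test on its own.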
Repeat the two-step process in a for loop until the answer satisfies your convergence criterion.
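One possible shape of that loop is sketched below, under stated assumptions. The mu step uses scipy.optimize.minimize as described above. The a step does not: a is a 0/1 assignment matrix rather than a continuous variable, so this sketch swaps in a direct argmin over the distances, as the asker's updateCoefficientMatrix does. All function names, the tolerance, and the sample data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def squared_distances(data, mu):
    # Shape (m, k): squared distance from every point to every centroid.
    return ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)


def assign_points(data, mu):
    """a step: one-hot matrix marking the nearest centroid of each point."""
    nearest = squared_distances(data, mu).argmin(axis=1)
    a = np.zeros((data.shape[0], mu.shape[0]))
    a[np.arange(data.shape[0]), nearest] = 1.0
    return a


def fit_centroids(data, a, mu):
    """mu step: minimize the loss over mu with a held fixed."""
    def objective(mu_flat):
        return np.sum(a * squared_distances(data, mu_flat.reshape(mu.shape)))
    return minimize(objective, mu.ravel()).x.reshape(mu.shape)


def alternate(data, mu, max_iter=50, tol=1e-8):
    previous_loss = np.inf
    for _ in range(max_iter):
        a = assign_points(data, mu)
        mu = fit_centroids(data, a, mu)
        loss = np.sum(a * squared_distances(data, mu))
        # Stop once the loss no longer decreases by more than tol.
        if previous_loss - loss < tol:
            break
        previous_loss = loss
    return a, mu, loss


# Hypothetical toy data and starting centroids.
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
a, mu, loss = alternate(data, mu=np.array([[1.0, 1.0], [9.0, 9.0]]))
print(mu, loss)
```

Stopping on the change in loss is only one choice of convergence criterion; checking whether a stopped changing between iterations would serve equally well here.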