机器学习实战精读--------奇异值分解(SVD)
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了机器学习实战精读--------奇异值分解(SVD)相关的知识,希望对你有一定的参考价值。
奇异值分解(SVD):是一种强大的降维工具,通过利用SVD来逼近矩阵并从中提取重要特征,通过保留矩阵80%~ 90%的能量,就能得到重要的特征并去掉噪声
SVD分解会降低程序的速度,大型系统中SVD每天运行一次或者频率更低,并且还要离线进行。
隐性语义索引(LST):试图绕过自然语言理解,用统计的办法得到相同的目标
隐性语义分析(LSA):LSA的基本思想就是把高维的文档降到低维空间,那个空间被称为潜在语义空间
协同过滤:通过将用户和其它用户的数据进行对比来实现推荐。
协同过滤的缺点:
① 用户对商品的评价非常稀疏,这样基于用户评价得到的用户间的相似性可能不准确(稀疏性问题)
② 随着用户和商品的增多,系统的性能会越来越低
⑤ 如果从来没有用户对一个商品进行评价,则这个商品就不可能被推荐(最初评价问题)
如何选择是基于物品的相似度还是基于用户的相似度?
基于数目小的。
冷启动的解决方法:把推荐看做是搜索问题。
协同过滤的核心:相似度计算。
#coding:utf-8 from numpy import * from numpy import linalg as la def loadExData(): return[[0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1], [1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [5, 5, 5, 0, 0], [1, 1, 1, 0, 0]] def loadExData2(): return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5], [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3], [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0], [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0], [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0], [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0], [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1], [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4], [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2], [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0], [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]] #相似度=1/(1+距离) def ecludSim(inA,inB): return 1.0/(1.0 + la.norm(inA - inB)) #皮尔逊相关系数,它度量的是两个向量之间的相似度 def pearsSim(inA,inB): if len(inA) < 3 : return 1.0 return 0.5+0.5*corrcoef(inA, inB, rowvar = 0)[0][1] #皮尔逊相关系数计算是由函数corrcoef进行的。 #余弦相似度,计算两个向量夹角的余弦值,夹角为90度,相似度为0;如果两个向量方向相同,相似度为1.0 def cosSim(inA,inB): num = float(inA.T*inB) denom = la.norm(inA)*la.norm(inB) return 0.5+0.5*(num/denom) #计算给定相似度计算方法下,用户对物品的估计评分值 def standEst(dataMat, user, simMeas, item): n = shape(dataMat)[1] simTotal = 0.0; ratSimTotal = 0.0 for j in range(n): userRating = dataMat[user,j] if userRating == 0: continue overLap = nonzero(logical_and(dataMat[:,item].A>0, dataMat[:,j].A>0))[0] if len(overLap) == 0: similarity = 0 else: similarity = simMeas(dataMat[overLap,item], dataMat[overLap,j]) print ‘the %d and %d similarity is: %f‘ % (item, j, similarity) simTotal += similarity ratSimTotal += similarity * userRating if simTotal == 0: return 0 else: return ratSimTotal/simTotal # def svdEst(dataMat, user, simMeas, item): n = shape(dataMat)[1] simTotal = 0.0; ratSimTotal = 0.0 U,Sigma,VT = la.svd(dataMat) Sig4 = mat(eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix xformedItems = dataMat.T * U[:,:4] * Sig4.I #create transformed items for j in range(n): userRating = dataMat[user,j] if userRating == 0 or j==item: continue similarity = simMeas(xformedItems[item,:].T, xformedItems[j,:].T) print ‘the %d and %d similarity is: %f‘ % (item, j, similarity) simTotal += similarity ratSimTotal += similarity * userRating if simTotal == 0: return 0 else: return ratSimTotal/simTotal #产生最高的N个推荐结果 def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst): unratedItems = nonzero(dataMat[user,:].A==0)[1]#find unrated items if len(unratedItems) == 0: return ‘you rated everything‘ itemScores = [] for item in unratedItems: estimatedScore = estMethod(dataMat, user, simMeas, item) itemScores.append((item, estimatedScore)) return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N] #打印矩阵 def printMat(inMat, thresh=0.8): for i in range(32): for k in range(32): if float(inMat[i,k]) > thresh: print 1, #元素大于阀值打印1,否则打印0 else: print 0, print ‘‘ #图像的压缩,它允许基于任意给定的奇异值数目来重构对象 def imgCompress(numSV=3, thresh=0.8): myl = [] for line in open(‘0_5.txt‘).readlines(): newRow = [] for i in range(32): newRow.append(int(line[i])) myl.append(newRow) myMat = mat(myl) print "****original matrix******" printMat(myMat, thresh) U,Sigma,VT = la.svd(myMat) SigRecon = mat(zeros((numSV, numSV))) for k in range(numSV):#construct diagonal matrix from vector SigRecon[k,k] = Sigma[k] reconMat = U[:,:numSV]*SigRecon*VT[:numSV,:] print "****reconstructed matrix using %d singular values******" % numSV if __name__ == ‘__main__‘: imgCompress(2)
脚本执行输出:
****original matrix******
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
****reconstructed matrix using 2 singular values******
本文出自 “付炜超” 博客,谢绝转载!
以上是关于机器学习实战精读--------奇异值分解(SVD)的主要内容,如果未能解决你的问题,请参考以下文章
[机器学习笔记]奇异值分解SVD简介及其在推荐系统中的简单应用