Hierarchical clustering algorithm: a Python implementation

Posted 2021-04-08 生信学习

Hierarchical clustering of the expression levels of 50 genes in Python

Clustering algorithms implemented in Python

Hierarchical clustering

Hierarchical_Clustering.py

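Approach

The input has 50 rows (genes) and two columns, the experimental group and the control group; each value is the mean expression over that group's 3 replicates. From these values, compute the Euclidean distance matrix DE.
Run hierarchical clustering on the distance matrix.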
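As a minimal sketch of the first step, the distance matrix can be built from (experimental, control) pairs as shown here. The three-gene input and the names `distance_matrix`, `toy_expr` and `toy_dist` are made up for illustration and do not come from the original data file.

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(points[i], points[j])))
            # The matrix is symmetric, so fill both halves at once.
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical mean expression values: (experimental, control) per gene.
toy_expr = [(1.0, 2.0), (4.0, 6.0), (1.0, 3.0)]
toy_dist = distance_matrix(toy_expr)
for row in toy_dist:
    print(row)
```

The diagonal stays at 0.0 because each gene has zero distance to itself; a clustering step has to ignore those entries when it looks for the closest pair.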

Pseudocode for the hierarchical clustering algorithm:

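Hierarchical_Clustering(d, n)

    Form n clusters, each containing a single element
    Build a tree T, assigning a separate vertex to each cluster
    while more than one cluster exists
        Find the two closest clusters C1 and C2
        Merge C1 and C2 into a new cluster C containing |C1| + |C2| elements
        Compute the distance from C to every other cluster
        Add a vertex C to the tree and connect it to C1 and C2
        Delete the rows and columns for C1 and C2 from the distance matrix
        Add a row and a column for the new cluster to the distance matrix
    return T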
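The pseudocode leaves open how the distance between two clusters is computed. The sketch below assumes the minimum over all member pairs (single linkage), which is the rule the script applies through `np.fmin`. It is an illustration on a made-up 4 x 4 matrix, not the script's implementation: `single_linkage`, `toy` and `tree` are hypothetical names, and cluster distances are recomputed from the original matrix instead of updating rows and columns.

```python
def single_linkage(dist):
    """Agglomerative clustering on a full distance matrix (list of lists).

    Returns the tree as a list of (cluster_a, cluster_b, distance) merges,
    where each cluster is a tuple of original item indices.
    """
    # Start with n singleton clusters.
    clusters = [(i,) for i in range(len(dist))]
    tree = []
    while len(clusters) > 1:
        # Find the closest pair; cluster distance = smallest member distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        tree.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Remove the two old clusters (b first, since b > a), then append the new one.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return tree

# Made-up symmetric distance matrix for four items.
toy = [
    [0.0, 5.0, 1.0, 9.0],
    [5.0, 0.0, 4.0, 3.0],
    [1.0, 4.0, 0.0, 8.0],
    [9.0, 3.0, 8.0, 0.0],
]
tree = single_linkage(toy)
for left, right, d in tree:
    print(left, "+", right, "at distance", d)
```

Storing each cluster as a tuple of original indices keeps the merge history readable without any bookkeeping for shifting row numbers.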

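# -*- coding: UTF-8 -*-

import math
import os

import numpy as np
import pandas as pd

# Working directory that holds the input file (the author's local path).
os.chdir(r'F:\pycharm_project\cluster')
# Read the header line plus 50 gene rows; header=None keeps the header as row 0.
data = pd.read_csv(r'.\microarray_gcrma_diff_TOP50.csv', nrows=51, header=None)
# Skip row 0 and column 0; keep the two expression columns as floats.
Exp_matrix = np.array(data.iloc[1:51, 1:3], dtype=np.float64)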

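def Euclidean_Distances(A):
    """Return the n x n matrix of pairwise Euclidean distances between the rows of A."""
    n = A.shape[0]
    mydist = np.zeros((n, n))
    # Iterate over all n rows of the input.
    for i in range(n):
        for j in range(n):
            d = math.sqrt((A[i][0] - A[j][0]) ** 2 + (A[i][1] - A[j][1]) ** 2)
            # Distances are stored rounded to two decimals.
            mydist[i, j] = round(d, 2)
    return mydist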

DE = Euclidean_Distances(Exp_matrix)
#print(DE)

def find_min_index_list(A):
    """Return [[i, j]], the positions (i < j) of the closest pair of clusters."""
    # Work on a copy and mask only the diagonal (self-distances), so that a
    # genuine zero distance between two different rows can still be selected.
    B = A.copy()
    np.fill_diagonal(B, np.inf)
    # argmin returns the first minimum, so ties yield exactly one pair.
    i, j = np.unravel_index(np.argmin(B), B.shape)
    return [[int(min(i, j)), int(max(i, j))]]



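def delete_list(mylist, index_list):
    """Remove the elements at the given positions from mylist, in place."""
    # Delete from the highest position down so that earlier deletions
    # do not shift the positions still to be removed.
    for idx in sorted(index_list, reverse=True):
        del mylist[idx]
    return mylist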

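def class_Euclidean_min_Distances(min_index, A):
    """Merge the two clusters in min_index and return the updated distance matrix."""
    # Distance from the merged cluster to every other cluster: the smaller
    # of the two old distances (single linkage).
    min_row1 = A[min_index[0], ]
    min_row2 = A[min_index[1], ]
    class_dis_to_other = np.fmin(min_row1, min_row2)

    # Drop the entries that refer to the two merged clusters themselves.
    mylist = list(class_dis_to_other)
    mylist_del = delete_list(mylist, sorted(min_index))

    # Remove the rows and columns of the two merged clusters.
    A_del_row = np.delete(A, [min_index[0], min_index[1]], 0)
    A_del_column = np.delete(A_del_row, [min_index[0], min_index[1]], 1)

    # Append the merged cluster as the last row and the last column;
    # its distance to itself is set to infinity.
    A_add_row = np.vstack((A_del_column, mylist_del))
    mylist_del.append(np.inf)
    A_add_column = np.column_stack((A_add_row, mylist_del))
    return A_add_column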



def hierarchical_clustering(A, N):
    """Merge the closest pair of clusters repeatedly until N clusters remain."""
    n_class = len(A)
    n_layer = 1
    while n_class > N:
        # Merge one pair per pass: positions in A change after every merge,
        # so the closest pair has to be looked up again each time.
        min_index = find_min_index_list(A)[0]
        A = class_Euclidean_min_Distances(min_index, A)
        result = "the " + str(n_layer) + " layer class process is " + "{g" + str(
            min_index[0]) + "," + "g" + str(min_index[1]) + "}" + "--->" + "g" + str(n_class - 1)
        n_class = n_class - 1
        print(result)
        n_layer += 1
    return A

A = hierarchical_clustering(DE, 1)

Hierarchical_Clustering output and explanation

the 1 layer class process is {g38,g17}--->g49
the 2 layer class process is {g48,g16}--->g48
the 3 layer class process is {g47,g15}--->g47
the 4 layer class process is {g46,g14}--->g46
the 5 layer class process is {g45,g13}--->g45
the 6 layer class process is {g44,g12}--->g44
the 7 layer class process is {g43,g11}--->g43
the 8 layer class process is {g42,g10}--->g42
the 9 layer class process is {g41,g9}--->g41
the 10 layer class process is {g40,g8}--->g40
the 11 layer class process is {g39,g7}--->g39
the 12 layer class process is {g38,g6}--->g38
the 13 layer class process is {g37,g5}--->g37
the 14 layer class process is {g36,g4}--->g36
the 15 layer class process is {g35,g3}--->g35
the 16 layer class process is {g34,g2}--->g34
the 17 layer class process is {g33,g1}--->g33
the 18 layer class process is {g32,g0}--->g32
the 19 layer class process is {g31,g0}--->g31
the 20 layer class process is {g30,g0}--->g30
the 21 layer class process is {g29,g0}--->g29
the 22 layer class process is {g28,g0}--->g28
the 23 layer class process is {g27,g0}--->g27
the 24 layer class process is {g26,g0}--->g26
the 25 layer class process is {g25,g0}--->g25
the 26 layer class process is {g24,g0}--->g24
the 27 layer class process is {g23,g0}--->g23
the 28 layer class process is {g22,g0}--->g22
the 29 layer class process is {g21,g0}--->g21
the 30 layer class process is {g20,g0}--->g20
the 31 layer class process is {g19,g0}--->g19
the 32 layer class process is {g18,g0}--->g18
the 33 layer class process is {g17,g0}--->g17
the 34 layer class process is {g16,g0}--->g16
the 35 layer class process is {g15,g0}--->g15
the 36 layer class process is {g14,g0}--->g14
the 37 layer class process is {g13,g0}--->g13
the 38 layer class process is {g12,g0}--->g12
the 39 layer class process is {g11,g0}--->g11
the 40 layer class process is {g10,g0}--->g10
the 41 layer class process is {g9,g0}--->g9
the 42 layer class process is {g8,g0}--->g8
the 43 layer class process is {g7,g0}--->g7
the 44 layer class process is {g6,g0}--->g6
the 45 layer class process is {g5,g0}--->g5
the 46 layer class process is {g4,g0}--->g4
the 47 layer class process is {g3,g0}--->g3
the 48 layer class process is {g2,g0}--->g2
the 49 layer class process is {g1,g0}--->g1

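How to read the log: for example, {g38,g17}--->g49 means that the entries at positions 38 and 17 (counting from 0) of the 50-gene list are merged into one cluster, which is appended after the remaining 48 entries as the new 49th "gene". Counting from 0 that is position 48, which is why it shows up as g48 on the next line. Every later step works the same way: two entries are merged and the result is appended to the end of the remaining ones. The positions refer to the current distance matrix, which loses one row and one column per merge, so from the second step on they are no longer the original gene numbers. The listing is reproduced as posted; the actual pairs depend on the input file, which is not included here.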
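Because each log line refers to positions in the current distance matrix, recovering the original genes behind a merge needs a little bookkeeping. The sketch below is one way to do it, assuming the convention used by the script (the two merged rows are removed and the new cluster is appended at the end). The function `relabel_merges` and the 4-item log are hypothetical and not taken from the output above.

```python
def relabel_merges(n_items, position_pairs):
    """Translate merges given as positions in the shrinking matrix
    into tuples of original item indices."""
    # Entry k of `current` holds the original indices behind row k.
    current = [(k,) for k in range(n_items)]
    history = []
    for i, j in position_pairs:
        left, right = current[i], current[j]
        history.append((left, right))
        # Drop the higher position first so the lower one does not shift.
        for pos in sorted((i, j), reverse=True):
            del current[pos]
        # The merged cluster becomes the last row.
        current.append(left + right)
    return history

# Hypothetical log for 4 items: rows 2 and 0 merge, then rows 2 and 1, then rows 1 and 0.
history = relabel_merges(4, [(2, 0), (2, 1), (1, 0)])
for left, right in history:
    print(left, "+", right)
```

In this toy log the second pair (2, 1) does not mean original items 2 and 1: row 2 is by then the cluster made of items 2 and 0, and row 1 is item 3.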
