How can I implement the PAM clustering algorithm using Gower distance in sklearn?

Posted: 2021-06-01 03:27:12

[Question]:

I want to implement the PAM algorithm (KMedoids, method='pam') using Gower distance.

My dataset contains mixed features, numeric and categorical, and several of the categorical features have more than 1000 distinct values.

I found a suitable Gower distance implementation here: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py

My problem is that the sklearn-extra implementation of PAM I am using does not offer a metric='gower' option. So I tried to create a callable, but I am finding it hard to wire the two together.

# cat_mask is a boolean list marking which features of df_ext are categorical
D = gower.gower_matrix(df_ext, cat_features=cat_mask)

# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
def get_gower():
    return sklearn.metrics.pairwise_distances(D, metric='precomputed')

# https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
kmedoids = sklearn_extra.cluster.KMedoids(df_ext, metric=get_gower, method='pam')
kmedoids.fit(df_ext)

I get this ValueError:

ValueError                                Traceback (most recent call last)
<ipython-input-13-9ae677cd636a> in <module>
      1 # https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
      2 kmedoids = KMedoids(df_ext, metric=get_gower, method='pam')
----> 3 kmedoids.fit(df_ext)

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in fit(self, X, y)
    183         random_state_ = check_random_state(self.random_state)
    184 
--> 185         self._check_init_args()
    186         X = check_array(X, accept_sparse=["csr", "csc"])
    187         if self.n_clusters > X.shape[0]:

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_init_args(self)
    154 
    155         # Check n_clusters and max_iter
--> 156         self._check_nonnegative_int(self.n_clusters, "n_clusters")
    157         self._check_nonnegative_int(self.max_iter, "max_iter", False)
    158 

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_nonnegative_int(self, value, desc, strict)
    144         else:
    145             negative = (value is None) or (value < 0)
--> 146         if negative or not isinstance(value, (int, np.integer)):
    147             raise ValueError(
    148                 "%s should be a nonnegative integer. "

D:\ProgramFiles\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1327 
   1328     def __nonzero__(self):
-> 1329         raise ValueError(
   1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I think something is wrong with my callable. Do you know what I am doing wrong?

[Comments]:

[Answer 1]:

K-medoids (PAM) using the Gower metric in Python

Data types: numeric and categorical variables
Results compared against R
Note: consider scaling your numeric data before applying clustering.
import pandas as pd 
import numpy as np
import gower
from sklearn.preprocessing import LabelEncoder
from sklearn_extra.cluster import KMedoids

# Create a dataframe with both numeric and string type columns 

age = [21, 21, 19, 30, 21, 21, 19, 30, 35, 39, 50, 2]
gender = ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M']
civil_status = ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED']
salary = [3000.0, 1200.0 , 32000.0, 1800.0 , 2900.0 , 1100.0 , 10000.0, 1500.0, 200.0, 500.0, 50.0, 5000.0]
available_credit = [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, 6000, 12000, 500, 50]

df_eg = pd.DataFrame({'age': age,
                      'gender': gender,
                      'civil_status': civil_status,
                      'salary': salary,
                      'available_credit': available_credit})
# Label encode categorical variables

df_eg_encoded = df_eg.copy() # Avoid Pandas error
df_eg_encoded[['gender', 'civil_status']] = df_eg_encoded[['gender', 'civil_status']].apply(LabelEncoder().fit_transform)


# Apply Gower distance calculation

gower_mat = gower.gower_matrix(df_eg,  cat_features = [False, True, True, False, False])
# Fit model
km_model = KMedoids(n_clusters = 3, random_state = 0, metric = 'precomputed', method = 'pam', init =  'k-medoids++').fit(gower_mat)  

clusters = km_model.labels_
clusters
> array([1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
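
As a small follow-up sketch (not part of the original answer; it assumes the km_model and df_eg objects defined above): with metric = 'precomputed' the cluster_centers_ attribute is not populated, but medoid_indices_ holds row positions into the distance matrix, and those line up with the rows of df_eg, so the medoids and labels can be mapped back onto the original dataframe.

# Inspect which original rows were selected as medoids
medoids = df_eg.iloc[km_model.medoid_indices_]
print(medoids)

# Attach the cluster labels to the original data for a quick look
df_clustered = df_eg.assign(cluster=km_model.labels_)
print(df_clustered.sort_values('cluster'))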

R code

install.packages("cluster")
age <- c(21,21,19, 30,21,21,19,30, 35, 39, 50, 2)
gender <- c('M','M','N','M','F','F','F','F', 'F', 'M', 'F', 'M')
civil_status <- c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED')
salary <-c (3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0, 200.0, 500.0, 50.0, 5000.0)
available_credit <- c (2200,100,22000,1100,2000,100,6000,2200, 6000, 12000, 500, 50)
X <- data.frame(age, gender, civil_status, salary, available_credit)
print(X)

library(cluster)
gower_mat <- daisy(X, metric = c("gower"))
pamx <- pam(gower_mat, 3)
print(pamx)
> Clustering vector:
> [1] 1 1 2 1 1 3 3 3 3 1 3 1

References

https://pypi.org/project/gower/
https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/daisy
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/pam

[Comments]:

[Answer 2]:

I think I found a solution, but it is very slow on my dataset:

# Implementation ideas are from: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py
# What I did, basically, is reimplement gower_get so that it works sample by sample
# (which is what the scikit-learn-extra KMedoids metric callable requires)

# NOTE: extremely slow on my data. Q: Would it be much easier to use a precomputed D distance matrix? - no, even slower...
def get_gower(x, y, cat_features=cat_mask):
    xi_cat = x[cat_features]
    xi_num = x[np.logical_not(cat_features)]
    xj_cat = y[cat_features]
    xj_num = y[np.logical_not(cat_features)]
    Z = np.array([x, y])
    Z_num = Z[:, np.logical_not(cat_features)]
#     print('Z.shape', Z.shape)
    weight = np.ones(Z.shape[1])
#     print('weight', weight.shape)
    feature_weight_cat = weight[cat_features]
    feature_weight_num = weight[np.logical_not(cat_features)]
    feature_weight_sum = weight.sum()
#     print('feature_weight_sum', feature_weight_sum.shape)
    categorical_features = np.array(cat_features)
    
    num_cols = Z_num.shape[1]
    num_ranges = np.zeros(num_cols)
    num_max = np.zeros(num_cols)
    
    for col in range(num_cols):
        col_array = Z_num[:, col].astype(np.float32)
        col_max = np.nanmax(col_array)  # avoid shadowing the built-in max/min
        col_min = np.nanmin(col_array)

        if np.isnan(col_max):
            col_max = 0.0
        if np.isnan(col_min):
            col_min = 0.0
        num_max[col] = col_max
        num_ranges[col] = (1 - col_min / col_max) if (col_max != 0) else 0.0
        
    # categorical columns
    sij_cat = np.where(xi_cat == xj_cat, np.zeros_like(xi_cat), np.ones_like(xi_cat))
#     print('sij_cat', sij_cat.shape)
    sum_cat = np.multiply(feature_weight_cat,sij_cat).sum() 

    # numerical columns
    abs_delta=np.absolute(xi_num-xj_num)
    sij_num=np.divide(abs_delta, num_ranges, out=np.zeros_like(abs_delta), where=num_ranges!=0)

    sum_num = np.multiply(feature_weight_num,sij_num).sum()
    sums= np.add(sum_cat,sum_num)
    sum_sij = np.divide(sums,feature_weight_sum)
    
    return sum_sij

kmedoids = KMedoids(metric=get_gower, method='pam')
kmedoids.fit(df)

Anyway, I am still open to feedback; there must be a simpler way :-)
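
One possible direction, as a rough sketch rather than a tested implementation (it assumes the same df and cat_mask as in the question, unit feature weights, and the 1 - min/max range convention used by the gower package): most of the work inside get_gower does not depend on the pair (x, y) at all, so the numeric ranges and weights can be precomputed once from the full dataset and captured in a closure, leaving only the categorical mismatch term and the scaled numeric differences in the per-pair call.

import numpy as np
from sklearn_extra.cluster import KMedoids

def make_gower_metric(X, cat_features):
    # Precompute everything that is pair-independent: which columns are
    # categorical, and the range of each numeric column over the full dataset.
    cat = np.asarray(cat_features, dtype=bool)
    num = ~cat
    X_num = np.asarray(X)[:, num].astype(np.float64)

    col_max = np.nan_to_num(np.nanmax(X_num, axis=0))
    col_min = np.nan_to_num(np.nanmin(X_num, axis=0))
    num_ranges = np.zeros_like(col_max)
    nonzero = col_max != 0
    num_ranges[nonzero] = 1.0 - col_min[nonzero] / col_max[nonzero]

    n_features = cat.shape[0]  # with unit weights the weight sum is just the feature count

    def gower(x, y):
        x, y = np.asarray(x), np.asarray(y)
        # Categorical part: number of mismatching categories
        sum_cat = np.sum(x[cat] != y[cat])
        # Numeric part: absolute differences scaled by the precomputed ranges
        delta = np.abs(x[num].astype(np.float64) - y[num].astype(np.float64))
        sij_num = np.divide(delta, num_ranges,
                            out=np.zeros_like(delta), where=num_ranges != 0)
        return (sum_cat + sij_num.sum()) / n_features

    return gower

# kmedoids = KMedoids(n_clusters=3, metric=make_gower_metric(df, cat_mask),
#                     method='pam').fit(df)

This keeps the pairwise callable interface that KMedoids expects while avoiding the repeated range computation; whether it is fast enough on categorical features with 1000+ levels would still need to be measured, and the precomputed-matrix route from the other answer remains the simpler option when the full distance matrix fits in memory.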

[Comments]:
