How can I implement the PAM clustering algorithm using Gower distance in sklearn?
【Title】How can I implement the PAM clustering algorithm using Gower distance in sklearn? 【Posted】2021-06-01 03:27:12 【Question】I want to implement the PAM algorithm (KMedoids, method='pam') using the Gower distance.
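My dataset contains mixed features, both numeric and categorical, and several of the categorical features have more than 1000 distinct values.
I found a suitable Gower distance implementation here: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py
My problem is that the sklearn-extra implementation of PAM that I am using does not provide a metric='gower' option, so I tried to create a callable, but I am finding it hard to wire the two together.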
# cat_mask is a boolean list marking which features in df_ext are categorical
D = gower.gower_matrix(df_ext, cat_features=cat_mask)

# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
def get_gower():
    return sklearn.metrics.pairwise_distances(D, metric='precomputed')

# https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
kmedoids = sklearn_extra.cluster.KMedoids(df_ext, metric=get_gower, method='pam')
kmedoids.fit(df_ext)
I get this ValueError:
ValueError Traceback (most recent call last)
<ipython-input-13-9ae677cd636a> in <module>
1 # https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
2 kmedoids = KMedoids(df_ext, metric=get_gower, method='pam')
----> 3 kmedoids.fit(df_ext)
D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in fit(self, X, y)
183 random_state_ = check_random_state(self.random_state)
184
--> 185 self._check_init_args()
186 X = check_array(X, accept_sparse=["csr", "csc"])
187 if self.n_clusters > X.shape[0]:
D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_init_args(self)
154
155 # Check n_clusters and max_iter
--> 156 self._check_nonnegative_int(self.n_clusters, "n_clusters")
157 self._check_nonnegative_int(self.max_iter, "max_iter", False)
158
D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_nonnegative_int(self, value, desc, strict)
144 else:
145 negative = (value is None) or (value < 0)
--> 146 if negative or not isinstance(value, (int, np.integer)):
147 raise ValueError(
148 "%s should be a nonnegative integer. "
D:\ProgramFiles\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1327
1328 def __nonzero__(self):
-> 1329 raise ValueError(
1330 f"The truth value of a type(self).__name__ is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think something is wrong with my callable. Do you know what I am doing wrong?
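A note on the traceback above: it fails inside `_check_nonnegative_int(self.n_clusters, "n_clusters")`, which suggests that the DataFrame passed as the first positional argument to `KMedoids(...)` was taken as `n_clusters`, rather than the callable being at fault. The sketch below reproduces the same truth-value error with pandas alone. `check_nonnegative_int` is a hypothetical stand-in that only mimics the two lines visible in the traceback; it is not the library's actual function.

```python
import pandas as pd


def check_nonnegative_int(value):
    # Hypothetical stand-in mimicking the check shown in the traceback.
    negative = (value is None) or (value < 0)
    # For a DataFrame, `value < 0` is itself a DataFrame, so the `if`
    # below has to evaluate bool(DataFrame), which pandas refuses to do.
    if negative or not isinstance(value, int):
        raise ValueError("n_clusters should be a nonnegative integer.")
    return value


df = pd.DataFrame({"age": [21, 30], "salary": [3000.0, 1800.0]})

try:
    check_nonnegative_int(df)  # a DataFrame where an integer is expected
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

Under that reading, passing the cluster count by keyword (for example `n_clusters=3`) and handing the data only to `fit` should avoid this particular error.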
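【Answer 1】: K-medoids (PAM) with the Gower metric in Python
Data types: numerical and categorical variables
Results compared with R
Note: consider scaling your numerical data before applying clustering.
import pandas as pd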
import numpy as np
import gower
from sklearn.preprocessing import LabelEncoder
from sklearn_extra.cluster import KMedoids
# Create a dataframe with both numeric and string type columns
age = [21, 21, 19, 30, 21, 21, 19, 30, 35, 39, 50, 2]
gender = ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M']
civil_status = ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED']
salary = [3000.0, 1200.0 , 32000.0, 1800.0 , 2900.0 , 1100.0 , 10000.0, 1500.0, 200.0, 500.0, 50.0, 5000.0]
available_credit = [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, 6000, 12000, 500, 50]
df_eg = pd.DataFrame({'age': age,
                      'gender': gender,
                      'civil_status': civil_status,
                      'salary': salary,
                      'available_credit': available_credit})
# Label encode categorical variables
df_eg_encoded = df_eg.copy() # Avoid Pandas error
df_eg_encoded[['gender', 'civil_status']] = df_eg_encoded[['gender', 'civil_status']].apply(LabelEncoder().fit_transform)
# Apply Gower distance calculation
gower_mat = gower.gower_matrix(df_eg, cat_features = [False, True, True, False, False])
# Fit model
km_model = KMedoids(n_clusters = 3, random_state = 0, metric = 'precomputed', method = 'pam', init = 'k-medoids++').fit(gower_mat)
clusters = km_model.labels_
clusters
> array([1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
R code
install.packages("cluster")
age <- c(21,21,19, 30,21,21,19,30, 35, 39, 50, 2)
gender <- c('M','M','N','M','F','F','F','F', 'F', 'M', 'F', 'M')
civil_status <- c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED')
salary <- c(3000.0, 1200.0, 32000.0, 1800.0, 2900.0, 1100.0, 10000.0, 1500.0, 200.0, 500.0, 50.0, 5000.0)
available_credit <- c(2200, 100, 22000, 1100, 2000, 100, 6000, 2200, 6000, 12000, 500, 50)
X <- data.frame(age, gender, civil_status, salary, available_credit)
print(X)
library(cluster)
gower_mat <- daisy(X, metric = c("gower"))
pamx <- pam(gower_mat, 3)
print(pamx)
> Clustering vector:
> [1] 1 1 2 1 1 3 3 3 3 1 3 1
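The two clustering vectors use different label numbers (0 to 2 in Python, 1 to 3 in R), so comparing them element by element is misleading. A small sketch of one way to check that both outputs describe the same grouping, ignoring label names; `same_partition` is a hypothetical helper, not part of either library:

```python
def same_partition(labels_a, labels_b):
    """Return True if both label vectors group the samples identically."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must always pair with the same label
        # on the other side, in both directions.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


python_labels = [1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1]
r_labels = [1, 1, 2, 1, 1, 3, 3, 3, 3, 1, 3, 1]
print(same_partition(python_labels, r_labels))  # True
```

For the vectors shown above, Python cluster 0 corresponds to R cluster 3, while clusters 1 and 2 keep their numbers.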
References
https://pypi.org/project/gower/
https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/daisy
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/pam
【Answer 2】: I think I found a solution, but it is very slow on my dataset:
# code implementation ideas are from here: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py
# what I did basically is implemented gower_get to be usable for data sample by data sample calculation (this is what
# scikit-learn-extra.KMedoids metric requires)
# NOTE: extremely slow on my data. Q: Would it be much easier to use a precomputed D distance matrix? - no, even slower...
import numpy as np
from sklearn_extra.cluster import KMedoids

# cat_mask (boolean list of categorical columns) and df are defined elsewhere
def get_gower(x, y, cat_features=cat_mask):
    xi_cat = x[cat_features]
    xi_num = x[np.logical_not(cat_features)]
    xj_cat = y[cat_features]
    xj_num = y[np.logical_not(cat_features)]

    Z = np.array([x, y])
    Z_num = Z[:, np.logical_not(cat_features)]
    # print('Z.shape', Z.shape)

    weight = np.ones(Z.shape[1])
    # print('weight', weight.shape)
    feature_weight_cat = weight[cat_features]
    feature_weight_num = weight[np.logical_not(cat_features)]
    feature_weight_sum = weight.sum()
    # print('feature_weight_sum', feature_weight_sum.shape)

    categorical_features = np.array(cat_features)

    # NOTE: the ranges below are derived only from the two rows x and y
    num_cols = Z_num.shape[1]
    num_ranges = np.zeros(num_cols)
    num_max = np.zeros(num_cols)
    for col in range(num_cols):
        col_array = Z_num[:, col].astype(np.float32)
        col_max = np.nanmax(col_array)
        col_min = np.nanmin(col_array)
        if np.isnan(col_max):
            col_max = 0.0
        if np.isnan(col_min):
            col_min = 0.0
        num_max[col] = col_max
        num_ranges[col] = (1 - col_min / col_max) if (col_max != 0) else 0.0

    # categorical columns
    sij_cat = np.where(xi_cat == xj_cat, np.zeros_like(xi_cat), np.ones_like(xi_cat))
    # print('sij_cat', sij_cat.shape)
    sum_cat = np.multiply(feature_weight_cat, sij_cat).sum()

    # numerical columns
    abs_delta = np.absolute(xi_num - xj_num)
    sij_num = np.divide(abs_delta, num_ranges, out=np.zeros_like(abs_delta), where=num_ranges != 0)
    sum_num = np.multiply(feature_weight_num, sij_num).sum()

    sums = np.add(sum_cat, sum_num)
    sum_sij = np.divide(sums, feature_weight_sum)
    return sum_sij

kmedoids = KMedoids(metric=get_gower, method='pam')
kmedoids.fit(df)
In any case, I am still open to feedback; there must be a simpler way :-)
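On the speed concern: a Python callable metric is invoked once per pair of rows, so it may be cheaper to build the whole distance matrix once with vectorized NumPy and pass it as a precomputed metric, as the first answer does with the gower package. The sketch below is an illustrative NumPy-only version, not the gower package's implementation. The function name `gower_matrix_sketch` and the split into separate numeric and categorical arrays are assumptions made for the example. It takes each numeric column's range over the whole dataset, whereas the function above derives ranges from the two rows it receives.

```python
import numpy as np


def gower_matrix_sketch(X_num, X_cat):
    """Pairwise Gower distances for n samples (illustrative sketch).

    X_num: (n, p) array of numeric features.
    X_cat: (n, q) array of categorical labels or codes.
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    n = X_num.shape[0]
    total = np.zeros((n, n))

    # Numeric part: |xi - xj| divided by the column range over all rows.
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    for col in range(X_num.shape[1]):
        if ranges[col] > 0:  # a constant column contributes zero distance
            diff = np.abs(X_num[:, col, None] - X_num[None, :, col])
            total += diff / ranges[col]

    # Categorical part: 0 when the labels match, 1 otherwise.
    for col in range(X_cat.shape[1]):
        total += (X_cat[:, col, None] != X_cat[None, :, col]).astype(float)

    # Unweighted mean over all features.
    return total / (X_num.shape[1] + X_cat.shape[1])


ages = np.array([[21.0], [30.0], [50.0]])
genders = np.array([["M"], ["F"], ["M"]])
D = gower_matrix_sketch(ages, genders)
print(D.round(3))
```

The resulting array could then be passed to `KMedoids(n_clusters=..., metric='precomputed', method='pam').fit(D)`, the call form used in the first answer. Note that the matrix needs memory quadratic in the number of rows, which may itself become the bottleneck on large datasets.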