Python code to find feature importances after k-means clustering

Posted 2023-03-12

【Title】: python code to find feature importances after kmeans clustering 【Posted】: 2020-07-14 23:17:00 【Question】:

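I have looked into ways of finding feature importances (my dataset has only 9 features). Below are two approaches for doing so, but I am struggling to write the Python code.

I want to rank each feature by how much it influences the formation of the clusters.

Compute the variance of the centroids along each dimension. The dimensions with the highest variance are the most important for distinguishing the clusters.

If you only have a small number of variables, you can do a kind of leave-one-out test (remove one variable and redo the clustering). Also keep in mind that k-means depends on its initialization, so you want to keep that fixed when redoing the clustering.

Is there any Python code to accomplish this?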

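【Comments】:

There is some related discussion in this question and answers.

What have you tried so far, and what went wrong with your attempts? Please provide a minimal reproducible example.

【Answer 1】:

Consider doing feature selection like this.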

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# UNIVARIATE SELECTION

data = pd.read_csv('C:\\Users\\Excel\\Desktop\\Briefcase\\PDFs\\1-ALL PYTHON & R CODE SAMPLES\\Feature Selection - Machine Learning\\train.csv')
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features


# FEATURE IMPORTANCE
data = pd.read_csv('C:\\your_path\\train.csv')
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

# Correlation Matrix with Heatmap
data = pd.read_csv('C:\\your_path\\train.csv')
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
#get correlations of each features in dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")
plt.show()

Dataset is available here:

https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv

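【Discussion】:

But the question is about k-means. Isn't the above a supervised problem? How are the two connected?

This answer only works if you know your dependent variable, which implies a supervised problem, not the unsupervised case of k-means clustering.

Oh, no. You're right. Sorry about that!

【Answer 2】:

The answer above is evidently a copy and paste from documentation that does not address what the question asks for. Those solutions come from the supervised learning section of the scikit-learn user guide.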
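For the unsupervised setting the question actually asks about, the centroid-variance idea it mentions could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `centroid_variance_ranking`, the synthetic data, the seed, and the choice to standardize the features first with `StandardScaler` are all mine, not part of the original thread.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def centroid_variance_ranking(X, n_clusters=2, random_state=0):
    """Rank features by the variance of the k-means centroids along each dimension."""
    # Assumption: standardize first so that large-scale features do not dominate.
    X_scaled = StandardScaler().fit_transform(X)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    kmeans.fit(X_scaled)
    # cluster_centers_ has shape (n_clusters, n_features); take the variance over clusters.
    variances = kmeans.cluster_centers_.var(axis=0)
    order = np.argsort(variances)[::-1]  # most discriminating feature first
    return order, variances


# Hypothetical synthetic data: features 0 and 1 separate two groups, 2 and 3 are noise.
rng = np.random.default_rng(0)
X_demo = np.zeros((200, 4))
X_demo[:100, 0] = rng.normal(40, 1.0, 100)
X_demo[100:, 0] = rng.normal(70, 3.0, 100)
X_demo[:100, 1] = rng.normal(20, 4.0, 100)
X_demo[100:, 1] = rng.normal(50, 1.0, 100)
X_demo[:, 2] = rng.normal(40, 300.0, 200)
X_demo[:, 3] = rng.normal(20, 40.0, 200)

order, variances = centroid_variance_ranking(X_demo)
for idx in order:
    print(f"feature {idx}: centroid variance = {variances[idx]:.3f}")
```

A feature whose centroids sit far apart (large variance across clusters) contributes more to telling the clusters apart than one whose centroids nearly coincide. Whether to standardize beforehand is a judgment call that changes the ranking.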

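【Answer 3】:

Suppose X has 200 samples and 4 variables (the question has 9, but 4 are enough to illustrate), generated synthetically so that they form two clusters. For visualization, we populate and plot two of the variables at a time.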

import numpy as np
import matplotlib.pyplot as plt
import sklearn
X = np.zeros((200,4))

Feature1_1 = np.random.normal(loc=40, scale=1.0, size=100)
Feature1_2 = np.random.normal(loc=70, scale=3.0, size=100)

Feature2_1 = np.random.normal(loc=20, scale=4.0, size=100)
Feature2_2 = np.random.normal(loc=50, scale=1.0, size=100)

X[:100,0]=Feature1_1
X[100:,0]=Feature1_2
X[:100,1]=Feature2_1
X[100:,1]=Feature2_2

plt.figure(figsize = (5,5))
plt.scatter(X[:,0],X[:,1])
plt.grid()
plt.xlabel('Feature 1',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

Now, let's populate a new feature with a much higher variance.

Feature3_1 = np.random.normal(loc=40, scale=300.0, size=100)
Feature3_2 = np.random.normal(loc=43, scale=280.0, size=100)

Feature2_1 = np.random.normal(loc=20, scale=4.0, size=100)
Feature2_2 = np.random.normal(loc=50, scale=1.0, size=100)


X[:100,2]=Feature3_1
X[100:,2]=Feature3_2

X[:100,1]=Feature2_1
X[100:,1]=Feature2_2

plt.figure(figsize = (5,5))
plt.scatter(X[:,2],X[:,1])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

The last feature also has a larger variance.

Feature3_1 = np.random.normal(loc=40, scale=300.0, size=100)
Feature3_2 = np.random.normal(loc=43, scale=280.0, size=100)

Feature4_1 = np.random.normal(loc=20, scale=40.0, size=100)
Feature4_2 = np.random.normal(loc=22, scale=40.0, size=100)


X[:100,2]=Feature3_1
X[100:,2]=Feature3_2

X[:100,3]=Feature4_1
X[100:,3]=Feature4_2

plt.figure(figsize = (5,5))
plt.scatter(X[:,2],X[:,3])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 4',fontsize=18)

Now, let's cluster them with k-means.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

Now, let's visualize the clusters.

f1=0
f2=1

plt.figure(figsize = (5,5))
plt.scatter(X[kmeans.labels_==0][:,f1],X[kmeans.labels_==0][:,f2])
plt.scatter(X[kmeans.labels_==1][:,f1],X[kmeans.labels_==1][:,f2])
plt.grid()
plt.xlabel('Feature 1',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

f1=2
f2=1

plt.figure(figsize = (5,5))
plt.scatter(X[kmeans.labels_==0][:,f1],X[kmeans.labels_==0][:,f2])
plt.scatter(X[kmeans.labels_==1][:,f1],X[kmeans.labels_==1][:,f2])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

f1=2
f2=3

plt.figure(figsize = (5,5))
plt.scatter(X[kmeans.labels_==0][:,f1],X[kmeans.labels_==0][:,f2])
plt.scatter(X[kmeans.labels_==1][:,f1],X[kmeans.labels_==1][:,f2])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 4',fontsize=18)

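We can now see quite clearly that features 3 and 4 are the only ones that matter here. Note that normalized features would lead to a completely different result: k-means relies on Euclidean distance, so the features with the largest raw variance dominate the clustering.

Finally, we can automate this as follows: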
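The effect of normalization can be checked with a small experiment. The sketch below regenerates similar synthetic data with a fixed seed (my assumption), clusters it once on the raw values and once after `StandardScaler`, and compares each result with the generating groups using `adjusted_rand_score`. The ground-truth groups are only available because the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_groups = np.repeat([0, 1], 100)  # known only because the data is synthetic

X = np.zeros((200, 4))
X[:100, 0] = rng.normal(40, 1.0, 100)     # feature 1: well separated
X[100:, 0] = rng.normal(70, 3.0, 100)
X[:100, 1] = rng.normal(20, 4.0, 100)     # feature 2: well separated
X[100:, 1] = rng.normal(50, 1.0, 100)
X[:100, 2] = rng.normal(40, 300.0, 100)   # feature 3: huge variance, no real separation
X[100:, 2] = rng.normal(43, 280.0, 100)
X[:100, 3] = rng.normal(20, 40.0, 100)    # feature 4: large variance, no real separation
X[100:, 3] = rng.normal(22, 40.0, 100)

# Same estimator settings for both runs; only the feature scaling differs.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print("ARI on raw features:         ", round(ari_raw, 3))
print("ARI on standardized features:", round(ari_scaled, 3))
```

On raw data the split tends to follow the high-variance feature, while after standardization the well-separated features 1 and 2 drive the clusters, so the apparent "importance" of a feature depends heavily on this preprocessing choice.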

for feature in range(X.shape[1]):
    mean1 = X[kmeans.labels_==0][:,feature].mean()
    mean2 = X[kmeans.labels_==1][:,feature].mean()
    
    var1 = X[kmeans.labels_==0][:,feature].var()
    var2 = X[kmeans.labels_==1][:,feature].var()
    
    print('feature:',feature,'Mean difference:',round(abs(mean1-mean2),3),'Total Variance:',round((var1+var2),3))

This results in:

feature: 0 Mean difference: 1.69 Total Variance: 459.464 
feature: 1 Mean difference: 0.879 Total Variance: 449.829 
feature: 2 Mean difference: 66.213 Total Variance: 154932.184 
feature: 3 Mean difference: 2.076 Total Variance: 2731.953
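The leave-one-out test mentioned in the question could be sketched in a similar spirit. The helper name `leave_one_feature_out`, the synthetic data, and the use of `adjusted_rand_score` as the agreement measure are my assumptions. The `random_state` is held fixed across runs, as the question recommends, so that initialization does not confound the comparison.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def leave_one_feature_out(X, n_clusters=2, random_state=0):
    """For each feature, compare the full clustering with the clustering without it."""
    def fit_labels(data):
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        return model.fit_predict(data)

    baseline = fit_labels(X)
    scores = {}
    for feature in range(X.shape[1]):
        reduced = np.delete(X, feature, axis=1)  # drop one column
        # A low ARI means the clustering changed a lot, i.e. the feature mattered.
        scores[feature] = adjusted_rand_score(baseline, fit_labels(reduced))
    return scores


# Hypothetical unnormalized data, similar in shape to the example above.
rng = np.random.default_rng(0)
X = np.zeros((200, 4))
X[:100, 0] = rng.normal(40, 1.0, 100)
X[100:, 0] = rng.normal(70, 3.0, 100)
X[:100, 1] = rng.normal(20, 4.0, 100)
X[100:, 1] = rng.normal(50, 1.0, 100)
X[:100, 2] = rng.normal(40, 300.0, 100)
X[100:, 2] = rng.normal(43, 280.0, 100)
X[:100, 3] = rng.normal(20, 40.0, 100)
X[100:, 3] = rng.normal(22, 40.0, 100)

scores = leave_one_feature_out(X)
for feature, ari in sorted(scores.items(), key=lambda item: item[1]):
    print(f"without feature {feature}: ARI vs. full clustering = {ari:.3f}")
```

Features are listed from most to least influential: removing an influential feature reshuffles the cluster assignments, while removing an irrelevant one leaves them almost unchanged.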
