Feature/Variable importance after a PCA analysis

Posted 2023-02-16


Title: Feature/Variable importance after a PCA analysis
Asked: 2018-11-20 14:28:50

Question:

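I have performed a PCA analysis on my original dataset, and from the compressed dataset transformed by the PCA I have also selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify the original features that are important in the reduced dataset. How do I find out which features are important and which are not among the principal components that remain after the dimension reduction? Here is my code: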

from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)

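In addition, I also tried running a clustering algorithm on the reduced dataset, but to my surprise the score is lower than on the original dataset. How is that possible?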

Comments:

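Regarding your second question: when you reduce the dimensionality, you lose some of the information that is available in the original dataset. So it is not surprising (in most cases) that you do not get better performance than in the high-dimensional setting.

@fabio Good question. See my answer.

What do you mean by important features? In which context?

@fabio See my answer and let me know if it is clear.

Answer 1: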

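First of all, I assume that by features you mean the variables and not the samples/observations. In that case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example I am using the iris data.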

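Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings). See the last paragraph after the plot for more details.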

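Overview:

Part 1: I explain how to check the importance of the features and how to plot a biplot.

Part 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

Part 1: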

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)    

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()

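Visualize what is going on using the biplot.

Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude, higher importance).

Let's first see how much of the variance each PC explains.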

pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]

PC1 explains 72% and PC2 23%. Together, if we keep only PC1 and PC2, they explain 95%.
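The running total behind that statement can be computed directly. The following is a minimal sketch, assuming the same standardized iris data; the 0.95 threshold and the names `cumulative` and `n_keep` are illustrative choices, not part of the original answer. The exact ratios may differ slightly from the numbers printed above depending on the scikit-learn version, because the bundled iris data was corrected at some point.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA().fit(X)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the (assumed) threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_keep)
```

Picking the number of components from the cumulative ratio is the same kind of decision the question describes when it mentions keeping the PCs that explain almost 94% of the variance.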

Now, let's find the most important features.

print(abs( pca.components_ ))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]

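Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row, [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible in the biplot (that is why we often use this plot to summarize the information in a visual way).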

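To sum up, look at the absolute values of the components of the eigenvectors corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.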
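The summary above can be sketched as a ranking of the features on each component. This is a minimal illustration, assuming the standardized iris data and two retained components; the names `loadings` and `ranking` are made up for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X)

# Absolute loadings, shape (n_components, n_features)
loadings = np.abs(pca.components_)

# For each component, feature indices ordered from largest to smallest loading
ranking = np.argsort(-loadings, axis=1)

for k, order in enumerate(ranking, start=1):
    names = [iris.feature_names[j] for j in order]
    print(f"PC{k}: {names}")
```

Each row of `ranking` applies only to its own component, so the ordering for PC1 says nothing about PC2.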

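Part 2:

The important features are the ones that influence the components more and thus have a large absolute value/score on the component.

To get the most important feature on each PC together with its name, and save the result into a pandas dataframe, use this: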

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This prints:

     0  1
 0  PC0  e
 1  PC1  d

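So on the first component (labelled PC0 in the output above, since the index starts at 0) the feature named e is the most important, and on the second component (labelled PC1) it is the feature named d.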

There is a nice article here as well: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
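When several components are retained, one complementary summary (not part of the answer above, so treat it as an assumption) is the share of each feature's variance that the retained PCs capture. It relies on the identity that the covariance matrix equals the sum of eigenvalue times the outer product of each eigenvector, so the diagonal entry for feature j is the sum over components of eigenvalue times squared loading. The function name `retained_variance_share` is hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)

def retained_variance_share(X, n_components):
    """Fraction of each feature's variance captured by the first n_components PCs."""
    pca = PCA(n_components=n_components).fit(X)
    # explained_variance_ holds the eigenvalues; components_ holds the eigenvectors as rows
    captured = (pca.explained_variance_[:, None] * pca.components_ ** 2).sum(axis=0)
    # sklearn uses the n-1 denominator for explained_variance_, so match it here
    return captured / X.var(axis=0, ddof=1)

share = retained_variance_share(X, n_components=2)
for name, s in zip(iris.feature_names, share):
    print(f"{name}: {s:.2f}")
```

A share close to 1 means the feature is almost fully represented by the retained components; with all components kept, every share equals 1.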

Comments:

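Thanks @seralouk for the answer. It makes total sense. However, if I decide that it is good enough to retain the first 3 PCs instead of just PC1, which values among [-0.72101681, 0.24203288, 0.14089226, 0.6338014 ] (the third row) should I choose in order to find out the most important features for that number of PCs? Also, as "important", would you select only the features with a positive magnitude, or are there more accurate decision criteria?

Hello. You should keep PC1 and PC2; that is enough, since they explain 95% of the variance. See my updated answer. Personally, I would not look at PC3, since it explains only 3%! Consider upvoting my answer. Cheers.

Yes, but I already know how many PCs I have to keep. The problem is still to find the important features for PCA(n_components = 2); maybe I did not get your point. Suppose I retain 3 PCs: do I have to look at the third row of "pca.componets_" to understand the relevance of each original feature for the PCs I want to retain?

You have to understand something important first. Each feature influences each PC in a different way. This means that you can only draw conclusions like the following: feature 1, 3 and 4 are the most important/have the highest influence on PC1, and feature 2 is the most important/has the highest influence on PC2, and so on for N components. In my example, I would only draw such conclusions for PC1 and PC2, since these 2 PCs together explain 95% of the variance. Is it clear now?

Since my reputation is below 15, the feedback has been recorded but is not yet publicly displayed. It will be soon :)

Answer 2: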

The pca library contains this functionality.

pip install pca

A demonstration of extracting the feature importance is as follows:

# Import libraries
import numpy as np
import pandas as pd
from pca import pca

# Lets create a dataset with features that have decreasing variance. 
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)

# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])

# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])

#     PC      feature
# 0  PC1      f1
# 1  PC2      f2
# 2  PC3      f3
# 3  PC4      f4
# 4  PC5      f5
# 5  PC6      f6
# 6  PC7      f7
# 7  PC8      f8
# 8  PC9      f9

Plot the explained variance:

model.plot()

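Make the biplot. It can be seen nicely that the first feature, with the most variance (f1), is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected, because most of the variance is in f1, followed by f2, and so on.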

ax = model.biplot(n_feat=10, legend=False)
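The claim that most of the variance sits in f1 can be checked numerically. The sketch below is an assumption-laden stand-in: it uses scikit-learn's PCA rather than the pca library, only four features instead of nine, and a fixed seed chosen purely for reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Four independent integer features with shrinking ranges (a smaller stand-in for f1..f9)
X = np.column_stack([rng.integers(0, high, 250) for high in (100, 50, 25, 10)])

# Per-feature sample variance; a wider range gives a larger variance
variances = X.var(axis=0, ddof=1)

# No scaling here, so the raw variances drive the components
pca = PCA().fit(X)

# Index of the feature with the largest absolute loading on each component
top_feature = np.abs(pca.components_).argmax(axis=1)

print(variances)
print(top_feature)
```

Because the data is not standardized, the ordering of the components follows the ordering of the raw variances. After standardizing, every feature would have unit variance and that ordering would no longer be expected.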

Biplot in 3d. Here we see the expected f3 nicely added to the plot in the z-direction.

ax = model.biplot3d(n_feat=10, legend=False)

Comments:

How do you know that most of the variance is in feature 1?

@erdogant Because the data for f1 is created in the range 0-100: f1=np.random.randint(0,100,250)

Answer 3:

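import numpy as np
import pandas as pd
from IPython.display import display  # display() is only predefined inside notebooks

# original_num_df: the original numeric dataframe
# pca: the fitted PCA model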
def create_importance_dataframe(pca, original_num_df):

    # Change pcs components ndarray to a dataframe
    importance_df  = pd.DataFrame(pca.components_)

    # Assign columns
    importance_df.columns  = original_num_df.columns

    # Change to absolute values
    importance_df =importance_df.apply(np.abs)

    # Transpose
    importance_df=importance_df.transpose()

    # Change column names again

    ## First get number of pcs
    num_pcs = importance_df.shape[1]

    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]

    ## Now rename
    importance_df.columns  =new_columns

    # Return importance df
    return importance_df

# Call function to create importance df
importance_df  =create_importance_dataframe(pca, original_num_df)

# Show first few rows
display(importance_df.head())

# Sort depending on PC of interest

## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending = False)[:10]
print('\nPC1 top 10 features are\n')
display(pc1_top_10_features )

## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending = False)[:10]
print('\nPC2 top 10 features are\n')
display(pc2_top_10_features )

Comments:

It may be more efficient to transpose and take the absolute values of the numpy array before creating the DataFrame.
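That suggestion could look roughly like the sketch below. The function name `create_importance_dataframe_np` and the iris stand-in for `original_num_df` are assumptions made for illustration, and no timing has been done to confirm that this variant is actually faster.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

# Hypothetical stand-in for original_num_df: the iris measurements as a DataFrame
iris = datasets.load_iris()
original_num_df = pd.DataFrame(iris.data, columns=iris.feature_names)
pca = PCA(n_components=2).fit(original_num_df)

def create_importance_dataframe_np(pca, original_num_df):
    # Absolute value and transpose on the ndarray, then wrap once in a DataFrame
    abs_loadings = np.abs(pca.components_).T
    columns = [f"PC{i}" for i in range(1, abs_loadings.shape[1] + 1)]
    return pd.DataFrame(abs_loadings, index=original_num_df.columns, columns=columns)

importance_df = create_importance_dataframe_np(pca, original_num_df)
print(importance_df["PC1"].sort_values(ascending=False))
```

The result has the same layout as the function in the answer: one row per original feature and one column per component.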
