Sklearn Chi2 For Feature Selection

Posted 2023-03-12


【Title】Sklearn Chi2 For Feature Selection 【Posted】2019-01-12 17:07:10 【Question】:

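I am learning about chi2 for feature selection and came across code like this.

However, my understanding of chi2 was that a higher score means the feature is more independent of the target (and therefore less useful to the model), so we would be interested in the features with the lowest scores. Yet scikit-learn's SelectKBest returns the features with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than a chi2 statistic?

See the code below for what I mean (mostly copied from the link above, except for the end).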

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# you can see that the kbest returned from SelectKBest 
#+ were the two features with the _highest_ score
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest

【Comments on the question】:

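【Answer 1】:

You have it backwards.

The null hypothesis of the chi2 test is that "the two categorical variables are independent". A higher chi2 statistic is therefore stronger evidence against independence: the feature and the target are dependent on each other, which makes the feature more useful for classification.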
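To make the direction of the score concrete, here is a minimal sketch on synthetic data. The arrays `x_dependent` and `x_independent` are made up for illustration and are not part of the question. One feature is a copy of the class label, the other repeats the same pattern inside every class, so sklearn's `chi2` should score the first one much higher.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: 3 classes with 100 samples each.
y = np.repeat([0, 1, 2], 100)

# Feature that depends on the class (here it simply equals the label).
x_dependent = y.copy()

# Feature whose values follow the same 0,1,2,3 pattern in every class,
# so its per-class totals are identical and it carries no class information.
x_independent = np.tile([0, 1, 2, 3], 75)

X = np.column_stack([x_dependent, x_independent])

# chi2 returns one statistic and one p-value per feature column.
scores, pvalues = chi2(X, y)
print("scores: ", scores)
print("pvalues:", pvalues)
```

The dependent feature ends up with the large statistic and the small p-value, and that is the ordering SelectKBest ranks on.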

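SelectKBest gives you the best two (k=2) features, ranked by the highest chi2 values. So you should take the features it returns, not the "other" features that the chi2 selector leaves out.

You are right to read the chi2 statistics from chi2_selector.scores_ and the best features from chi2_selector.get_support(). Based on the chi2 test of independence, it gives you "petal length (cm)" and "petal width (cm)" as the top 2 features. Hope this clarifies the algorithm.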
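As a quick check of the point above, the following sketch reuses the setup from the question and compares the features picked by `get_support()` with the two highest entries of `scores_`. The variable names `selected` and `top2_by_score` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer conversion as in the question
y = iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)
names = np.asarray(iris.feature_names)

# Features kept by the selector (boolean mask from get_support()).
selected = names[selector.get_support()]

# Features with the two largest chi2 scores, computed independently.
top2_by_score = names[np.argsort(selector.scores_)[-2:]]

print(sorted(selected))
print(sorted(top2_by_score))
```

Both lists should name the same two features, which shows that the selector keeps the highest scores rather than the lowest.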

【Comments】:

Is chi2 a better scoring function than f_classif for non-normal data?
