# Feature Selection For Machine Learning
The goal is to find out which columns (features) are the best ones to use for machine learning. Selecting features before modeling has three benefits:
1. Reduces Overfitting: Less redundant data means less opportunity to make decisions
based on noise.
2. Improves Accuracy: Less misleading data means modeling accuracy improves.
3. Reduces Training Time: Less data means that algorithms train faster.
Solutions
1. Univariate Selection.
2. Recursive Feature Elimination.
3. Principal Component Analysis.
4. Feature Importance.
## Univariate Selection
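Statistical tests can be used to pick the features that have the strongest relationship with the output variable. scikit-learn provides the `SelectKBest` class, which keeps a fixed number of features ranked by a chosen test. The example below uses the chi-squared (`chi2`) statistical test for non-negative features to select the 4 best features.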
```python
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
```
```
[ 111.52 1411.887 17.605 53.108 2175.565 127.669 5.393
181.304]
[[ 148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[ 183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]
[ 137. 168. 43.1 33. ]]
```
You can see the scores for each attribute and the 4 attributes chosen (those with the highest
scores): plas, test, mass and age.
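As a rough sketch of what the chi-squared score measures, the snippet below recomputes it with NumPy on toy data. It assumes the statistic compares the per-class totals of each feature with the totals expected if the feature and the class were independent. The function name `chi2_scores` and the synthetic data are illustrative assumptions, not part of the example above.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the labels."""
    classes = np.unique(y)
    # one-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                    # per-class total of each feature
    feature_totals = X.sum(axis=0)        # overall total of each feature
    class_freq = Y.mean(axis=0)           # share of samples in each class
    expected = np.outer(class_freq, feature_totals)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# toy data (not the Pima file): 4 non-negative features, only column 1 depends on y
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
X = rng.randint(1, 10, size=(200, 4)).astype(float)
X[:, 1] += 20 * y

scores = chi2_scores(X, y)
top2 = np.argsort(scores)[::-1][:2]       # indices of the 2 highest scores
print(scores)
print(top2)
```

The column that was made to depend on the label gets by far the largest score, which is the same reason plas and test stand out in the output above.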
## Recursive Feature Elimination
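Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those attributes that remain. At each step it uses the model's coefficients or feature importances to decide which attributes contribute the least. The example below uses RFE with the logistic regression algorithm to select the top 3 features.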
```python
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression()
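# keep the 3 best features; recent scikit-learn releases require the keyword form
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
# apply % inside print(): in Python 3, print() returns None
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)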
```
```
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
```
RFE chose the top 3 features as preg, mass and pedi.
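The boolean mask and the ranking are easier to read once they are paired with the column names. A minimal sketch, with the values copied by hand from the printed output above (in a live session they would come from `fit.support_` and `fit.ranking_`):

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# values copied from the printed RFE output
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 3, 5, 6, 1, 1, 4]

# features marked True in the mask are the ones RFE kept
selected = [name for name, keep in zip(names, support) if keep]
print(selected)  # ['preg', 'mass', 'pedi']

# rank 1 means selected; larger ranks were eliminated earlier
by_rank = sorted(zip(ranking, names))
for rank, name in by_rank:
    print(rank, name)
```

Reading the ranking this way shows that test (rank 6) was the first attribute to be dropped and plas (rank 2) the last.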
## Principal Component Analysis
so confused with this one
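Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a
compressed form. Generally this is called a data reduction technique. Unlike the other methods
here, PCA does not pick a subset of the original columns: each principal component is a new
feature built as a weighted combination of all of them. A property of PCA is that you can choose
the number of dimensions or principal components in the transformed result. In the example
below, we use PCA and select 3 principal components.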
```python
# Feature Extraction with PCA
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
# apply % inside print(): in Python 3, print() returns None
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
```
```
Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02
9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
[ -2.26488861e-02 -9.72210040e-01 -1.41909330e-01 5.78614699e-02
9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
[ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01
2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]
```
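To make the linear algebra less mysterious, here is a sketch of the same idea done by hand with NumPy's SVD. It runs on toy data rather than the Pima file, and the variable names are illustrative. It is meant to show the shape of the result, not to reproduce scikit-learn's output exactly (the signs of the components can differ).

```python
import numpy as np

# toy data (assumption): 100 rows, 8 columns with different spreads
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 8)) * np.array([10, 5, 2, 1, 1, 1, 1, 1])

# 1. center each column
Xc = X - X.mean(axis=0)
# 2. SVD: the rows of Vt are the principal directions, strongest first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:3]                       # 3 components, 8 weights each
# 3. share of the total variance carried by each direction
explained_variance_ratio = (S ** 2) / (S ** 2).sum()
# 4. project the data onto the 3 components: 8 columns become 3 new ones
reduced = Xc @ components.T

print(components.shape)   # (3, 8)
print(reduced.shape)      # (100, 3)
print(explained_variance_ratio[:3])
```

Each row of `components` corresponds to one row of the matrix printed above: 8 weights, one per original column. In the Pima output the first component is dominated by the fifth weight (0.993, the test column), and the three printed ratios add up to roughly 0.976, so about 97.6% of the variance survives the reduction to 3 columns.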
## Feature Importance
Bagged decision trees such as Random Forest and Extra Trees can be used to estimate the importance
of features. In the example below we construct an **ExtraTreesClassifier** classifier.
```python
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)
```
```
[ 0.11069788 0.2412536 0.10503344 0.08299489 0.06598098 0.1310963 0.12077405 0.14216885]
```
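The scores suggest the importance of plas, age and mass. A larger score means a more important feature. The exact values change from run to run because the trees are built with randomness, unless `random_state` is fixed.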
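Sorting the scores next to the column names makes the ranking explicit. A small sketch, with the importances copied by hand from the printed output above (in a live session they would come from `model.feature_importances_`):

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# values copied from the printed output
importances = [0.11069788, 0.2412536, 0.10503344, 0.08299489,
               0.06598098, 0.1310963, 0.12077405, 0.14216885]

# sort features from most to least important
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%-5s %.3f" % (name, score))

top3 = [name for name, _ in ranked[:3]]
print(top3)  # ['plas', 'age', 'mass']
```

The importances sum to 1, so each value can be read as that feature's share of the total.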