Feature selection in scikit learn for multiple variables and thousands+ features

Posted: 2016-01-24 01:56:09

Question: I am trying to perform feature selection for a logistic regression classifier. There were originally 4 variables: name, location, gender, and the label = ethnicity. One of the variables, the name, generates tens of thousands of "features": for example, the name "John Snow" produces 2-letter substrings like 'jo', 'oh', 'hn', etc. The feature set is then passed through DictVectorizer.
I am trying to follow this tutorial (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html), but I am not sure I am doing it right, because the tutorial uses a small number of features while mine has tens of thousands after vectorization. Also, plt.show() displays a blank chart.
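For illustration, the kind of 2-letter substring features described above can be sketched as follows (the names and the `substring=` key format are taken from the question's own code; this is a minimal stand-in, not the asker's full pipeline):

```python
from sklearn.feature_extraction import DictVectorizer

def char_bigrams(name):
    """Extract overlapping 2-letter substrings from a name as a feature dict."""
    s = name.lower().replace(" ", "")
    return {"substring=" + s[i:i+2]: True for i in range(len(s) - 1)}

names = ["John Snow", "Jon Snow"]
dicts = [char_bigrams(n) for n in names]

dv = DictVectorizer()
X = dv.fit_transform(dicts)  # sparse matrix, one column per distinct bigram
print(X.shape)               # (2, 8): 8 distinct bigrams across both names
```

This shows why the feature count explodes: every distinct bigram across the corpus becomes its own column.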
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import random
import time
from random import randint
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
# Assign X and y variables
X = df.raw_name.values
X2 = df.name.values
X3 = df.gender.values
X4 = df.location.values
y = df.ethnicity_scan.values
# Feature extraction functions
def feature_full_name(nameString):
try:
full_name = nameString
if len(full_name) > 1: # not accept name with only 1 character
return full_name
else: return '?'
except: return '?'
def feature_avg_wordLength(nameString):
try:
space = 0
for i in nameString:
if i == ' ':
space += 1
length = float(len(nameString) - space)
name_entity = float(space + 1)
avg = round(float(length/name_entity), 0)
return avg
except:
return 0
def feature_name_entity(nameString2):
space = 0
try:
for i in nameString2:
if i == ' ':
space += 1
return space+1
except: return 0
def feature_gender(genString):
try:
gender = genString
if len(gender) >= 1:
return gender
else: return '?'
except: return '?'
def feature_noNeighborLoc(locString):
try:
x = re.sub(r'^[^, ]*', '', locString) # remove everything before and include first ','
y = x[2:] # remove subsequent ',' and ' '
return y
except: return '?'
def list_to_dict(substring_list):
try:
substring_dict = {}
for i in substring_list:
substring_dict['substring='+str(i)] = True
return substring_dict
except: return '?'
# Transform format of X variables, and spit out a numpy array for all features
my_dict13 = [{'name-entity': feature_name_entity(feature_full_name(i))} for i in X2]
my_dict14 = [{'avg-length': feature_avg_wordLength(feature_full_name(i))} for i in X]
my_dict15 = [{'gender': feature_full_name(i)} for i in X3]
my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]
my_dict17 = [{'dummy1': 1} for i in X]
my_dict18 = [{'dummy2': random.randint(0,2)} for i in X]
all_dict = []
for i in range(0, len(my_dict13)):
temp_dict = dict(my_dict13[i].items() + my_dict14[i].items()
+ my_dict15[i].items() + my_dict16[i].items() + my_dict17[i].items() + my_dict18[i].items()
)
all_dict.append(temp_dict)
dv = DictVectorizer()
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Feature selection
plt.figure(1)
plt.clf()
X_indices = np.arange(X_train.shape[-1])
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
label=r'Univariate score ($-Log(p_value)$)', color='g')
plt.show()
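With tens of thousands of features, one bar per feature is far too dense to render as a readable chart, which is consistent with the blank plot. A more practical route is to skip the plot and inspect the selector directly. A minimal sketch on synthetic stand-in data (the shapes and `make_classification` dataset are assumptions, not the asker's data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in for a wide vectorized feature matrix.
X, y = make_classification(n_samples=200, n_features=1000,
                           n_informative=10, random_state=0)

selector = SelectPercentile(f_classif, percentile=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 100): top 10% of 1000 features kept

# Column indices of the highest-scoring features; with a fitted
# DictVectorizer these indices can be mapped back to feature names.
top = np.argsort(selector.scores_)[::-1][:5]
print(top)
```

`fit_transform` both scores and prunes the matrix in one step, so the downstream classifier trains only on the retained columns.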
Warning:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py:111: UserWarning: Features [[0 0 0 ..., 0 0 0]] are constant.
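This warning means some columns never vary across samples; the `'dummy1'` feature above, which is 1 for every row, is exactly such a column, and `f_classif` cannot compute a meaningful F-statistic for it. One way to drop constant columns before selection is `VarianceThreshold` (a toy 3x3 matrix here, not the asker's data):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the middle column is constant, like the 'dummy1' feature.
X = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0],
              [3.0, 1.0, 1.0]])

vt = VarianceThreshold()   # default threshold=0.0 drops zero-variance columns
X_var = vt.fit_transform(X)
print(X_var.shape)         # (3, 2)
print(vt.get_support())    # [ True False  True]
```

`VarianceThreshold` also accepts the sparse matrices that `DictVectorizer` produces, so it can sit directly in front of `SelectPercentile`.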
Comments:

There is no error trace, only the warning above, and it does produce a chart (albeit a blank one).

Answer 1:

The way you split your data into training and test sets does not seem to work:
# Separate the training and testing data sets
X_train = newX[:half_cut]
X_test = newX[half_cut:]
If you are already using sklearn, it is more convenient to use the built-in splitting routine:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.5, random_state=0)
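Note that in current scikit-learn versions the `cross_validation` module has been removed; the equivalent function lives in `sklearn.model_selection`, and `stratify=y` keeps the class proportions the same in both halves. A sketch on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)
print(X_train.shape, X_test.shape)  # (50, 20) (50, 20)
```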