Multiple features into one using Pipeline and FeatureUnion from Python scikit-learn
Posted: 2016-02-01 15:08:50
Question: I want to train a model to predict a person's gender. I have two features, "name" and "randint", each coming from a different Pandas column. I am trying to 1) combine them into a single pipeline/FeatureUnion, and 2) add the predicted labels to the original pandas DataFrame. However, I get an error for the first goal, 1):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.cross_validation import train_test_split
from sklearn.base import TransformerMixin
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import FeatureUnion
import numpy as np
clf = make_pipeline(CountVectorizer(), LogisticRegressionCV(cv=2))
data = {
    'Bruce Lee': 'Male',
    'Bruce Banner': 'Male',
    'Peter Parker': 'Male',
    'Peter Poker': 'Male',
    'Peter Springsteen': 'Male',
    'Bruce Willis': 'Male',
    'Sarah McLaughlin': 'Female',
    'Sarah Silverman': 'Female',
    'Sarah Palin': 'Female',
    'Sarah Hyland': 'Female',
    'Bruce Li': 'Male',
    'Bruce Milk': 'Male',
    'Bruce Springsteen': 'Male',
    'Bruce Willis': 'Male',
    'Sally Juice': 'Female',
    'Sarah Silverwoman': 'Female',
    'Sarah Palin': 'Female',
    'Sarah Hyland': 'Female',
    'Bruce Paul': 'Male',
    'Bruce Lame': 'Male',
    'Bruce Springsteen': 'Male',
    'Bruce Willis': 'Male',
    'Sarah Willis': 'Female',
    'Sarah Goldman': 'Female',
    'Sarah Palin': 'Female',
    'Sally Hyland': 'Female',
    'Bruce McDonald': 'Male',
    'Bruce Lane': 'Male',
    'Peter Springsteen': 'Male',
    'Bruce Willis': 'Male',
    'Sarah McLaughlin': 'Female',
    'Sarah Goldwoman': 'Female',
    'Sarah Palin': 'Female',
    'Sarah Hylie': 'Female'
}
df = pd.DataFrame.from_dict(data, orient='index').reset_index()
df.columns = ['name', 'gender']
df['randomInt'] = np.random.choice(range(1, 6), df.shape[0])
class ExtractNames(TransformerMixin):
    def transform(self, X, *args):
        return [{'first': name.split()[0],
                 'last': name.split()[-1]}
                for name in X]
    def fit(self, *args):
        return self
class ExtractRandInt(TransformerMixin):
    def transform(self, X2, *args):
        return [{'randInt': num} for num in X2]
    def fit(self, *args):
        return self
trans = ExtractNames()
trans2 = ExtractRandInt()
Combined = FeatureUnion([trans, trans2])
clf = make_pipeline(Combined(), DictVectorizer(), LogisticRegressionCV())
df_train, df_test = train_test_split(df, train_size=0.5, random_state=68)
clf.fit(df_train['name'], df_train['randomInt'], df_train['gender'])
Error:
Traceback (most recent call last):
  File "C:\Users\KubiK\Desktop\test5.py", line 74, in <module>
    clf = make_pipeline(Combined(), DictVectorizer(), LogisticRegressionCV())
TypeError: 'FeatureUnion' object is not callable
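Answer 1: You cannot call () on the Combined object. Calling a class works because that invokes its constructor, but Combined is already an instance, and FeatureUnion instances have no __call__ method. So the line must be: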
clf = make_pipeline(Combined, DictVectorizer(), LogisticRegressionCV())
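To illustrate the distinction the answer draws, here is a minimal pure-Python sketch. The classes Greeter and CallableGreeter are hypothetical and exist only for this illustration; they are not part of scikit-learn. The sketch shows that a class is callable (calling it constructs an instance), while an instance is callable only if its class defines __call__.

```python
class Greeter:
    """Hypothetical class without a __call__ method."""

    def __init__(self, name):
        self.name = name


class CallableGreeter(Greeter):
    """Same class, but instances can be called because __call__ is defined."""

    def __call__(self):
        return "hello " + self.name


# Calling the class itself runs the constructor and returns an instance.
plain = Greeter("Bruce")

# The instance has no __call__, so plain() raises TypeError,
# just like Combined() does in the question.
try:
    plain()
except TypeError as exc:
    message = str(exc)
    print(message)

# An instance whose class defines __call__ can be called.
greeting = CallableGreeter("Sarah")()
print(greeting)
```

In the question, Combined plays the role of plain: it is already an instance, so it should be passed to make_pipeline as-is.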
Comments:
When I remove the '()' from Combined, it gives me a new error: Traceback (most recent call last): File "C:\Users\KubiK\Desktop\test5.py", line 76, in
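The traceback in that comment is cut off, so the follow-up error is not known. Independently of it, FeatureUnion expects a list of (name, transformer) pairs rather than bare transformers, its branches must each return something that can be stacked column-wise, and Pipeline.fit takes a single X plus y. The sketch below is one possible arrangement under those constraints, not the original poster's final code. Assumptions: ColumnSelector is a hypothetical helper written for this example; each branch ends in its own DictVectorizer; plain LogisticRegression is swapped in for LogisticRegressionCV because the sample is tiny; the data is a shortened subset of the names in the question; and train_test_split is imported from sklearn.model_selection, since sklearn.cross_validation was removed in later scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, make_pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: select one column of a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.column]


class ExtractNames(TransformerMixin, BaseEstimator):
    """Turn each full name into a dict of first and last name."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{'first': name.split()[0], 'last': name.split()[-1]}
                for name in X]


class ExtractRandInt(TransformerMixin, BaseEstimator):
    """Wrap each integer in a dict so DictVectorizer can consume it."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{'randInt': int(num)} for num in X]


# Shortened subset of the question's data (assumption: enough for a demo).
df = pd.DataFrame({
    'name': ['Bruce Lee', 'Bruce Banner', 'Peter Parker', 'Bruce Willis',
             'Peter Poker', 'Sarah McLaughlin', 'Sarah Silverman',
             'Sarah Palin', 'Sarah Hyland', 'Sally Juice'],
    'gender': ['Male'] * 5 + ['Female'] * 5,
})
df['randomInt'] = np.random.RandomState(0).choice(range(1, 6), df.shape[0])

# FeatureUnion takes (name, transformer) pairs; each branch is a small
# pipeline that ends in a DictVectorizer, so the outputs can be stacked.
combined = FeatureUnion([
    ('names', make_pipeline(ColumnSelector('name'), ExtractNames(),
                            DictVectorizer())),
    ('randint', make_pipeline(ColumnSelector('randomInt'), ExtractRandInt(),
                              DictVectorizer())),
])
clf = make_pipeline(combined, LogisticRegression())

df_train, df_test = train_test_split(
    df, test_size=3, random_state=68, stratify=df['gender'])

# Pass the whole DataFrame as X; the selectors pick the columns they need.
clf.fit(df_train, df_train['gender'])

# Goal 2: attach the predicted labels to the DataFrame.
df_test = df_test.copy()
df_test['predicted'] = clf.predict(df_test)
print(df_test)
```

Passing the whole DataFrame and selecting columns inside each branch is what lets a single fit(X, y) call feed two different columns into the union.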