sklearn transformation pipeline and FeatureUnion
【Posted】: 2018-02-16 18:01:09 【Question】: I am having trouble running the code below. It is a machine learning problem about housing prices.
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.preprocessing import Imputer  # available in older scikit-learn releases
from sklearn.base import BaseEstimator, TransformerMixin

# housing_num is the DataFrame holding only the numerical columns of housing
num_attributes = list(housing_num)
cat_attributes = ['ocean_proximity']
# column indices of the attributes used to build the ratio features
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.attribute_names].values

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Append ratio features computed from existing columns."""
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attributes)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attributes)),
    ('label_binarizer', LabelBinarizer()),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
The error appears when I run:
housing_prepared = full_pipeline.fit_transform(housing)
The error reads:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-141-acd0fd68117b> in <module>()
----> 1 housing_prepared = full_pipeline.fit_transform(housing)
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
744 delayed(_fit_transform_one)(trans, weight, X, y,
745 **fit_params)
--> 746 for name, trans, weight in self._iter())
747
748 if not result:
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in __init__(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit_transform_one(transformer, weight, X, y, **fit_params)
587 **fit_params):
588 if hasattr(transformer, 'fit_transform'):
--> 589 res = transformer.fit_transform(X, y, **fit_params)
590 else:
591 res = transformer.fit(X, y, **fit_params).transform(X)
/Users/nieguangtao/ml/env_1/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
290 Xt, fit_params = self._fit(X, y, **fit_params)
291 if hasattr(last_step, 'fit_transform'):
--> 292 return last_step.fit_transform(Xt, y, **fit_params)
293 elif last_step is None:
294 return Xt
TypeError: fit_transform() takes exactly 2 arguments (3 given)
So my first question is: what causes this error?
After getting this error I tried to track down the cause, so I ran the transformers above one at a time:
DFS=DataFrameSelector(num_attributes)
a1=DFS.fit_transform(housing)
imputer=Imputer(strategy='median')
a2=imputer.fit_transform(a1)
CAA=CombinedAttributesAdder()
a3=CAA.fit_transform(a2)
SS=StandardScaler()
a4=SS.fit_transform(a3)
DFS2=DataFrameSelector(cat_attributes)
b1=DFS2.fit_transform(housing)
LB=LabelBinarizer()
b2=LB.fit_transform(b1)
result=np.concatenate((a4,b2),axis=1)
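These all run correctly, except that the result I get is a numpy.ndarray of shape (16512, 16), whereas housing_prepared = full_pipeline.fit_transform(housing) is expected to produce an array of shape (16512, 17). So my second question is: what causes the difference?
housing is a DataFrame of shape (16512, 9), with 1 categorical feature and 8 numerical features.
Thank you in advance.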
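【Comments】:
The first error is caused by LabelBinarizer. It expects only a single input, y, but inside a pipeline both X and y are passed to it. Please share the data and I can help.
@VivekKumar Here is the link; it is the housing data: drive.google.com/file/d/0B12I2_fMO94pVHZhQlVrSlFtZEk/…
Why do you think the result should have 17 columns rather than 16?
@VivekKumar Actually, I also think it should be 16 columns. But this is an example from a textbook, and the code is theirs. They can run the code that fails for me, and they get a 17-column result that I cannot explain.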
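The signature mismatch described in the first comment can be illustrated without scikit-learn at all. The sketch below uses toy stand-ins (LabelStyleBinarizer and run_last_step are hypothetical names, not library classes): one transformer whose fit_transform accepts a single array, and one helper that calls the final step with both X and y, the way the traceback shows the pipeline doing.

```python
# Toy stand-ins (hypothetical names), not scikit-learn classes.


class LabelStyleBinarizer:
    """Mimics a label transformer: fit_transform accepts only one array."""

    def fit_transform(self, y):
        classes = sorted(set(y))
        return [[int(value == c) for c in classes] for value in y]


def run_last_step(last_step, Xt, y=None):
    """Mimics how a pipeline calls its final step: always with X and y."""
    return last_step.fit_transform(Xt, y)


labels = ["INLAND", "NEAR BAY", "INLAND"]

# Called directly with a single argument, the transformer works.
direct = LabelStyleBinarizer().fit_transform(labels)

# Called the way a pipeline would, it receives one argument too many.
try:
    run_last_step(LabelStyleBinarizer(), labels)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(direct)
print(error_message)
```

The direct call succeeds, while the pipeline-style call raises a TypeError about the number of arguments, which is the same kind of failure as in the question.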
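【Answer 1】:
It looks like sklearn is identifying the data types differently from what you expect. Make sure the numbers are recognized as int. The easiest way: use the data provided by the author of the code you posted, Aurelien Geron, Hands-On Machine Learning.
【Discussion】: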
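【Answer 2】: I ran into this problem while working through the book. After trying a bunch of workarounds (which felt like a waste of my time), I gave up and installed scikit-learn v0.20 dev. Download it here and install it with pip. This should let you use the CategoricalEncoder class, which was designed to handle these problems.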
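For readers on a released scikit-learn version, a possible alternative is sketched here. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly and takes (X, y=None), so it can sit inside a Pipeline. As far as I know, the development-only CategoricalEncoder was folded into OneHotEncoder and OrdinalEncoder before the 0.20 release. The toy column below is made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# A tiny made-up categorical column, 2D as transformers expect for X.
ocean = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

# OneHotEncoder returns a sparse matrix by default; convert it for display.
cat_pipeline = Pipeline([("one_hot", OneHotEncoder())])
encoded = cat_pipeline.fit_transform(ocean).toarray()

# Categories discovered during fit, in the order of the output columns.
categories = list(cat_pipeline.named_steps["one_hot"].categories_[0])
print(categories)
print(encoded.shape)
```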
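【Discussion】:
【Answer 3】: I had the same problem, and it was caused by an indentation issue that does not always raise an error (see https://***.com/a/14046894/3665886).
If you copied the code straight from the book, make sure it is indented correctly.
【Discussion】: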
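【Answer 4】:
TypeError: fit_transform() takes exactly 2 arguments (3 given)
Why does this error occur?
Answer: because you are using LabelBinarizer(), which is ideally suited to the response variable (the labels) rather than to input features.
What to do? You have a few options:
- Use OneHotEncoder() instead
- Write a custom transformer for LabelBinarizer
- Use an older version of sklearn that supports your code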
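The second option could look roughly like the sketch below. PipelineLabelBinarizer is a hypothetical name, and the wrapper is only one way to do it: it gives LabelBinarizer the (X, y=None) interface a pipeline expects and assumes a single categorical column as input. The toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper giving LabelBinarizer an (X, y) interface."""

    def fit(self, X, y=None):
        # Flatten the single categorical column; y is accepted but ignored.
        self.binarizer_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.binarizer_.transform(np.asarray(X).ravel())


# A tiny made-up categorical column with three distinct values.
ocean = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

cat_pipeline = Pipeline([("label_binarizer", PipelineLabelBinarizer())])
encoded = cat_pipeline.fit_transform(ocean)
print(encoded.shape)
```

Because fit_transform is inherited from TransformerMixin, the pipeline can pass both X and y without triggering the TypeError.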
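The shape of housing_prepared differs
If you are using this data, you have 9 predictors (8 numerical variables and 1 categorical variable). CombinedAttributesAdder() adds 3 columns and LabelBinarizer() adds 5 columns, so the total becomes 17 columns. Remember that sklearn.pipeline.FeatureUnion concatenates the results of multiple transformer objects.
When you do this by hand, you do not add the original "ocean_proximity" variable.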
Let's see it in action:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelBinarizer

# housing, housing_num, DataFrameSelector and CombinedAttributesAdder come from the question
print("housing_shape: ", housing.shape)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# numerical branch
DFS = DataFrameSelector(num_attribs)
a1 = DFS.fit_transform(housing)
print('Numerical variables_shape: ', a1.shape)
imputer = SimpleImputer(strategy='median')
a2 = imputer.fit_transform(a1)
print(a2.shape)  # same as a1.shape
CAA = CombinedAttributesAdder()
a3 = CAA.fit_transform(a2)  # added 3 variables
SS = StandardScaler()
a4 = SS.fit_transform(a3)
print('Numerical variable shape after CAA: ', a4.shape, '\n')

# categorical branch
DFS2 = DataFrameSelector(cat_attribs)
b1 = DFS2.fit_transform(housing)
print("Categorical variables_shape: ", b1.shape)
LB = LabelBinarizer()
b2 = LB.fit_transform(b1)  # instead of one column we now have 5 columns
print('categorical variable shape after LabelBinarization: ', b2.shape)  # 4 columns added
print(b2)

result = np.concatenate((a4, b2), axis=1)
print('final shape: ', result.shape, '\n')  # final shape
Note: the transformed columns (the result in a4) and the binarized columns (the result in b2) have not yet been added back to the original DataFrame. To do that, convert the NumPy arrays to DataFrames:
import pandas as pd

new_features = pd.DataFrame(a4)
print(new_features.shape)
# LabelBinarizer orders its output columns by LB.classes_ (sorted labels),
# so take the column names from there instead of a hand-written list
ocean_LabelBinarize = pd.DataFrame(b2, columns=LB.classes_)
print(ocean_LabelBinarize)
housing_prepared_new = pd.concat([new_features, ocean_LabelBinarize], axis=1)
print('Shape of new data prepared by above steps', housing_prepared_new.shape)
When we use the pipeline, it also keeps the original (ocean_proximity) variable along with the newly created binarized columns.
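One way to check column counts like these is to run FeatureUnion on a toy array. The sketch below assumes a recent scikit-learn and uses two made-up branches built with FunctionTransformer (first_two and row_sum are illustrative names). In this sketch the width of the union's output equals the sum of the widths its branches return, and a column shows up in the output only when one of the branches emits it, which can be compared against the 16 and 17 discussed above.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy input with 4 rows and 4 columns.
X = np.arange(16, dtype=float).reshape(4, 4)

# Two hypothetical branches: one keeps the first two columns,
# the other emits a single derived column (the row sum).
first_two = FunctionTransformer(lambda data: data[:, :2])
row_sum = FunctionTransformer(lambda data: data.sum(axis=1, keepdims=True))

union = FeatureUnion([("first_two", first_two), ("row_sum", row_sum)])
combined = union.fit_transform(X)

# 2 columns from the first branch + 1 from the second.
print(X.shape, combined.shape)
```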
【Discussion】: