How to write a transformer in an sklearn pipeline with multiple DataFrame column inputs

Posted 2023-03-12


Asked: 2020-08-24 17:47:11

Question:

My DataFrame looks like this:

+---------------------+-------------+---------+---------+---------+---------+
| Date                |   pre_close |    open |    high |     low |   close |
|---------------------+-------------+---------+---------+---------+---------|
| 1992-04-27 00:00:00 |     0.93152 | 0.93152 | 1.12912 | 0.93152 | 1.08677 |
| 1992-04-28 00:00:00 |     1.08677 | 1.07266 | 1.12912 | 1.07266 | 1.10512 |
| 1992-04-29 00:00:00 |     1.10512 | 1.10512 | 1.12347 | 1.08677 | 1.11077 |
| 1992-04-30 00:00:00 |     1.11077 | 1.11077 | 1.14323 | 1.10089 | 1.1277  |
| 1992-05-04 00:00:00 |     1.1277  | 1.17146 | 1.19969 | 1.17146 | 1.19686 |
+---------------------+-------------+---------+---------+---------+---------+

I want to use the columns pre_close, close, high, and low to compute a metric, TR, through an sklearn pipeline. This is how I wrote the transformer:

class TR(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return np.max([X['high']-X['low'],
                      np.abs(X['pre_close']-X['high']),
                      np.abs(X['pre_close']-X['low'])], axis=1)

This is how I use it in the pipeline:

pipeline = Pipeline([("tr", TR())])

full_pipeline = ColumnTransformer([("num", pipeline, ['pre_close', 'close', 'high', 'low'])], remainder="passthrough")

data = full_pipeline.fit_transform(df)

But I get this error:

TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'. '<function TR at 0x1a181d2170>' (type <class 'function'>) doesn't
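The message reports that the step it received is a plain function named TR, not an estimator object with a fit method. Assuming TR was bound to a function when the error was raised, one way to use a function as a pipeline step is to wrap it in FunctionTransformer. The sketch below is illustrative only: the helper name true_range and the three-row sample frame (taken from the table above, without the Date column) are assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def true_range(X):
    """Hypothetical helper: row-wise TR returned as a single-column 2-D array."""
    candidates = np.column_stack([
        X["high"] - X["low"],
        (X["pre_close"] - X["high"]).abs(),
        (X["pre_close"] - X["low"]).abs(),
    ])
    # Reduce across the three candidates for each row, then make the result 2-D.
    return candidates.max(axis=1).reshape(-1, 1)


# Sample rows copied from the question's table (Date omitted for brevity).
df = pd.DataFrame({
    "pre_close": [0.93152, 1.08677, 1.10512],
    "open": [0.93152, 1.07266, 1.10512],
    "high": [1.12912, 1.12912, 1.12347],
    "low": [0.93152, 1.07266, 1.08677],
    "close": [1.08677, 1.10512, 1.11077],
})

cols = ["pre_close", "close", "high", "low"]
# FunctionTransformer supplies fit/transform around the plain function.
pipeline = Pipeline([("tr", FunctionTransformer(true_range))])
full_pipeline = ColumnTransformer([("num", pipeline, cols)], remainder="passthrough")

data = full_pipeline.fit_transform(df)
print(data)  # first column: TR, second column: the passed-through "open"
```

Because the selected columns arrive as a DataFrame, the function can still index them by name.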

Comments:

Is there a reason to inherit from BaseEstimator? If the class does not inherit from it, it works.

Answer 1:

I think you can simply use pandas apply:

df["num"] = df.apply(lambda x: np.max([x['high']-x['low'], 
                                       np.abs(x['pre_close']-x['high']), 
                                       np.abs(x['pre_close']-x['low'])]), axis=1)
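If the computation should stay inside the pipeline, a vectorized transformer class is another option. The following is a minimal sketch rather than a definitive fix: the class name TrueRange and the sample frame are assumptions. It differs from the transformer in the question in two ways: the maximum is taken with axis=0 (across the three stacked candidate arrays, giving one value per row), and the result is reshaped to 2-D, which ColumnTransformer expects from each transformer. Note also that an instance, TrueRange(), is passed as the step.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class TrueRange(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: computes TR from high, low and pre_close."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        stacked = np.array([
            (X["high"] - X["low"]).to_numpy(),
            np.abs(X["pre_close"] - X["high"]).to_numpy(),
            np.abs(X["pre_close"] - X["low"]).to_numpy(),
        ])  # shape: (3, n_rows)
        # axis=0 picks the largest of the three candidates for each row.
        return stacked.max(axis=0).reshape(-1, 1)


# Sample rows copied from the question's table (Date omitted for brevity).
df = pd.DataFrame({
    "pre_close": [0.93152, 1.08677, 1.10512],
    "open": [0.93152, 1.07266, 1.10512],
    "high": [1.12912, 1.12912, 1.12347],
    "low": [0.93152, 1.07266, 1.08677],
    "close": [1.08677, 1.10512, 1.11077],
})

full_pipeline = ColumnTransformer(
    [("tr", TrueRange(), ["pre_close", "close", "high", "low"])],
    remainder="passthrough",
)
data = full_pipeline.fit_transform(df)
print(data)  # first column: TR, second column: the passed-through "open"
```

Operating on whole columns avoids the per-row Python call that df.apply with axis=1 makes.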
