How to join output from sklearn preprocessing interaction variables back to original dataframe?

Posted: 2020-05-01 17:08:12

Question:

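I am trying to create interaction variables for a logistic regression model. I have 70+ features and only want to preprocess 6 of them. Does anyone know how to take the numpy array returned by fit_transform and join those interactions back to the original dataframe? Also, is there an elegant way to label the interactions so I know what I am looking at? I figured I would take the numpy array and convert it to a dataframe via pd.DataFrame, but after that I am a bit lost. Thanks in advance. I found the question below, but I am still somewhat confused about my particular use case.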

How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?

My code so far is below...

# Subset of dataframe to create interaction variables from 
df_interactions = df[['x1','x2','x3','x4','x5','x6']]

from sklearn.preprocessing import PolynomialFeatures
# degree is a constructor argument; fit_transform only takes the data
poly = PolynomialFeatures(degree=2, interaction_only=True)
df_interactions_T = poly.fit_transform(df_interactions)

Comments:

Just set it as columns: df[['x1','x2','x3','x4','x5','x6']] = poly.fit_transform(degrees=2, df[['x1','x2','x3','x4','x5','x6']])

Answer 1:

Short answer

Your output columns come in the following order:

[1,
 'x1',
 'x2',
 'x3',
 'x4',
 'x5',
 'x6',
 'x1 * x2',
 'x1 * x3',
 'x1 * x4',
 'x1 * x5',
 'x1 * x6',
 'x2 * x3',
 'x2 * x4',
 'x2 * x5',
 'x2 * x6',
 'x3 * x4',
 'x3 * x5',
 'x3 * x6',
 'x4 * x5',
 'x4 * x6',
 'x5 * x6']

If you assign these values to a variable named gen_col_names and convert the array to a DataFrame, you can see what is going on.

pd.DataFrame(df_interactions_T,columns=gen_col_names)
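
To get from that labeled frame back to the original dataframe, one option is to keep only the pairwise columns and concatenate them onto df. The following is a minimal sketch rather than part of the original answer. It assumes a toy df with one extra column named other (a hypothetical stand-in for the 70+ untouched features), passes include_bias=False to drop the constant column, and relies on PolynomialFeatures emitting the single features first and then the pairs in itertools.combinations order.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy data: six columns to interact plus one untouched column ("other" is hypothetical).
rng = np.random.default_rng(0)
cols = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']
df = pd.DataFrame(rng.integers(1, 10, size=(100, 7)), columns=cols + ['other'])

# include_bias=False drops the constant column of ones.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
arr = poly.fit_transform(df[cols])

# Output order: the original columns first, then every pair in combinations order.
pair_names = [f'{a} * {b}' for a, b in combinations(cols, 2)]
interactions = pd.DataFrame(arr, columns=cols + pair_names, index=df.index)

# Keep only the new pairwise columns and attach them to the original frame.
df_joined = pd.concat([df, interactions[pair_names]], axis=1)
print(df_joined.shape)
```

Reusing df.index when building the interaction frame matters here: pd.concat with axis=1 aligns on the index, so a mismatched index would introduce NaN rows instead of a clean side-by-side join.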

Long answer

Let's look at the source code to see what happens: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/preprocessing/_data.py#L1516

The source code that generates the combinations is:

from itertools import chain, combinations
from itertools import combinations_with_replacement as combinations_w_r

def _combinations(n_features, degree, interaction_only, include_bias):
    comb = (combinations if interaction_only else combinations_w_r)
    start = int(not include_bias)
    return chain.from_iterable(comb(range(n_features), i)
                                   for i in range(start, degree + 1))
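
To see what this helper yields, here is a small standard-library sketch that copies the function above and runs it for three features. The choice of three features and the variable names inter and full are only for illustration.

```python
from itertools import chain, combinations
from itertools import combinations_with_replacement as combinations_w_r


def _combinations(n_features, degree, interaction_only, include_bias):
    # Same logic as the snippet above, repeated so this block runs on its own.
    comb = (combinations if interaction_only else combinations_w_r)
    start = int(not include_bias)
    return chain.from_iterable(comb(range(n_features), i)
                               for i in range(start, degree + 1))


# interaction_only=True: no feature is paired with itself.
inter = list(_combinations(3, 2, interaction_only=True, include_bias=True))
# interaction_only=False: squared terms such as (0, 0) are included as well.
full = list(_combinations(3, 2, interaction_only=False, include_bias=True))

print(inter)  # [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(len(full))  # 10
```

Each tuple lists the indices of the input columns that are multiplied together, and the empty tuple stands for the bias column.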

Create the data:

import numpy as np
import pandas as pd
np.random.seed(0)
cols = ['x1','x2','x3','x4','x5','x6']
df = pd.DataFrame()

for col in cols:
    df[col] = np.random.randint(1,10,100)

df_interactions = df[['x1','x2','x3','x4','x5','x6']]

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(interaction_only=True,degree=2)
df_interactions_T = poly.fit_transform(df_interactions)

Your parameters are as follows:

n_features = 6
degree = 2
interaction_only = True
include_bias = True
combs = list(_combinations(n_features, degree, interaction_only, include_bias))
combs
[(),
 (0,),
 (1,),
 (2,),
 (3,),
 (4,),
 (5,),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (0, 5),
 (1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (2, 3),
 (2, 4),
 (2, 5),
 (3, 4),
 (3, 5),
 (4, 5)]
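
The length of this list can be checked with a quick count: one empty tuple for the bias, six single features, and every unordered pair of six features. A minimal sketch, assuming Python 3.8 or later for math.comb:

```python
from math import comb

n_features = 6
n_bias = 1                      # the empty tuple ()
n_single = comb(n_features, 1)  # one entry per input column
n_pairs = comb(n_features, 2)   # unordered pairs without repetition
n_total = n_bias + n_single + n_pairs
print(n_single, n_pairs, n_total)  # 6 15 22
```

This total is the number of columns to expect in the array returned by fit_transform for these settings.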

You can use this information to generate the column names:

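gen_col_names = []
for i in combs:
    if i == ():
        # empty tuple: the bias column of ones
        gen_col_names.append(1)
    elif len(i) == 1:
        # single index: an original feature
        gen_col_names.append(cols[i[0]])
    elif len(i) == 2:
        # two indices: the product of two features
        gen_col_names.append(cols[i[0]] + ' * ' + cols[i[1]])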

gen_col_names
[1,
 'x1',
 'x2',
 'x3',
 'x4',
 'x5',
 'x6',
 'x1 * x2',
 'x1 * x3',
 'x1 * x4',
 'x1 * x5',
 'x1 * x6',
 'x2 * x3',
 'x2 * x4',
 'x2 * x5',
 'x2 * x6',
 'x3 * x4',
 'x3 * x5',
 'x3 * x6',
 'x4 * x5',
 'x4 * x6',
 'x5 * x6']
