Python forward stepwise regression 'Not in Index'

Posted: 2021-04-05 14:11:09

Question:

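I'm working through a tutorial on the Boston housing data, with the help of some online examples of forward stepwise selection. I keep getting an error saying that one of the variables is not in the index.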

import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split
# load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston
boston_dataset = load_boston()

# create a DataFrame from the Boston data
X = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
y = boston_dataset.target

# split data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=5)

Here is the regression loop, taken from this website; there is also a nearly identical piece of code here:

def forward_regression(X, y,
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out = 0.05, 
                       verbose=True):
    initial_list = []
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.argmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add {:30} with p-value {:.6}'.format(best_feature, best_pval))

        if not changed:
            break

    return included

As soon as I run forward_regression(X_train, Y_train), I get the 'not in index' error.

Any suggestions are appreciated!

Comments:

What are the shapes of X and y?
x: (404, 13) y: (404,)

Answer 1:

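You need to use idxmin() instead of argmin(). The latter returns the integer position of the minimum, whereas idxmin() returns its label. The label is what has to be appended to included, because X is then indexed by column name.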
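To make the difference concrete, here is a minimal sketch using a small made-up Series of p-values and a toy DataFrame. The numbers and the three column names are placeholders chosen for illustration, not results from the Boston data. It shows that the integer returned by argmin() is not a valid column label, which is what triggers the KeyError.

```python
import pandas as pd

# Hypothetical p-values keyed by column name (made-up numbers for illustration).
new_pval = pd.Series({"CRIM": 0.20, "LSTAT": 0.001, "RM": 0.03}, dtype=float)

position = new_pval.argmin()  # integer position of the smallest value
label = new_pval.idxmin()     # index label of the smallest value

# A toy frame whose columns carry the same string labels.
X = pd.DataFrame({"CRIM": [0.1, 0.2], "LSTAT": [4.0, 9.1], "RM": [6.5, 6.0]})

# Selecting by label works.
selected = X[[label]]

# Selecting by the integer position fails, because no column is labelled 1.
try:
    X[[position]]
    key_error_raised = False
except KeyError:
    key_error_raised = True

print(position, label, key_error_raised)
```

In the question's loop the integer ends up in included, so the failure only appears on the next pass, when X[included + [new_column]] is evaluated.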

The fixed function is:

def forward_regression(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # threshold_out is unused here: this function only performs forward steps
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # forward step: try each column that is not yet in the model
        excluded = list(set(X.columns) - set(included))
        # dtype=float avoids an object-dtype Series on recent pandas versions
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            # Change argmin -> idxmin so that a column label is stored, not a position
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add {:30} with p-value {:.6}'.format(best_feature, best_pval))

        if not changed:
            break

    return included
