Logistic Regression Model using Regularization (L1 / L2) Lasso and Ridge

Posted: 2021-02-07 21:46:10

Question:

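I am trying to build a model and run a grid search; the code is below. The raw data (credit card fraud data) was downloaded from this site: https://www.kaggle.com/mlg-ulb/creditcardfraud

The code below starts at the standardization step, after the data has been read into credit_card_fraud_df.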

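import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import PowerTransformer, StandardScaler

# credit_card_fraud_df is assumed to be already loaded from the Kaggle CSV
standardization = StandardScaler()
credit_card_fraud_df[['Amount']] = standardization.fit_transform(credit_card_fraud_df[['Amount']])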
# Assigning feature variable to X
X = credit_card_fraud_df.drop(['Class'], axis=1)

# Assigning response variable to y
y = credit_card_fraud_df['Class']
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)
X_train.head()
power_transformer = PowerTransformer(copy=False)
power_transformer.fit(X_train)                       ## Fit the PT on training data
X_train_pt_df = power_transformer.transform(X_train)    ## Then apply on all data
X_test_pt_df = power_transformer.transform(X_test)
y_train_pt_df = y_train
y_test_pt_df = y_test
train_pt_df = pd.DataFrame(data=X_train_pt_df, columns=X_train.columns.tolist())
# set up cross validation scheme
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)

# specify range of hyperparameters
params = {"C": np.logspace(-3, 3, 5), "penalty": ["l1", "l2"]}  # l1 = lasso, l2 = ridge

## using Logistic regression for class imbalance
model = LogisticRegression(class_weight='balanced')
grid_search_cv = GridSearchCV(estimator = model, param_grid = params, 
                        scoring= 'roc_auc', 
                        cv = folds, 
                        return_train_score=True, verbose = 1)            
grid_search_cv.fit(X_train_pt_df, y_train_pt_df)
## reviewing the results
cv_results = pd.DataFrame(grid_search_cv.cv_results_)
cv_results

Sample results:

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_C  param_penalty  params                         split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
0  0.044332       0.002040      0.000000         0.000000        0.001    l1             {'C': 0.001, 'penalty': 'l1'}  NaN                NaN                NaN                NaN                NaN                NaN              NaN             6
1  0.477965       0.046651      0.016745         0.003813        0.001    l2             {'C': 0.001, 'penalty': 'l2'}  0.485714           0.428571           0.542857           0.485714           0.457143           0.480000         0.037904        5

There are no null values in my input data, so I don't understand why I am getting NaN values in these columns. Can anyone help?

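Comments:

Is your data standardized?

I standardized it using StandardScaler().

Please provide a minimal reproducible example.

@SergeyBushmanov I have shared the link for downloading the data and have also posted the code.

Thanks a lot, @SergeyBushmanov. It worked for me. I will go through the link.

Answer 1: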

The problem is the default solver used by the model defined here:

model = LogisticRegression(class_weight='balanced')

It fails for the l1 penalty with the following error message:

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
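The NaN scores in the question appear because GridSearchCV, by default, records a failed fit as NaN (its error_score parameter defaults to np.nan) rather than stopping. A minimal sketch of how to surface the underlying exception, assuming a small synthetic dataset from make_classification in place of the credit card data (the names X_demo, y_demo and error_message are illustrative only, and the exact message wording may differ between scikit-learn versions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the credit card data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=4)
search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced"),  # default solver: lbfgs
    param_grid={"C": [0.1, 1.0], "penalty": ["l1"]},
    scoring="roc_auc",
    cv=folds,
    error_score="raise",  # re-raise fit errors instead of storing NaN
)

error_message = None
try:
    search.fit(X_demo, y_demo)
except ValueError as exc:
    # The solver/penalty mismatch is reported here instead of becoming NaN.
    error_message = str(exc)

print(error_message)
```

With the default error_score, the same failure would only show up as NaN entries in cv_results_.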

It may also be useful to study the docs before defining the parameter grid:

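penalty: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=‘l2’. Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not supported by the liblinear solver), no regularization is applied.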
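The compatibility rules in that excerpt can be written down as a small lookup, which makes it easy to check a grid before running it. This is only a sketch: SUPPORTED_PENALTIES and solvers_supporting are hypothetical helper names, not part of scikit-learn, and the liblinear entry (l1 and l2) comes from the wider LogisticRegression documentation rather than from the excerpt itself.

```python
# Penalties supported by each LogisticRegression solver, following the
# documentation quoted above (newer scikit-learn releases spell "none" as None).
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", "none"},
    "newton-cg": {"l2", "none"},
    "sag": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}


def solvers_supporting(penalties):
    """Return the solvers that support every penalty in `penalties`."""
    wanted = set(penalties)
    return sorted(s for s, ok in SUPPORTED_PENALTIES.items() if wanted <= ok)


# The grid in the question mixes l1 and l2, so lbfgs is ruled out.
print(solvers_supporting(["l1", "l2"]))  # ['liblinear', 'saga']
```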

Once you switch to a solver that supports every penalty in your grid, you are good to go:

## using Logistic regression for class imbalance
model = LogisticRegression(class_weight='balanced', solver='saga')
grid_search_cv = GridSearchCV(estimator = model, param_grid = params, 
                        scoring= 'roc_auc', 
                        cv = folds, 
                        return_train_score=True, verbose = 1)            
grid_search_cv.fit(X_train_pt_df, y_train_pt_df)
## reviewing the results
cv_results = pd.DataFrame(grid_search_cv.cv_results_)
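For a version that can be run without the Kaggle file, here is a self-contained sketch of the same fix. It assumes an imbalanced synthetic dataset from make_classification as a stand-in for the fraud data and raises max_iter to 5000, a value chosen here for illustration rather than taken from the question; the point is only that no row of mean_test_score is NaN once the solver supports both penalties.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Imbalanced synthetic data standing in for the fraud dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=0
)

params = {"C": np.logspace(-3, 3, 5), "penalty": ["l1", "l2"]}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)

# saga supports both l1 and l2; max_iter is raised to give it room to converge.
model = LogisticRegression(class_weight="balanced", solver="saga", max_iter=5000)

grid_search_cv = GridSearchCV(model, params, scoring="roc_auc", cv=folds)
grid_search_cv.fit(X_demo, y_demo)

cv_results = pd.DataFrame(grid_search_cv.cv_results_)
print(cv_results[["param_C", "param_penalty", "mean_test_score"]])
```

liblinear would also accept this grid; saga is used here only because it is the solver chosen in the answer.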

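Also watch for ConvergenceWarning, which may suggest that you need to increase the default max_iter or tol, or switch to another solver and rethink the desired parameter grid.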
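A rough way to see the effect of max_iter is to count the ConvergenceWarning instances raised during a fit. The sketch below assumes synthetic data and a hypothetical helper fit_and_count_warnings; the two max_iter values are arbitrary illustrations, not recommendations.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)


def fit_and_count_warnings(max_iter):
    """Fit a saga model and return (model, number of ConvergenceWarnings)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(solver="saga", penalty="l1", max_iter=max_iter)
        model.fit(X_demo, y_demo)
    n_conv = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, n_conv


# A single pass over the data is not enough, so a warning is expected here.
_, warnings_low = fit_and_count_warnings(max_iter=1)
model_high, warnings_high = fit_and_count_warnings(max_iter=5000)

print("max_iter=1    ->", warnings_low, "ConvergenceWarning(s)")
print("max_iter=5000 ->", warnings_high, "ConvergenceWarning(s), n_iter_ =", model_high.n_iter_)
```

If the warning persists even with a large max_iter, scaling the features or narrowing the range of C is usually worth trying before loosening tol.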
