How can I port a model (GLM) from h2o to scikit-learn?
Posted: 2020-10-02 08:01:43

I am trying to train a machine-learning algorithm to predict some data (real numbers).
Using h2o AutoML, I found a model that predicts my variable almost perfectly (maximum error across 16k+ test observations of
Now I would like to reproduce that model using scikit-learn and pandas, since those are the libraries I use heavily in the other parts of my project.

Can anyone here help me with this port, so that I no longer need to depend on h2o?

This is what the h2o model looks like:
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_1_AutoML_20200611_172640
GLM Model: summary
family link regularization lambda_search number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
0 gaussian identity Ridge ( lambda = 2.52E-5 ) nlambda = 30, lambda.max = 15.647, lambda.min = 7.073E-4, lambda.1... 32 32 30 automl_training_py_2_sid_b7bf
ModelMetricsRegressionGLM: glm
** Reported on train data. **
MSE: 3.339037335446298e-07
RMSE: 0.0005778440391183678
MAE: 0.00034964760501650187
RMSLE: 0.000428181614223859
R^2: 0.9999869058722919
Mean Residual Deviance: 3.339037335446298e-07
Null degrees of freedom: 5470919
Residual degrees of freedom: 5470887
Null deviance: 139509.91273673013
Residual deviance: 1.826760613923986
AIC: -66058752.72303456
ModelMetricsRegressionGLM: glm
** Reported on cross-validation data. **
MSE: 3.8803522694657527e-07
RMSE: 0.0006229247361813266
MAE: 0.0003746132971400872
RMSLE: 0.00046256895714739585
R^2: 0.9999847830907341
Mean Residual Deviance: 3.8803522694657527e-07
Null degrees of freedom: 5470919
Residual degrees of freedom: 5470887
Null deviance: 139510.0902560055
Residual deviance: 2.1229096838065575
AIC: -65236783.11156034
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 3.6770012E-4 4.1263074E-6 3.6505258E-4 3.706906E-4 3.6763187E-4 3.7266721E-4 3.6245832E-4
1 mean_residual_deviance 3.7148237E-7 6.667412E-9 3.680999E-7 3.7594532E-7 3.6979083E-7 3.802558E-7 3.6332003E-7
2 mse 3.7148237E-7 6.667412E-9 3.680999E-7 3.7594532E-7 3.6979083E-7 3.802558E-7 3.6332003E-7
3 null_deviance 27902.018 24.152164 27880.328 27876.885 27900.943 27932.406 27919.527
4 r2 0.99998546 2.5938294E-7 0.9999856 0.9999852 0.9999855 0.9999851 0.99998575
5 residual_deviance 0.4064701 0.007295375 0.40276903 0.41135335 0.40461922 0.41606984 0.397539
6 rmse 6.094739E-4 5.466305E-6 6.0671236E-4 6.131438E-4 6.081043E-4 6.166489E-4 6.0276035E-4
7 rmsle 4.5219096E-4 3.5735677E-6 4.508027E-4 4.5445774E-4 4.5059595E-4 4.5708878E-4 4.4800964E-4
Scoring History:
timestamp duration iteration lambda predictors deviance_train deviance_test deviance_xval deviance_se
0 2020-06-11 17:29:06 0.000 sec 1 .16E2 33 0.014121 NaN 0.015569 7.637899e-06
1 2020-06-11 17:29:07 0.596 sec 2 .97E1 33 0.010911 NaN 0.012416 7.202663e-06
2 2020-06-11 17:29:08 1.141 sec 3 .6E1 33 0.007864 NaN 0.009245 6.969671e-06
3 2020-06-11 17:29:08 1.748 sec 4 .37E1 33 0.005319 NaN 0.006438 6.863783e-06
4 2020-06-11 17:29:09 2.294 sec 5 .23E1 33 0.003427 NaN 0.004231 6.587463e-06
5 2020-06-11 17:29:09 2.851 sec 6 .14E1 33 0.002145 NaN 0.002679 5.981752e-06
6 2020-06-11 17:29:10 3.437 sec 7 .9E0 33 0.001332 NaN 0.001665 5.137335e-06
7 2020-06-11 17:29:10 3.983 sec 8 .56E0 33 0.000835 NaN 0.001036 4.157958e-06
8 2020-06-11 17:29:11 4.563 sec 9 .35E0 33 0.000531 NaN 0.000654 3.186066e-06
9 2020-06-11 17:29:12 5.165 sec 10 .21E0 33 0.000343 NaN 0.000417 2.279007e-06
10 2020-06-11 17:29:12 5.690 sec 11 .13E0 33 0.000220 NaN 0.000269 1.508395e-06
11 2020-06-11 17:29:13 6.277 sec 12 .83E-1 33 0.000141 NaN 0.000174 9.034596e-07
12 2020-06-11 17:29:13 6.853 sec 13 .51E-1 33 0.000090 NaN 0.000111 4.893595e-07
13 2020-06-11 17:29:14 7.396 sec 14 .32E-1 33 0.000057 NaN 0.000071 2.314916e-07
14 2020-06-11 17:29:14 7.990 sec 15 .2E-1 33 0.000035 NaN 0.000044 1.386863e-07
15 2020-06-11 17:29:15 8.560 sec 16 .12E-1 33 0.000021 NaN 0.000027 3.953892e-08
16 2020-06-11 17:29:16 9.146 sec 17 .77E-2 33 0.000012 NaN 0.000016 3.271502e-08
17 2020-06-11 17:29:16 9.753 sec 18 .48E-2 33 0.000007 NaN 0.000009 3.171957e-08
18 2020-06-11 17:29:17 10.309 sec 19 .3E-2 33 0.000004 NaN 0.000005 1.575243e-08
19 2020-06-11 17:29:17 10.875 sec 20 .18E-2 33 0.000002 NaN 0.000003 1.565016e-08
See the whole table with table.as_data_frame()
It seems I should be able to get the same result with scikit-learn's Ridge or Lasso (or maybe TweedieRegressor?), but I have tried several parameter settings and got poor results.

Can anyone help me out? I have read the h2o docs and the scikit-learn docs, but I don't know how to proceed.
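For the gaussian-family, identity-link Ridge GLM shown in the summary above, a minimal sketch of the scikit-learn equivalent might look like this. The data below is a synthetic stand-in for the real training frame, and the penalty conversion rests on an assumption worth verifying against your data: h2o's Ridge objective penalizes the *mean* squared error (`(1/2N)·Σr² + (λ/2)·‖w‖²`), while scikit-learn's `Ridge` penalizes the *sum* (`‖r‖² + α·‖w‖²`), so `alpha ≈ lambda * n_rows`. h2o also standardizes predictors by default, which the sketch mirrors explicitly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real training frame (32 predictors,
# matching number_of_active_predictors in the model summary).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 32))
y_train = X_train @ rng.normal(size=32) + rng.normal(scale=1e-3, size=1000)

# h2o standardizes predictors by default; mirror that explicitly.
X_std = StandardScaler().fit_transform(X_train)

# Convert h2o's lambda (from the model summary) to sklearn's alpha,
# assuming the mean-vs-sum objective scaling described above.
h2o_lambda = 2.52e-5
alpha = h2o_lambda * len(X_std)

model = Ridge(alpha=alpha, fit_intercept=True)
model.fit(X_std, y_train)
```

If the fitted coefficients still disagree noticeably, the remaining gap is usually solver tolerance or the standardization step, not the penalty itself.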
Comments:
Answer 1:

This is related and has been answered here: https://***.com/a/68370871/17441922
Sklearn is based on Python/Cython/C, whereas H2O runs on Java, and the underlying algorithms may also differ. However, you can try matching/converting your hyperparameters between the two, since they will be similar.
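Rather than trying to make two different solvers converge to the same fit, another option is to carry the already-fitted coefficients over. h2o's `H2OGeneralizedLinearEstimator` exposes them via `model.coef()`; the dict below is a hypothetical stand-in for that output, used to populate a scikit-learn `LinearRegression` by hand so that inference no longer needs h2o:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical values, standing in for what h2o's model.coef() returns
# (a dict mapping coefficient names to values, plus an "Intercept" key).
h2o_coefs = {"Intercept": 0.5, "x1": 1.2, "x2": -0.7}
feature_order = ["x1", "x2"]  # must match the column order of your X

# Populate an unfitted LinearRegression with the transplanted weights.
lr = LinearRegression()
lr.coef_ = np.array([h2o_coefs[f] for f in feature_order])
lr.intercept_ = h2o_coefs["Intercept"]
lr.n_features_in_ = len(feature_order)  # keeps sklearn's input validation happy

X_new = np.array([[1.0, 2.0]])
pred = lr.predict(X_new)  # 0.5 + 1.2*1.0 - 0.7*2.0 = 0.3
```

One caveat to check against the h2o docs for your version: because h2o standardizes predictors by default, `coef()` reports coefficients on the original data scale while `coef_norm()` reports the standardized ones, so use the set that matches the feature space you will predict in.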
Comments:
As it is currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.