How to extract best parameters from a CrossValidatorModel

Posted 2023-02-16

【Asked】: 2015-10-23 08:02:12

【Question】:

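I want to find the ParamGridBuilder parameters that produce the best model in a CrossValidator in Spark 1.4.x.

In the Pipeline Example in the Spark documentation, different parameters (numFeatures, regParam) are added to the pipeline through a ParamGridBuilder. The best model is then produced by the following line of code: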

val cvModel = crossval.fit(training.toDF)

Now I want to know which ParamGridBuilder parameters (numFeatures, regParam) produced the best model.

I have tried the following commands without success:

cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()

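Any help?

Thanks in advance,

【Question comments】:

The best params are dumped to the log, but I can't access this information from the CrossValidatorModel instance. This is really frustrating.

They don't even log it in PySpark. Missing such a small but important thing... it makes me wonder whether anyone is really using this feature.

Guys, is there a solution to this problem in the latest version of Spark?

You can definitely get it from cvModel.bestModel, see my answer below.

This SO thread kind of answers the question.

【Answer 1】: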

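import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.HashingTF

// bestModel is typed as Model[_], so cast it to the pipeline model first
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages

// Stage 1 of the documentation's pipeline is the HashingTF transformer
val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)

// Stage 2 is the fitted logistic regression model
val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)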

source

【Comments】:

【Answer 2】:

One way to get the right ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:

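import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidatorModel

implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  // Pair each candidate ParamMap with its average metric and keep the best one
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
           .zip(cvModel.avgMetrics)
           .maxBy(_._2)
           ._1
  }
}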

When run on the CrossValidatorModel trained in the Pipeline Example you cited:

scala> println(cvModel.bestEstimatorParamMap)

{
   hashingTF_2b0b8ccaeeec-numFeatures: 100,
   logreg_950a13184247-regParam: 0.1
}
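
The same argmax lookup can be sketched in PySpark. This is only a minimal sketch: `best_param_map` is a hypothetical helper name (not part of Spark), and it assumes a PySpark version in which the fitted CrossValidatorModel exposes `avgMetrics`, `getEstimatorParamMaps()` and `getEvaluator()`. It picks the largest or the smallest average metric depending on what the evaluator's `isLargerBetter()` reports.

```python
def best_param_map(cv_model):
    """Return the grid entry whose average cross-validation metric is best.

    Assumption: cv_model behaves like a fitted PySpark CrossValidatorModel,
    i.e. it provides getEstimatorParamMaps(), avgMetrics and getEvaluator().
    """
    param_maps = cv_model.getEstimatorParamMaps()
    metrics = list(cv_model.avgMetrics)
    # Some metrics (e.g. error measures) are better when smaller
    pick = max if cv_model.getEvaluator().isLargerBetter() else min
    best_index = pick(range(len(metrics)), key=lambda i: metrics[i])
    return param_maps[best_index]
```

In a PySpark session this would be called as `best_param_map(cv_model)`. The keys of the returned map are Param objects, so `{p.name: v for p, v in best.items()}` turns it into a plain name-to-value dict.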

【Comments】:

Note: maxBy may need to be minBy, depending on the value of Evaluator.isLargerBetter.

【Answer 3】:

This is the ParamGridBuilder():

paraGrid = ParamGridBuilder().addGrid(
    hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
    lr.regParam, [0.1, 0.01, 0.001]
).build()

There are 3 stages in the pipeline. It seems we can inspect the parameters as follows:

for stage in cv_model.bestModel.stages:
    print('stage: {}'.format(stage))
    print(stage.params)
    print('\n')

stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]

stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]

stage: LogisticRegression_451b8c8dbef84ecab7a9
[]

However, the last stage, LogisticRegression, shows no parameters.

We can also get the weights and intercept from the logistic regression, as follows:

cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])

Full exploration: http://kuanliang.github.io/2016-06-07-SparkML-pipeline/
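
Note that `stage.params` lists Param descriptors, not their values. A minimal sketch for collecting the values per stage follows; `stage_param_values` is a hypothetical helper name, and it assumes each stage provides `uid` and `extractParamMap()` as PySpark's Params objects do. On the version used in this answer the fitted LogisticRegression stage exposed no params, so its entry would presumably come back empty there.

```python
def stage_param_values(pipeline_model):
    """Map each stage uid to a {param name: value} dict.

    Assumption: pipeline_model behaves like a fitted PySpark PipelineModel
    whose stages provide uid and extractParamMap().
    """
    summary = {}
    for stage in pipeline_model.stages:
        # extractParamMap() returns a dict of Param -> value (defaults and set values)
        param_map = stage.extractParamMap()
        summary[stage.uid] = {param.name: value for param, value in param_map.items()}
    return summary
```

Calling `stage_param_values(cv_model.bestModel)` would then give one dict per stage, keyed by the stage uid.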

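【Comments】:

【Answer 4】:

This is how you get the chosen parameters:

// bestModel is typed as Model[_]; cast it to the concrete model class before calling these getters
println(cvModel.bestModel.getMaxIter)
println(cvModel.bestModel.getRegParam)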

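【Comments】:

Please don't add the same answer to multiple questions. Answer the best one and flag the rest as duplicates. See meta.stackexchange.com/questions/104227/…

【Answer 5】:

This Java code should work: cvModel.bestModel().parent().extractParamMap(). You can translate it to Scala code. The parent() method returns an estimator, from which you can then get the best parameters.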

【Comments】:

This is also the right answer for pySpark! The key is "parent"! In pySpark I use modelOnly.bestModel.stages[-1]._java_obj.parent().getRegParam().

【Answer 6】:

To print everything in the paramMap, you don't actually have to call parent:

cvModel.bestModel.extractParamMap()

To answer the OP's question and get a single best parameter, for example regParam:

cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))

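【Comments】:

Note that this solution works for a single object. In the case of a Pipeline, it returns an empty map.

【Answer 7】:

I am using Spark Scala 1.6.x. Here is a complete example of how I set up and fit a CrossValidator and then retrieve the parameter value that was used to obtain the best model (assuming training.toDF gives a usable DataFrame):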

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Instantiate a LogisticRegression object
val lr = new LogisticRegression()

// Instantiate a ParamGrid with different values for the 'RegParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()

// Setting and fitting the CrossValidator on the training set, using 'MultiClassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)

// Getting the value of the 'RegParam' used to get the best model
val bestModel = cvModel.bestModel                    // Getting the best model
val paramReference = bestModel.getParam("regParam")  // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.get(paramReference)       // Getting the value of this parameter
print(paramValue)                                    // In my case : 0.001

You can do the same for any parameter and for any other type of model.
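
A PySpark counterpart of the last three steps might look like the sketch below. `param_value` is a hypothetical helper name; it assumes the model provides `getParam(name)` and `getOrDefault(param)`, as PySpark's Params objects do, and that the fitted model actually carries the parameter in the PySpark version at hand.

```python
def param_value(model, name):
    """Look up a parameter on a model by name and return its value.

    Assumption: model behaves like a PySpark Params object, i.e. it provides
    getParam(name) and getOrDefault(param).
    """
    param = model.getParam(name)      # reference to the Param, not its value
    return model.getOrDefault(param)  # explicitly set value, else the default
```

For a non-pipeline estimator this would be used as `param_value(cv_model.bestModel, "regParam")`.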

【Comments】:

【Answer 8】:

If you are using Java, look at what this shows in the debugger:

bestModel.parent().extractParamMap()

【Comments】:

【Answer 9】:

Building on @macfeliga's solution, here is a one-liner that works for pipelines:

cvModel.bestModel.asInstanceOf[PipelineModel]
    .stages.foreach(stage => println(stage.extractParamMap))

【Comments】:

【Answer 10】:

This SO thread kind of answers the question.

In short, you need to cast each object to the class it is supposed to be.

For the CrossValidatorModel case, here is what I did:

import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel

// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)

// To get the parameters of the best model
(
    reloadedCvModel.bestModel
        .asInstanceOf[PipelineModel]
        .stages(1)
        .asInstanceOf[RandomForestRegressionModel]
        .extractParamMap()
)

In this example my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index of my model is 1.
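
When the stage index is not known in advance, the stage can be looked up by its class instead. The following is a minimal Python sketch of that idea; `find_stage` is a hypothetical helper name, and it only assumes that the pipeline model exposes a `stages` sequence, as PySpark's PipelineModel does.

```python
def find_stage(pipeline_model, stage_type):
    """Return the first stage of the pipeline that is an instance of stage_type.

    Assumption: pipeline_model exposes a `stages` sequence, as a fitted
    PySpark PipelineModel does.
    """
    for stage in pipeline_model.stages:
        if isinstance(stage, stage_type):
            return stage
    raise ValueError("no stage of type {} found".format(stage_type.__name__))
```

In PySpark one would pass the fitted model class, for example `find_stage(cv_model.bestModel, RandomForestRegressionModel)` with the class imported from `pyspark.ml.regression`, and then call `extractParamMap()` on the result.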

【Comments】:

【Answer 11】:

For me, @orangeHIX's solution was perfect:

val cvModel = cv.fit(training)

val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]

cvMejorModelo.parent.extractParamMap()

res86: org.apache.spark.ml.param.ParamMap =
{
    als_08eb64db650d-alpha: 0.05,
    als_08eb64db650d-checkpointInterval: 10,
    als_08eb64db650d-coldStartStrategy: drop,
    als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
    als_08eb64db650d-implicitPrefs: false,
    als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
    als_08eb64db650d-itemCol: product,
    als_08eb64db650d-maxIter: 10,
    als_08eb64db650d-nonnegative: false,
    als_08eb64db650d-numItemBlocks: 10,
    als_08eb64db650d-numUserBlocks: 10,
    als_08eb64db650d-predictionCol: prediction,
    als_08eb64db650d-rank: 1,
    als_08eb64db650d-ratingCol: rating,
    als_08eb64db650d-regParam: 0.1,
    als_08eb64db650d-seed: 1994790107,
    als_08eb64db650d-userCol: user
}

【Comments】: