org.apache.spark.SparkException: Unseen label with TrainValidationSplit

Posted 2023-04-15

Asked: 2017-04-27 16:06:48

Question:

I searched for this error but found nothing related to TrainValidationSplit. I want to do parameter tuning, and doing so with TrainValidationSplit raises the following error: org.apache.spark.SparkException: Unseen label.

I understand why this happens, and increasing trainRatio mitigates the problem but does not solve it completely. For reference, here is (part of) the code:

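# Imports needed by this snippet
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

stages = []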
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]

assemblerInputs = [x+"Index" for x in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [dt]

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6])
             .addGrid(dt.maxBins, [20,40])
             .build())

pipeline = Pipeline(stages=stages)

trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, trainRatio=0.95)

model = trainValidationSplit.fit(train_dataset)
train_dataset = model.transform(train_dataset)

I have seen this answer, but I am not sure whether it applies to my case as well, and I wonder whether there is a more suitable solution. Any help?

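Comments:

Keep in mind that you should split the data into train/test sets *before* doing feature normalization. Otherwise you will run into data leakage.

Answer 1: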

An Unseen label exception is usually associated with StringIndexer.

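You split the data into a training set (95%) and a validation set (5%). I think some category values (in the categoricalCol columns) appear in the validation set but not in the training set.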

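So, during the string indexing stage of validation, the indexer that was fitted on the training split encounters a label it has never seen and throws this exception. Increasing the training ratio raises the chance that the category values in the training set are a superset of those in the validation set, but this is only a workaround, because nothing guarantees it.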
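The mechanism can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: `ToyStringIndexer`, its tie-breaking rule, and the sample values are all hypothetical and exist only to show how an indexer fitted on one split fails on a value that appears only in the other.

```python
from collections import Counter


class ToyStringIndexer:
    """Toy stand-in for a string indexer (hypothetical, not Spark's code)."""

    def fit(self, values):
        # Most frequent value gets index 0; ties are broken alphabetically
        # (the tie-break is an assumption of this sketch).
        counts = Counter(values)
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        self.index = {v: i for i, v in enumerate(ordered)}
        return self

    def transform(self, values):
        out = []
        for v in values:
            if v not in self.index:
                # Mirrors the situation behind the "Unseen label" error
                raise ValueError(f"Unseen label: {v}")
            out.append(float(self.index[v]))
        return out


train_values = ["red", "red", "blue", "red", "blue"]
validation_values = ["blue", "green"]  # "green" never appears in training

indexer = ToyStringIndexer().fit(train_values)
print(indexer.transform(train_values))  # [0.0, 0.0, 1.0, 0.0, 1.0]

try:
    indexer.transform(validation_values)
except ValueError as err:
    print(err)  # Unseen label: green
```

With a 95/5 random split, a rare category can land entirely in the 5% part, which is the case the toy validation list imitates.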

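One possible solution: fit each StringIndexer on train_dataset first, then add the resulting StringIndexerModel to the pipeline stages. That way the indexer sees every category value in train_dataset before TrainValidationSplit divides it.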

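for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    # Fit on the full train_dataset, before TrainValidationSplit divides it
    strIndexModel = stringIndexer.fit(train_dataset)
    # Add the fitted StringIndexerModel (not the unfitted StringIndexer) as the stage
    stages += [strIndexModel]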
