How can I assign labels to features on a RandomForest model in MLlib / Apache Spark

Posted 2023-03-12


[Asked]: 2016-02-18 02:39:23 [Question]:

I have trained a model with more than 100 features using org.apache.spark.mllib.tree.RandomForest, so the resulting decision trees look like this:

> "TreeEnsembleModel classifier with 3 trees
> 
>   Tree 0:
>     If (feature 47 <= 0.0)
>      If (feature 74 <= 0.0)
>       If (feature 62 <= -94069.0)
>        Predict: 0.0
>       Else (feature 62 > -94069.0)
>        Predict: 0.0
>      Else (feature 74 > 0.0)
>       Predict: 0.0
>     Else (feature 47 > 0.0)
>      Predict: 1.0
>   Tree 1:
>     If (feature 83 <= 0.0)
>      Predict: 0.0
>     Else (feature 83 > 0.0)
>      Predict: 1.0
>   Tree 2:
>     If (feature 81 <= 0.0)
>      Predict: 0.0
>     Else (feature 81 > 0.0)
>      If (feature 74 <= 0.0)
>       If (feature 52 <= 19.0)
>        Predict: 1.0
>       Else (feature 52 > 19.0)
>        Predict: 0.0
>      Else (feature 74 > 0.0)
>       Predict: 1.0 "

The data was read from a CSV file that includes a header row, which I saved before processing:

// Split each CSV line into trimmed fields; the first row holds the column names.
val headerAndRows = rdd.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first

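For example, I don't want to see "If (feature 47 <= 0.0)" with a bare feature index; I would like the matching column name from the header to appear instead.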

Any idea how to achieve this (without my having to modify the source code of org.apache.spark.mllib.tree.RandomForest :)?

Thanks a lot!


[Answer 1]:

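I'm not a Spark expert, but I checked all the APIs related to random forests, and the only way I could find is to match the strings in the model's printed output against your header.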

For example, use a regex to replace the substring "feature 47" with header(47).
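The regex idea above can be sketched roughly as follows. This is only a minimal illustration, not part of Spark: the object name `FeatureNames` and the helper `withFeatureNames` are hypothetical, and it assumes that position i in the header array corresponds to feature index i in the vectors the model was trained on. If the label column or any other column was dropped before training, the header would need to be filtered the same way first.

```scala
import scala.util.matching.Regex

// Hypothetical helper (not a Spark API): rewrites "feature <n>" tokens in a
// tree's debug string using a caller-supplied array of feature names.
object FeatureNames {
  // Matches the "feature 47" style tokens seen in the printed trees.
  private val FeaturePattern: Regex = """feature (\d+)""".r

  def withFeatureNames(debugString: String, header: Array[String]): String =
    FeaturePattern.replaceAllIn(debugString, m => {
      val idx = m.group(1).toInt
      // Assumption: header(idx) names feature idx. Indices outside the
      // header are left unchanged rather than guessed.
      val name = if (idx < header.length) header(idx) else m.matched
      // Escape '$' and '\' so the name is inserted literally.
      Regex.quoteReplacement(name)
    })
}
```

With a trained model, one would then presumably call something like `println(FeatureNames.withFeatureNames(model.toDebugString, header))`. Only the printed text changes; the model itself still works with numeric feature indices.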

Another way is to modify the source code of spark.ml.classification.RandomForestClassifier.
