How do I use Spark's Feature Importance on Random Forest?

Posted 2023-03-12

【Title】: How do I use Spark's Feature Importance on Random Forest? 【Posted】: 2016-04-09 21:33:17 【Question】:

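The documentation for Random Forests does not cover feature importances. However, the feature is listed as resolved on Jira and is present in the source code. HERE also says: "The main differences between this API and the original MLlib ensembles API are:

- support for DataFrames and ML Pipelines
- separation of classification vs. regression
- use of DataFrame metadata to distinguish continuous and categorical features
- more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification."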

However, I cannot figure out a syntax that actually calls this new feature.

scala> model
res13: org.apache.spark.mllib.tree.model.RandomForestModel = 
TreeEnsembleModel classifier with 10 trees

scala> model.featureImportances
<console>:60: error: value featureImportances is not a member of org.apache.spark.mllib.tree.model.RandomForestModel
              model.featureImportances

【Answer 1】:

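You have to use the new Random Forests, which live in the DataFrame-based spark.ml package rather than in spark.mllib. Check your imports. The old ones: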

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

The new Random Forests use:

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier

This S.O. answer provides code for extracting the importances.

This S.O. answer explains the sparse vector that is returned.
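
As a rough illustration of the points above, here is a minimal sketch of training with the spark.ml classes and reading `featureImportances`. It assumes Spark 2.x or later (it uses `SparkSession`, which the Spark 1.5/1.6 releases current at the time of the question did not have). The dataset, the column names `f1` to `f3`, the object name `FeatureImportanceSketch`, and the helper `rankImportances` are all made up for the example. The returned vector has one entry per feature, in the order the features were assembled. When Spark prints it as a sparse vector, the format is `(size,[indices],[values])`, so pairing the entries with the original column names is one way to make it readable.

```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object FeatureImportanceSketch {

  // Hypothetical helper: pair each feature name with its importance
  // and sort from most to least important.
  def rankImportances(names: Array[String], importances: Array[Double]): Array[(String, Double)] =
    names.zip(importances).sortBy { case (_, importance) => -importance }

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running in local mode.
    val spark = SparkSession.builder()
      .appName("feature-importance-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset: a 0.0/1.0 label plus three numeric columns.
    val data = Seq(
      (0.0, 1.0, 0.1, 5.0),
      (0.0, 1.2, 0.3, 4.0),
      (0.0, 0.9, 0.2, 6.0),
      (0.0, 1.1, 0.4, 5.5),
      (1.0, 3.0, 0.2, 5.0),
      (1.0, 3.2, 0.1, 4.5),
      (1.0, 2.9, 0.3, 6.0),
      (1.0, 3.1, 0.4, 5.0)
    ).toDF("label", "f1", "f2", "f3")

    // spark.ml estimators expect all features in a single vector column.
    val featureCols = Array("f1", "f2", "f3")
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(data)

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)
      .setSeed(42L)

    // fit() returns the new model type, which exposes featureImportances.
    val model: RandomForestClassificationModel = rf.fit(assembled)

    // Convert to a dense array so index i lines up with featureCols(i).
    val ranked = rankImportances(featureCols, model.featureImportances.toArray)
    ranked.foreach { case (name, importance) => println(f"$name%-4s $importance%.4f") }

    spark.stop()
  }
}
```

The key difference from the failing call in the question is the model type: `RandomForestClassificationModel` from `org.apache.spark.ml.classification` has a `featureImportances` member, while the `org.apache.spark.mllib.tree.model.RandomForestModel` shown in the error message does not.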

【Comments】:

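Using the new imports (RandomForestClassificationModel), how do you train the model? @Climbs_lika_Spyder

@Yaeli778, spark.apache.org/docs/1.5.2/ml-ensembles.html has a good example of training a model.

Can you point to how to get featureImportance from pyspark?

Can you tell us how to work with the feature importances? They are a big SparseVector and are not interpretable. How do you turn them into something useful?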
