SKlearn Random Forest error on input

Posted: 2016-04-18 16:31:50

【Problem description】:

I'm trying to run a fit for my random forest, but I get the following error:

forest.fit(train[features], y)

which returns

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)

/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
    210         """
    211         # Validate or convert input data
--> 212         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    213         if issparse(X):
    214             # Pre-sort indices to avoid that each individual tree of the

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    396                              % (array.ndim, estimator_name))
    397         if force_all_finite:
--> 398             _assert_all_finite(array)
    399 
    400     shape_repr = _shape_repr(array.shape)

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
     52             and not np.isfinite(X).all()):
     53         raise ValueError("Input contains NaN, infinity"
---> 54                          " or a value too large for %r." % X.dtype)
     55 
     56 

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I have coerced my dataframe from float64 to float32 for my features and made sure there are no nulls, so I'm not sure what is raising this error. Let me know if adding more of my code would help.

UPDATE

It was originally a pandas dataframe and I dropped all the NaNs. The original dataframe is survey results with respondent information, and I dropped every question except my DV. I double-checked this by running rforest_df.isnull().sum(), which returned 0. Here is the full code I use for modeling.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC  # RFC assumed to alias RandomForestClassifier

rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']==True], rforest_df[rforest_df['is_train']==False]
forest = RFC(n_jobs=2, n_estimators=50)
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)
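
One thing worth checking here: the float64 → float32 downcast itself can introduce infinities, since any value above the float32 range (roughly 3.4e38) overflows to inf, which is exactly what check_array rejects. A minimal check, assuming train and features as defined above:

import numpy as np

# count NaN / inf entries after the float32 cast -- both should be 0
X = train[features].values.astype(np.float32)
print np.isnan(X).sum(), np.isinf(X).sum()

# columns whose float64 values overflow the float32 range and become inf
too_big = (train[features].abs() > np.finfo(np.float32).max).any()
print too_big[too_big]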

UPDATE

This is what the data looks like:

array([ 0,  1,  2,  3,  4,  3,  3,  5,  6,  7,  8,  7,  9,  6, 10,  6, 11,
        7, 11,  3,  7,  9,  6,  5,  9, 11, 12, 13,  6, 11,  3,  3,  6, 14,
       15,  0,  9,  9,  2,  0, 11,  3,  9,  4,  9,  7,  3,  4,  9, 12,  9,
        7,  6, 13,  6,  0,  0, 16,  6, 11,  4, 10, 11, 11, 17,  3,  6, 16,
        3,  4, 18, 19,  7, 11,  5, 11,  5,  4,  0,  6, 17,  7,  2,  3,  5,
       11,  8,  9, 18,  6,  9,  8,  5, 16, 20,  0,  4,  8, 13, 16,  3, 20,
        0,  5,  4,  2, 11,  0,  3,  0,  6,  6,  6,  9,  4,  6,  5, 11,  0,
       13,  6,  2, 11,  7,  5,  6, 18, 12, 21, 17,  3,  6,  0, 13, 21,  7,
        3,  2, 18, 22,  7,  3,  2,  6,  7,  8,  4,  0,  7, 12,  3,  7,  3,
        2, 11, 19, 11,  6,  2,  9,  3,  7,  9,  9,  5,  6,  8,  0, 18, 11,
        3, 12,  2,  6,  4,  7,  7, 11,  3,  6,  6,  0,  6, 12, 15,  3,  9,
        3,  3,  0,  5,  9,  7,  9, 11,  7,  3, 20,  0,  7,  6,  6, 23, 15,
       19,  0,  3,  6, 16, 13,  5,  6,  6,  3,  6, 11,  9,  0,  6, 23, 16,
        4,  0,  6, 17, 11, 17, 11,  4,  3, 13,  3, 17, 16, 11,  7,  4, 24,
        5,  2,  7,  7,  8,  3,  3, 11,  8,  7, 23,  7,  7, 11,  7, 11,  6,
       15,  3, 25,  7,  4,  5,  3, 17, 20,  3, 26,  7,  9,  6,  6, 17, 20,
        1,  0, 11,  9, 16, 20,  7,  7, 26,  3,  6, 20,  7,  2, 11,  7, 27,
        9,  4, 26, 28,  8,  6,  9, 19,  7, 29,  3,  2, 26, 30,  6, 31,  6,
       18,  3,  0, 18,  4,  7, 32,  0,  2,  8,  0,  5,  9,  4, 16,  6, 23,
        0,  7,  0,  7,  9,  6,  8,  3,  7,  9,  3,  3, 12, 11,  8, 19, 20,
        7,  3,  5, 11,  3, 11,  8,  4,  4,  6,  9,  4,  1,  3,  0,  9,  9,
        6,  7,  8, 33,  8,  7,  9, 34, 11, 11,  6,  9,  9, 17,  8, 19,  0,
        7,  4, 17,  6,  7,  0,  4, 12,  7,  6,  4, 16, 12,  9,  6,  6,  6,
        6, 26, 13,  9,  7,  2,  7,  3, 11,  3,  6,  7, 19,  4,  8,  9, 13,
       11, 15, 11,  4, 18,  7,  7,  7,  0,  5,  4,  6,  0,  3,  7,  4, 25,
       18,  6, 19,  7,  9,  4, 20,  6,  3,  7,  4, 35, 15, 11,  2, 12,  0,
        7, 32,  6, 18,  9,  9,  6,  2,  3, 19, 36, 32,  0,  7,  0,  9, 37,
        3,  5,  6,  5, 34,  2,  6,  0,  7,  0,  7,  3,  7,  4, 18, 18,  7,
        3,  7, 16,  9, 19, 13,  4, 16, 19,  3, 19, 38,  9,  4,  9,  8,  0,
       17,  0,  2,  3,  5,  6,  5, 11, 11,  2,  9,  5, 33,  9,  5,  6, 20,
       13,  3, 39, 13,  7,  0,  9,  0,  4,  6,  7, 16,  7,  0, 21,  5,  3,
       18,  5, 20,  2,  2, 14,  6, 17, 11, 11, 16, 16,  9,  8, 11,  3, 23,
        0, 11,  0,  6,  0,  0,  3, 16,  6,  7,  5,  9,  7, 13,  0, 20,  0,
       25,  6, 16,  8,  4,  4,  2,  8,  7,  5, 40,  3,  8,  5, 12,  8,  9,
        6,  6,  6,  6,  3,  7, 26,  4,  0, 13,  4,  3, 13, 12,  7,  7,  6,
        7, 19, 15,  0, 33,  4,  5,  5, 20,  3, 11,  5,  4,  7,  9,  7, 11,
       36,  9,  0,  6,  6, 11,  6,  4,  2,  5, 18,  8,  5,  5,  2, 25,  4,
       41,  7,  7,  5,  7,  3, 36, 11,  6,  9,  0,  9,  0, 16, 42, 11, 11,
       18,  9,  5, 36,  2,  9,  6,  3, 43,  9, 17, 13,  5,  9,  3,  4,  6,
       44, 37,  0, 45,  2, 18,  8, 46,  2, 12,  9,  9,  3, 16,  6, 12,  9,
        0, 11, 11,  0, 25,  8, 17,  4,  4,  3, 11,  3, 11,  6,  6,  9,  7,
       23,  0,  2,  0,  3,  3,  4,  4,  9,  5, 11, 16,  7,  3, 18, 11,  7,
        6,  6,  6,  5,  9,  6,  3,  9,  7, 17, 11,  4,  9,  2,  3,  0, 26,
        9,  0, 20,  8,  9,  6, 11,  6,  6,  7, 26,  6,  6,  4, 19,  5, 41,
       19, 18, 29,  6,  5, 13,  6, 11,  7,  7,  6,  8,  5,  0,  3, 13, 17,
        6, 20, 11,  6,  9,  6,  2,  7, 11,  9, 20, 12,  7,  6,  8,  7,  4,
        6,  2,  0,  7,  9, 26,  9, 16,  7,  4, 45,  7,  0, 23,  8,  4, 19,
        4, 26, 11,  4,  4,  5,  7,  3,  0, 29, 12,  3,  4, 11,  4, 12,  8,
        7,  5,  0, 47, 12,  0, 25,  6, 16, 20,  5,  8,  4,  4, 11, 12,  0,
        6,  3, 11,  4,  3, 48,  3,  6,  7,  4,  7,  0,  3,  7,  3, 18,  6,
        2,  9,  9, 11,  3,  9,  6, 18, 16,  6, 34,  2,  7,  4,  3, 45,  5,
        0,  7,  2, 17, 17,  9, 18,  5,  6,  5, 15,  5,  7,  6,  9,  0,  7,
       12, 17])

【Comments】:

Please provide a minimal reproducible example. Thanks!

Could you add some information about the structure of train[features]? I assume it is the n_samples by n_features 2D numpy array that RF expects.

The last line of the error also gives you a clue to the problem. Do your input feature vectors contain NaN?

It doesn't, and I double-checked that with isnull().sum() as well; the error that was raised is why I coerced the dataframe from float64 to float32. I know it doesn't contain infinity either. Not sure why this is the error being thrown.

@benj Yes, it is a subset of the dataframe.

【Answer 1】:

This happens when there are empty strings (like '') in the dataset. Also try printing something like

pd.value_counts()

or even

sorted(list(set(...)))

or get the min or max value of each column of the dataset in a loop.
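
For instance, a quick loop over the columns (a minimal sketch, assuming train[features] is the feature DataFrame from the question):

for col in train[features].columns:
    vals = train[features][col]
    # min/max reveal values too large for float32; value_counts surfaces stray strings
    print col, vals.dtype, vals.min(), vals.max()
    print vals.value_counts().head()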

The MinMaxScaler example in the other answer may work, but scaled features don't do particularly well in RF anyway.

【Comments】:

【Answer 2】:

I would first suggest you check the data type of each column in your train[features] df with:

print train[features].dtypes 

If you see that there are non-numeric columns, you can inspect them to make sure there are no unexpected values (e.g. strings, NaNs, etc.) that would cause problems. If you don't mind dropping the non-numeric columns, you can simply select all numeric columns with:

numeric_cols = X.select_dtypes(include=['float64','float32']).columns

You can also add the columns with int dtypes if you like.
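
For example (a sketch, assuming X is the feature DataFrame as above), np.number covers both the float and int dtypes in one go:

import numpy as np

# keep every numeric column (float32/float64/int32/int64, ...)
numeric_cols = X.select_dtypes(include=[np.number]).columns
X = X[numeric_cols]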

If you have values that are too large or too small for the model to handle, it's a sign that scaling the data is a good idea. In sklearn this can be done as follows:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1), copy=True).fit(train[features])
train[features] = scaler.transform(train[features])

Finally, you should consider filling in missing values with sklearn's Imputer, and filling NaNs with something like:

train[features].fillna(0, inplace=True)
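
A sketch of the Imputer route, assuming the pre-0.20 sklearn.preprocessing.Imputer that ships with this Python 2.7 setup (newer sklearn versions use SimpleImputer instead):

from sklearn.preprocessing import Imputer

# replace NaNs with the per-column mean rather than a constant 0
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
train[features] = imputer.fit_transform(train[features])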

【Comments】:
