SKlearn Random Forest error on input
【Posted】2016-04-18 16:31:50
【Question】I'm trying to run a fit for my random forest, but I get the following error:
forest.fit(train[features], y)
returns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)
/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
396 % (array.ndim, estimator_name))
397 if force_all_finite:
--> 398 _assert_all_finite(array)
399
400 shape_repr = _shape_repr(array.shape)
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
52 and not np.isfinite(X).all()):
53 raise ValueError("Input contains NaN, infinity"
---> 54 " or a value too large for %r." % X.dtype)
55
56
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I've already cast my dataframe from float64 to float32 for my features and made sure there are no nulls, so I'm not sure what is triggering this error. Let me know if adding more of my code would help.
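One subtle way this error can appear even after a cast: a value that fits in float64 can overflow to infinity when down-cast to float32, whose maximum is about 3.4e38. A minimal sketch of that failure mode:

```python
import numpy as np

# A value representable in float64 but beyond float32's range (~3.4e38)
big = np.array([1e300], dtype=np.float64)

# The down-cast silently overflows to inf, which is exactly what
# sklearn's check_array / _assert_all_finite rejects
cast = big.astype(np.float32)
print(np.isinf(cast[0]))  # True
```

So casting to float32 can itself *introduce* the infinity the error complains about.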
Update
It was originally a pandas dataframe and I dropped all the NaNs. The original dataframe is survey results with respondent information, and I dropped every question except my DV. I double-checked this by running rforest_df.isnull().sum(), which returned 0. Here is the full code I use for modeling:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC

rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)

# 75/25 train/test split via a random boolean mask
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']], rforest_df[~rforest_df['is_train']]

forest = RFC(n_jobs=2, n_estimators=50)
y, _ = pd.factorize(train['K6_QFAV3'])  # encode the DV as integer labels
forest.fit(train[features], y)
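Worth noting: isnull().sum() only counts NaN, not infinity, so a clean null check does not rule out the error. A quick diagnostic sketch (the frame here is a hypothetical stand-in for train[features]) that checks for both:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train[features]
df = pd.DataFrame({"a": [1.0, 2.0, np.inf], "b": [0.5, np.nan, 3.0]})

print(df.isnull().sum())         # counts NaN only -- misses the inf
print((~np.isfinite(df)).sum())  # counts NaN *and* +/-inf per column

# The rows that would make sklearn's check_array raise
bad_rows = df[~np.isfinite(df).all(axis=1)]
print(bad_rows)
```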
Update
This is what the data looks like:
array([ 0, 1, 2, 3, 4, 3, 3, 5, 6, 7, 8, 7, 9, 6, 10, 6, 11,
7, 11, 3, 7, 9, 6, 5, 9, 11, 12, 13, 6, 11, 3, 3, 6, 14,
15, 0, 9, 9, 2, 0, 11, 3, 9, 4, 9, 7, 3, 4, 9, 12, 9,
7, 6, 13, 6, 0, 0, 16, 6, 11, 4, 10, 11, 11, 17, 3, 6, 16,
3, 4, 18, 19, 7, 11, 5, 11, 5, 4, 0, 6, 17, 7, 2, 3, 5,
11, 8, 9, 18, 6, 9, 8, 5, 16, 20, 0, 4, 8, 13, 16, 3, 20,
0, 5, 4, 2, 11, 0, 3, 0, 6, 6, 6, 9, 4, 6, 5, 11, 0,
13, 6, 2, 11, 7, 5, 6, 18, 12, 21, 17, 3, 6, 0, 13, 21, 7,
3, 2, 18, 22, 7, 3, 2, 6, 7, 8, 4, 0, 7, 12, 3, 7, 3,
2, 11, 19, 11, 6, 2, 9, 3, 7, 9, 9, 5, 6, 8, 0, 18, 11,
3, 12, 2, 6, 4, 7, 7, 11, 3, 6, 6, 0, 6, 12, 15, 3, 9,
3, 3, 0, 5, 9, 7, 9, 11, 7, 3, 20, 0, 7, 6, 6, 23, 15,
19, 0, 3, 6, 16, 13, 5, 6, 6, 3, 6, 11, 9, 0, 6, 23, 16,
4, 0, 6, 17, 11, 17, 11, 4, 3, 13, 3, 17, 16, 11, 7, 4, 24,
5, 2, 7, 7, 8, 3, 3, 11, 8, 7, 23, 7, 7, 11, 7, 11, 6,
15, 3, 25, 7, 4, 5, 3, 17, 20, 3, 26, 7, 9, 6, 6, 17, 20,
1, 0, 11, 9, 16, 20, 7, 7, 26, 3, 6, 20, 7, 2, 11, 7, 27,
9, 4, 26, 28, 8, 6, 9, 19, 7, 29, 3, 2, 26, 30, 6, 31, 6,
18, 3, 0, 18, 4, 7, 32, 0, 2, 8, 0, 5, 9, 4, 16, 6, 23,
0, 7, 0, 7, 9, 6, 8, 3, 7, 9, 3, 3, 12, 11, 8, 19, 20,
7, 3, 5, 11, 3, 11, 8, 4, 4, 6, 9, 4, 1, 3, 0, 9, 9,
6, 7, 8, 33, 8, 7, 9, 34, 11, 11, 6, 9, 9, 17, 8, 19, 0,
7, 4, 17, 6, 7, 0, 4, 12, 7, 6, 4, 16, 12, 9, 6, 6, 6,
6, 26, 13, 9, 7, 2, 7, 3, 11, 3, 6, 7, 19, 4, 8, 9, 13,
11, 15, 11, 4, 18, 7, 7, 7, 0, 5, 4, 6, 0, 3, 7, 4, 25,
18, 6, 19, 7, 9, 4, 20, 6, 3, 7, 4, 35, 15, 11, 2, 12, 0,
7, 32, 6, 18, 9, 9, 6, 2, 3, 19, 36, 32, 0, 7, 0, 9, 37,
3, 5, 6, 5, 34, 2, 6, 0, 7, 0, 7, 3, 7, 4, 18, 18, 7,
3, 7, 16, 9, 19, 13, 4, 16, 19, 3, 19, 38, 9, 4, 9, 8, 0,
17, 0, 2, 3, 5, 6, 5, 11, 11, 2, 9, 5, 33, 9, 5, 6, 20,
13, 3, 39, 13, 7, 0, 9, 0, 4, 6, 7, 16, 7, 0, 21, 5, 3,
18, 5, 20, 2, 2, 14, 6, 17, 11, 11, 16, 16, 9, 8, 11, 3, 23,
0, 11, 0, 6, 0, 0, 3, 16, 6, 7, 5, 9, 7, 13, 0, 20, 0,
25, 6, 16, 8, 4, 4, 2, 8, 7, 5, 40, 3, 8, 5, 12, 8, 9,
6, 6, 6, 6, 3, 7, 26, 4, 0, 13, 4, 3, 13, 12, 7, 7, 6,
7, 19, 15, 0, 33, 4, 5, 5, 20, 3, 11, 5, 4, 7, 9, 7, 11,
36, 9, 0, 6, 6, 11, 6, 4, 2, 5, 18, 8, 5, 5, 2, 25, 4,
41, 7, 7, 5, 7, 3, 36, 11, 6, 9, 0, 9, 0, 16, 42, 11, 11,
18, 9, 5, 36, 2, 9, 6, 3, 43, 9, 17, 13, 5, 9, 3, 4, 6,
44, 37, 0, 45, 2, 18, 8, 46, 2, 12, 9, 9, 3, 16, 6, 12, 9,
0, 11, 11, 0, 25, 8, 17, 4, 4, 3, 11, 3, 11, 6, 6, 9, 7,
23, 0, 2, 0, 3, 3, 4, 4, 9, 5, 11, 16, 7, 3, 18, 11, 7,
6, 6, 6, 5, 9, 6, 3, 9, 7, 17, 11, 4, 9, 2, 3, 0, 26,
9, 0, 20, 8, 9, 6, 11, 6, 6, 7, 26, 6, 6, 4, 19, 5, 41,
19, 18, 29, 6, 5, 13, 6, 11, 7, 7, 6, 8, 5, 0, 3, 13, 17,
6, 20, 11, 6, 9, 6, 2, 7, 11, 9, 20, 12, 7, 6, 8, 7, 4,
6, 2, 0, 7, 9, 26, 9, 16, 7, 4, 45, 7, 0, 23, 8, 4, 19,
4, 26, 11, 4, 4, 5, 7, 3, 0, 29, 12, 3, 4, 11, 4, 12, 8,
7, 5, 0, 47, 12, 0, 25, 6, 16, 20, 5, 8, 4, 4, 11, 12, 0,
6, 3, 11, 4, 3, 48, 3, 6, 7, 4, 7, 0, 3, 7, 3, 18, 6,
2, 9, 9, 11, 3, 9, 6, 18, 16, 6, 34, 2, 7, 4, 3, 45, 5,
0, 7, 2, 17, 17, 9, 18, 5, 6, 5, 15, 5, 7, 6, 9, 0, 7,
12, 17])
【Comments】:
Please provide a minimal reproducible example. Thanks!
Can you add information about the structure of train[features]? I assume it's the n_samples-by-n_features 2D numpy array that RF expects.
The last line of the error also gives you a clue about the problem. Does your input feature vector contain NaN?
It doesn't; I double-checked with isnull().sum() as well, and I cast the dataframe from float64 to float32 because of the error being raised. I know it doesn't contain infinity either. Not sure why this error is being thrown.
@benj Yes, it's a subset of the dataframe.
【Answer 1】:
This happens when there are empty strings (like '') in the dataset. Try printing something like
pd.value_counts()
or even
sorted(list(set(...)))
or getting the min or max of each column of the dataset in a loop.
The MinMaxScaler example above may work, but scaling features generally isn't useful for a random forest.
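To make the empty-string check concrete, here is a small sketch using a made-up frame: a stray '' forces the column to object dtype, and pd.to_numeric(..., errors='coerce') exposes it as NaN:

```python
import pandas as pd

# Made-up survey frame with one stray empty string
df = pd.DataFrame({"q1": [1, 2, ""], "q2": [0.1, 0.2, 0.3]})

for col in df.columns:
    print(col, sorted(set(df[col].astype(str))))  # an '' entry sorts to the front

# Coercing shows exactly which entries are not numeric
coerced = pd.to_numeric(df["q1"], errors="coerce")
print(coerced.isnull().sum())  # 1 -- the '' became NaN
```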
【Discussion】:
【Answer 2】: I would first suggest checking the data type of each column in your train[features] df with:
print train[features].dtypes
If you see any non-numeric columns, you can inspect them to make sure there aren't any unexpected values (e.g. strings, NaN, etc.) that would cause problems. If you don't mind dropping the non-numeric columns, you can simply select all the numeric ones with:
numeric_cols = X.select_dtypes(include=['float64','float32']).columns
You can also include columns with int dtypes if you like.
If you're running into values too large or too small for the model to handle, that's a sign that scaling the data is a good idea. In sklearn, this can be done with:
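A compact way to pick up int and float columns at once is to select on np.number; a sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.5, 2.5], "n": [1, 2], "s": ["a", "b"]})

# np.number matches both integer and float dtypes
numeric_cols = df.select_dtypes(include=[np.number]).columns
print(list(numeric_cols))  # ['x', 'n']
```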
scaler = MinMaxScaler(feature_range=(0,1),copy=True).fit(train[features])
train[features] = scaler.transform(train[features])
Finally, you should consider filling in missing values with sklearn's Imputer, or filling NaNs with something like:
train[features].fillna(0, inplace=True)
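fillna(0) works, but zero can be a misleading stand-in for survey data. An alternative sketch on a made-up frame, imputing each column with its own mean (which is what an imputer with strategy='mean' does):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, 4.0, np.nan]})

# Column-mean imputation: each NaN is replaced by its column's mean
filled = df.fillna(df.mean())
print(filled["a"].tolist())  # [1.0, 2.0, 3.0]
```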
【Discussion】: