numpy.ndarray sparse matrix to dense

Question

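I would like to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit raises ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it does not.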

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

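I tried wrapping it in a SciPy csr_matrix, but that raises errors as well.

Is there any way to make random forest accept this data? (I am not sure the dense version would actually fit in memory, but that is another matter...)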
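The repr above (shape-less array, dtype=object) suggests that X_train is a 0-dimensional object array holding a single SciPy matrix rather than a numeric array. A minimal sketch, assuming that is the case, of pulling the matrix back out with ndarray.item(). The small identity matrix is a stand-in for the real data, and the names wrapped and inner are only illustrative:

```python
import numpy as np
from scipy import sparse

# Stand-in for the real data: a tiny CSR matrix.
matrix = sparse.csr_matrix(np.eye(3))

# Rebuild the situation from the question: a 0-d ndarray of dtype=object
# whose single element is the sparse matrix.
wrapped = np.empty((), dtype=object)
wrapped[()] = matrix

print(type(wrapped), wrapped.shape)  # an ndarray with shape ()

# .item() returns the single Python object stored in the 0-d array.
inner = wrapped.item()
print(sparse.issparse(inner), inner.shape)
```

If this assumption holds for the loaded file, X_train.item() would give back the original CSR matrix without densifying anything.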

Edit 1

The code that generates the error is just this:

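import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy')  # this returns an ndarray (dtype=object, see the repr above)
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)  # raises the ValueError quoted above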

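As for the suggestion to use toarray(), an ndarray has no such method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Also, as mentioned above, for this particular data I would need terabytes of memory to hold the dense array. Is there an option to run RandomForestClassifier on a sparse array?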
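For reference, the scikit-learn documentation for RandomForestClassifier.fit lists X as "{array-like, sparse matrix}", so a SciPy sparse matrix can be passed directly in versions where that applies. A minimal sketch on small random data (the sizes, density, and labels here are made up for illustration; whether the real 1443899x1936774 matrix trains within available memory and time is untested):

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in: 200 samples, 50 features, 5% non-zero entries.
X_sparse = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

# Made-up binary labels, one per row of X_sparse.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# fit() receives the sparse matrix as-is; no todense()/toarray() call.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_sparse, y)

predictions = model.predict(X_sparse)
print(predictions.shape)
```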

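Edit 2

It seems the data should have been saved with SciPy's sparse tools, as described in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more data would have had to be saved.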
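A minimal sketch of that round trip, assuming a SciPy version that provides scipy.sparse.save_npz and scipy.sparse.load_npz. The tiny matrix and the file name train.npz are placeholders, not the real data:

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny stand-in for the real training matrix.
matrix = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                     [2.0, 0.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.npz")  # hypothetical file name
    sparse.save_npz(path, matrix)   # stores the sparse structure, not a pickled object
    loaded = sparse.load_npz(path)  # comes back as a sparse matrix, not an ndarray

# Element-wise comparison: zero differing entries means the round trip was lossless.
same = (loaded != matrix).nnz == 0
print(type(loaded), loaded.shape, same)
```

Loading this way returns the sparse matrix itself, so there is no object-dtype ndarray wrapper to unwrap.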
