How to apply a RandomForest classifier to an entire dataset, a small portion at a time, in Python

Question

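I'm working on a Kaggle competition where the test dataset has 880,000 rows. I'd like to apply a random forest classifier to it in chunks of 10,000 rows, but still cover all of the rows. Here is how my classifier is set up: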

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
# Training data features, skip the first column 'Crime Category'
train_features = train[:, 1:]

# 'Crime Category' column values
train_target = train[:, 0]

clf = clf.fit(train_features, train_target)
score = clf.score(train_features, train_target)
print("Mean accuracy of Random Forest: {0}".format(score))

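I use this to train my model and get its accuracy. I made the training data smaller so that I get results faster. But in order to submit to Kaggle, I need to predict on the test data. Basically I want to do this: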

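import numpy as np

test_x = testing_data[:, 1:]
chunk_size = 10000
print('-' * 38)
predictions = []
for start in range(0, len(test_x), chunk_size):
    # Predict one chunk of up to 10,000 rows
    predictions.append(clf.predict(test_x[start:start + chunk_size]))
    print('.', end='', flush=True)
# Combine the per-chunk predictions into a single array
test_y = np.concatenate(predictions)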

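For every 10,000 rows I want to predict the values, store the predictions somewhere, and then move on to the next 10,000 rows. Whenever I try to predict all 880,000 rows at once, my computer freezes. I'm hoping that by doing 10,000 rows at a time and calling print("."), I'll get a progress bar. I loaded test.csv with pandas and converted the DataFrame to an array with test = test.values.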

I've tried to provide as much information as I can; please let me know if you need more.
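
Since test.csv is loaded with pandas, one possible variation is to never hold all 880,000 rows in memory at all and let pandas stream the file in pieces. The sketch below is only an illustration under stated assumptions: the helper name predict_csv_in_chunks and its parameters are made up for this example, and it assumes (as in the question) that the first column is not a feature and that the classifier has already been fitted.

```python
import numpy as np
import pandas as pd


def predict_csv_in_chunks(clf, csv_path, chunk_size=10_000, feature_start=1):
    """Predict on a CSV file chunk by chunk instead of loading it all at once.

    Hypothetical helper. `feature_start=1` assumes the first column is not
    a feature and should be skipped, mirroring `testing_data[:, 1:]`.
    """
    predictions = []
    # With chunksize set, read_csv yields DataFrames of up to chunk_size rows.
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        features = chunk.to_numpy()[:, feature_start:]
        predictions.append(clf.predict(features))
        print(".", end="", flush=True)  # one dot per finished chunk
    print()
    # Join the per-chunk predictions into one array, in file order.
    return np.concatenate(predictions)
```

With a fitted classifier this would be called as test_y = predict_csv_in_chunks(clf, "test.csv"). Whether it actually stops the freezing depends on where the memory goes; if the pressure comes from the model itself rather than from the test data, chunking the input may not be enough.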
