Why does RandomForestClassifier get very different scores on CPU (using SKLearn) and on GPU (using RAPIDS)?

Posted 2023-03-12

[Asked]: 2020-06-24 07:56:10

[Question]:

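I am using RandomForestClassifier on CPU with SKLearn and on GPU with RAPIDS. I am benchmarking speedup and score between the two libraries on the Iris dataset (this is a first attempt; later I will switch to another dataset for a better benchmark, but I am starting out with these two libraries).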

The problem is that when I measure the score on CPU I always get 1.0, but when I measure it on GPU I get a value that varies between 0.2 and 1.0, and I don't understand why this happens.

First of all, the library versions I am using are:

NumPy Version: 1.17.5
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
cuPY Version: 6.7.0
cuDF Version: 0.12.0
cuML Version: 0.12.0
Dask Version: 2.10.1
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.12.0
MatPlotLib Version: 3.1.3
SeaBorn Version: 0.10.0

The code I use for the SKLearn RandomForestClassifier is:

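# Imports (aliases as used below)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier
from sklearn.metrics import accuracy_score as sk_accuracy_score
from sklearn.model_selection import train_test_split as sk_train_test_split

# Read data in host memory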
host_s_csv = pd.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
host_s_data = host_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
host_s_labels = host_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column

# Plot data
#sns.pairplot(host_s_csv, hue = 'variety');

# Split train and test data
host_s_data_train, host_s_data_test, host_s_labels_train, host_s_labels_test = sk_train_test_split(host_s_data, host_s_labels, test_size = 0.2, random_state = 0)

# Create RandomForest model
sk_s_random_forest = skRandomForestClassifier(n_estimators = 40,
                                             max_depth = 16,
                                             max_features = 1.0,
                                             random_state = 10, 
                                             n_jobs = 1)

# Fit data in RandomForest
sk_s_random_forest.fit(host_s_data_train, host_s_labels_train)

# Predict data
sk_s_random_forest_labels_predicted = sk_s_random_forest.predict(host_s_data_test)

# Check score
print('accuracy_score: ', sk_accuracy_score(host_s_labels_test, sk_s_random_forest_labels_predicted))

The code I use for the RAPIDS RandomForestClassifier is:

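# Imports (aliases as used below; cuML 0.12 import paths)
import cudf
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split

# Read data in device memory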
device_s_csv = cudf.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
device_s_data = device_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
device_s_labels = device_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column

# Plot data
#sns.pairplot(device_s_csv.to_pandas(), hue = 'variety');

# Split train and test data
device_s_data_train, device_s_data_test, device_s_labels_train, device_s_labels_test = cu_train_test_split(device_s_data, device_s_labels, train_size = 0.8, shuffle = True, random_state = 0)

# Use same data as host
#device_s_data_train = cudf.DataFrame.from_pandas(host_s_data_train)
#device_s_data_test = cudf.DataFrame.from_pandas(host_s_data_test)
#device_s_labels_train = cudf.Series.from_pandas(host_s_labels_train).astype('int32')
#device_s_labels_test = cudf.Series.from_pandas(host_s_labels_test).astype('int32')

# Create RandomForest model
cu_s_random_forest = cusRandomForestClassifier(n_estimators = 40,
                                               max_depth = 16,
                                               max_features = 1.0,
                                               n_streams = 1)

# Fit data in RandomForest
cu_s_random_forest.fit(device_s_data_train, device_s_labels_train)

# Predict data
cu_s_random_forest_labels_predicted = cu_s_random_forest.predict(device_s_data_test)

# Check score
print('accuracy_score: ', cu_accuracy_score(device_s_labels_test, cu_s_random_forest_labels_predicted))

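An example of the iris dataset I am using is:

Do you know why this happens? Both models are configured the same way, with the same parameters, ... I have no idea why there is such a big difference between the scores.

Thanks.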

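[Comments]:

I don't know the RAPIDS library, but when computations are done on the GPU, the data usually has to be formatted first. So I would guess the problem is either in the formatting step or in the computations performed. Do you happen to know which part of the algorithm is computed on the GPU? (A large difference in results for this algorithm usually means the split rules are computed differently.)

[Answer 1]: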

I tried this based on your example above, converting things to numpy, and it worked:

import numpy as np
# .to_numpy() replaces the deprecated .as_matrix() (available since pandas 0.24)
train_label_np = host_s_labels_train.to_numpy().astype(np.int32)
train_data_np = host_s_data_train.to_numpy().astype(np.float32)
test_label_np = host_s_labels_test.to_numpy().astype(np.int32)
test_data_np = host_s_data_test.to_numpy().astype(np.float32)

cu_s_random_forest = cusRandomForestClassifier(n_estimators = 40,
                                           max_depth = 16, n_bins =16,
                                           max_features = 1.0,
                                           n_streams = 1)

# Fit data in RandomForest
cu_s_random_forest.fit(train_data_np,train_label_np)

# Predict data (GPU does not predict for multi-class at the moment. Fixed in 0.13)
predict_np = cu_s_random_forest.predict(test_data_np, predict_model='CPU')

# Check score
print('accuracy_score: ', sk_accuracy_score(test_label_np, predict_np))

[Comments]:

By the way, this also works with cudf instead of numpy. I used cuml-0.13.

[Answer 2]:

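This is caused by a known issue in our predict code, which was corrected in 0.13 with a warning and a fallback to CPU when doing multi-class classification. In 0.12 we had neither the warning nor the fallback, so if you didn't know to use predict_model="CPU" for multi-class classification, you would get worse scores than you would expect from your fitted model.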

See the issue here: https://github.com/rapidsai/cuml/issues/1623
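
To make the effect on the score concrete, here is a small pure-Python sketch. It assumes, purely for illustration, a predictor that can only emit two of the three iris labels; this thread does not say how the 0.12 GPU predict path actually fails, and the sketch does not reproduce the exact 0.2 to 1.0 range reported in the question. The function names and the balanced 30-sample test set are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_accuracy_with_labels(y_true, allowed_labels):
    """Upper bound on accuracy for a predictor restricted to allowed_labels.

    Even an otherwise perfect predictor misses every sample whose true
    label it is unable to emit.
    """
    counts = Counter(y_true)
    reachable = sum(counts[label] for label in allowed_labels)
    return reachable / len(y_true)


# Hypothetical balanced test set with the three iris classes (not real data).
y_true = [0] * 10 + [1] * 10 + [2] * 10

print(best_accuracy_with_labels(y_true, {0, 1, 2}))  # every class reachable
print(best_accuracy_with_labels(y_true, {0, 1}))     # class 2 can never be right
```

Because the ceiling depends on how many samples of the unreachable class land in the test set, a shuffled 80/20 split would move it from run to run, which is at least consistent with a score that changes between runs.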

Here is some code to help you and others. It has been modified so that it will be a bit easier for others to use in the future. I get ~0.9333 on a GV100 with RAPIDS 0.12 stable.

import cudf as cu
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
import numpy as np

# data link: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv

# Read data
df = cu.read_csv('./iris.csv', header = 0, delimiter = ',') # Get complete CSV

# Prep data
X = df.iloc[:, [0, 1, 2, 3]].astype(np.float32) # Get data columns.  Must be float32 for our Classifier
y = df.iloc[:, 4].astype('category').cat.codes # Get labels column.  Will convert to int32

cu_s_random_forest = cusRandomForestClassifier(
                                           n_bins = 16, 
                                           n_estimators = 40,
                                           max_depth = 16,
                                           max_features = 1.0,
                                           n_streams = 1)

train_data, test_data, train_label, test_label = cu_train_test_split(X, y, train_size=0.8)

# Fit data in RandomForest
cu_s_random_forest.fit(train_data,train_label)

# Predict data
predict = cu_s_random_forest.predict(test_data, predict_model="CPU") # use CPU to do multi-class classifications
print(predict)

# Check score
print('accuracy_score: ', cu_accuracy_score(test_label, predict))
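
If the same script has to run on both 0.12 and later releases, the workaround can be made conditional. The helper below is a hypothetical sketch, not part of cuML: it encodes only what this answer states (before 0.13, multi-class prediction needs predict_model="CPU" set explicitly) and assumes the version string starts with "major.minor".

```python
import re


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '0.12.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def predict_kwargs(cuml_version, n_classes):
    """Keyword arguments to pass to RandomForestClassifier.predict.

    Assumption taken from this thread: before 0.13 there is no automatic
    CPU fallback, so multi-class prediction needs predict_model='CPU'.
    """
    if n_classes > 2 and parse_major_minor(cuml_version) < (0, 13):
        return {"predict_model": "CPU"}
    return {}


print(predict_kwargs("0.12.0", n_classes=3))

# Usage with the names from the code above (requires cuML, so left commented):
# import cuml
# kwargs = predict_kwargs(cuml.__version__, n_classes=3)
# predict = cu_s_random_forest.predict(test_data, **kwargs)
```

Passing predict_model="CPU" unconditionally, as the code above does, is simpler; the guard only matters if you want newer releases to choose their own prediction path.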

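[Comments]:

Is it normal for the GPU-based models of cuml, xgboost, and lightgbm to consistently score a few percentage points lower than their CPU siblings?

Thanks for your explanation, @TaureanDyenNV. Now I understand where the problem is. Do you plan to include a GPU multi-class predict model in the near future?

@SergeyBushmanov It goes back and forth. I have had accuracy scores a fair bit better than CPU and some that are worse than CPU, but not by much (unless there was an issue). There are lots of variables in non-deterministic training such as random forest. However, if we are scoring significantly worse, we would love to know and see what is wrong so we can improve it. Can you share some examples on our slack channel?

@JuMoGar We are constantly improving our code. Looking at github.com/rapidsai/cuml/pull/1757, it seems we were trying to get it into 0.13 but pushed it to 0.14. Please join our community on github, slack, and twitter so you can talk with us and we can better keep you updated: rapids.ai/community.html#rapids-community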
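
Since random forest training is non-deterministic, a single score per library says little. One way to compare them is to repeat the whole split, fit, predict, and score cycle over several seeds and look at the mean and spread. The sketch below is a generic helper under that assumption; summarize_runs and fake_run are hypothetical names, and fake_run returns made-up numbers only so the example runs without a GPU.

```python
import statistics


def summarize_runs(run_once, seeds):
    """Call run_once(seed) for each seed and summarize the returned accuracies."""
    scores = [run_once(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def fake_run(seed):
    # Stand-in for "split, fit, predict, score" with the given seed.
    # These values are placeholders, not measured results.
    return 0.9 + 0.01 * (seed % 3)


print(summarize_runs(fake_run, range(6)))
```

In a real comparison, run_once would wrap the SKLearn or cuML pipeline from this thread and return its accuracy; overlapping ranges between the two libraries would point to ordinary run-to-run variance, while a consistent gap would be worth reporting.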