Something wrong when implementing SVM one-vs-all in Python
Posted: 2021-03-26 00:05:28

[Problem description]: I am trying to verify that I have correctly understood how SVM-OVA (one-versus-all) works, by comparing the OneVsRestClassifier function with my own implementation.
In the code below, I train num_classes classifiers in the training phase, then evaluate all of them on the test set and select the one returning the highest probability value.
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale
# Read dataset
df = pd.read_csv('In/winequality-white.csv', delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)
# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]
# From dataset to features and labels
X_train = train.loc[:, train.columns != 'quality']
Y_train = train.loc[:, train.columns == 'quality']
X_test = test.loc[:, test.columns != 'quality']
Y_test = test.loc[:, test.columns == 'quality']
# Models
clf = [None] * num_classes
for k in np.arange(0, num_classes):
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced', probability=True)
    clf[k] = my_model.fit(X_train, Y_train == my_classes[k])
# Prediction
prob_table = np.zeros((len(Y_test), num_classes))
for k in np.arange(0, num_classes):
    p = clf[k].predict_proba(X_test)
    prob_table[:, k] = p[:, list(clf[k].classes_).index(True)]
Y_pred = prob_table.argmax(axis=1)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")
The test accuracy is 0.21, whereas with the OneVsRestClassifier function it is 0.59. For completeness, I also report the other code (the preprocessing steps are the same as above):
....
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")
Is there something wrong with my own SVM-OVA implementation?
[Question discussion]:

My guess is that you shouldn't use the predict_proba method in your own version while the built-in version uses the predict method. I would also guess that the accuracy_score function works on predictions, not on predicted probabilities...

@AlexanderRiedel accuracy_score does work on predictions. I don't think predict_proba changes SVC's predictions compared to the predict method. I think predict_proba just attaches probability values to the decision function...

Yes, but predict_proba does not return predictions, it returns a matrix of probabilities... I don't understand why you don't just predict the classes instead of the probabilities
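A quick sketch of the distinction raised in these comments (synthetic data, purely illustrative): with probability=True, SVC's predict and the argmax of predict_proba can actually disagree on borderline samples, because predict uses the decision function while predict_proba uses a separate, cross-validated Platt-scaling fit.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Small synthetic binary problem, for illustration only
X, y = make_classification(n_samples=60, random_state=0)
clf = SVC(probability=True, random_state=0).fit(X, y)

labels = clf.predict(X)                 # hard labels from the decision function
probs = clf.predict_proba(X)            # Platt-scaled probabilities
labels_from_proba = clf.classes_[probs.argmax(axis=1)]

# The two label vectors usually agree, but are not guaranteed to,
# since the probabilities come from a separate calibration step.
disagreements = int(np.sum(labels != labels_from_proba))
```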
[Answer 1]:

Is there something wrong with my own SVM-OVA implementation?

You have the unique classes array([3, 4, 5, 6, 7, 8, 9]), but the line Y_pred = prob_table.argmax(axis=1) assumes they are 0-indexed.
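A minimal, self-contained illustration of this indexing issue (the probability values below are made up): argmax returns column positions, which must be mapped back through the class array.

```python
import numpy as np

# The wine-quality classes are NOT 0-indexed
my_classes = np.array([3, 4, 5, 6, 7, 8, 9])

# Hypothetical OVA probability table for 3 test samples (one column per class)
prob_table = np.array([
    [0.1, 0.1, 0.5, 0.1, 0.1, 0.05, 0.05],   # max in column 2 -> class 5
    [0.0, 0.0, 0.1, 0.7, 0.1, 0.05, 0.05],   # max in column 3 -> class 6
    [0.0, 0.0, 0.0, 0.1, 0.2, 0.60, 0.10],   # max in column 5 -> class 8
])

wrong = prob_table.argmax(axis=1)               # column indices: 2, 3, 5
right = my_classes[prob_table.argmax(axis=1)]   # actual labels: 5, 6, 8
```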
Try refactoring your code to avoid errors caused by such assumptions:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
df = pd.read_csv('winequality-white.csv', delimiter=';')
y = df["quality"]
my_classes = np.unique(y)
X = df.drop("quality", axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X,y, random_state=42)
# Models
clfs = []
for k in my_classes:
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced',
                   probability=True, random_state=42)
    clfs.append(my_model.fit(X_train, Y_train == k))
# Prediction
prob_table = np.zeros((len(X_test),len(my_classes)))
for i, clf in enumerate(clfs):
    probs = clf.predict_proba(X_test)[:, 1]
    prob_table[:, i] = probs
Y_pred = my_classes[prob_table.argmax(1)]
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100,)
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf',
                              class_weight='balanced', random_state=42))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100,)
Test accuracy = 61.795918367346935
Test accuracy = 58.93877551020408
Note the difference between the probability-based OVR and the label-based one: the probability-based variant is more granular and yields better results here.
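For comparison, here is a hedged sketch of the label/score-based flavor: instead of Platt-scaled probabilities, it ranks classes by their decision_function margins, which is close to what sklearn's OneVsRestClassifier does internally. The built-in wine dataset stands in for the CSV used above, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

classes = np.unique(y_tr)
scores = np.zeros((len(X_te), len(classes)))
for i, c in enumerate(classes):
    # One binary classifier per class; no probability calibration needed
    clf = SVC(gamma="auto", C=1000, class_weight="balanced").fit(X_tr, y_tr == c)
    scores[:, i] = clf.decision_function(X_te)  # signed margin for "is class c"

# Pick the class whose binary classifier is most confident
y_pred = classes[scores.argmax(axis=1)]
```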
For further experiments, you may want to wrap the classifiers into a reusable class:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class OVRBinomial(BaseEstimator, ClassifierMixin):

    def __init__(self, cls):
        self.cls = cls

    def fit(self, X, y, **kwargs):
        self.classes_ = np.unique(y)
        self.clfs_ = []
        for c in self.classes_:
            clf = self.cls(**kwargs)
            clf.fit(X, y == c)
            self.clfs_.append(clf)
        return self

    def predict(self, X, **kwargs):
        probs = np.zeros((len(X), len(self.classes_)))
        for i, c in enumerate(self.classes_):
            prob = self.clfs_[i].predict_proba(X, **kwargs)[:, 1]
            probs[:, i] = prob
        idx_max = np.argmax(probs, 1)
        return self.classes_[idx_max]
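One possible end-to-end usage of the wrapper (the class definition is repeated so the snippet runs on its own; the built-in wine dataset stands in for the CSV, and probability=True must be forwarded because predict relies on predict_proba):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

class OVRBinomial(BaseEstimator, ClassifierMixin):
    """One-vs-rest wrapper: one binary classifier per class, argmax of P(class)."""

    def __init__(self, cls):
        self.cls = cls

    def fit(self, X, y, **kwargs):
        self.classes_ = np.unique(y)
        self.clfs_ = []
        for c in self.classes_:
            clf = self.cls(**kwargs)  # kwargs are forwarded to the wrapped estimator
            clf.fit(X, y == c)
            self.clfs_.append(clf)
        return self

    def predict(self, X):
        probs = np.zeros((len(X), len(self.classes_)))
        for i, c in enumerate(self.classes_):
            probs[:, i] = self.clfs_[i].predict_proba(X)[:, 1]
        return self.classes_[np.argmax(probs, axis=1)]

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

ovr = OVRBinomial(SVC)
ovr.fit(X_tr, y_tr, gamma="auto", C=1000, class_weight="balanced",
        probability=True, random_state=42)
y_pred = ovr.predict(X_te)
acc = accuracy_score(y_te, y_pred) * 100
```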
[Comments]:
[Answer 2]: The prediction part of your code has a bug. With the command Y_pred = prob_table.argmax(axis=1) you get the index of the column with the highest probability. But you want the class with the highest probability, not the column index:

Y_pred = my_classes[prob_table.argmax(axis=1)]
[Comments]:
[Answer 3]: The basis of one-vs-rest is to predict the probability of the "one" class (ignoring the probabilities of the "rest" classes) with each estimator, and then take the estimator with the highest probability. pandas can do this with .idxmax, which returns the name of the column holding the highest probability.
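A minimal illustration of .idxmax(axis=1) (the probabilities are made up; the column labels play the role of class ids):

```python
import pandas as pd

# Each column is one class's OVR probability, each row one test sample
probs = pd.DataFrame({
    0: [0.2, 0.7, 0.1],
    1: [0.5, 0.2, 0.3],
    2: [0.3, 0.1, 0.6],
})

# idxmax(axis=1) returns, per row, the column label with the largest value
y_pred = probs.idxmax(axis=1)  # -> 1, 0, 2
```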
This should work:
import pandas
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
# Read/load dataset
dataset = load_wine()
X = dataset["data"]
y = dataset["target"]
classes = {
    key: value
    for key, value in zip(range(len(dataset["target_names"])), dataset["target_names"])
}
# Create a train/test split (training set is 80% of the data, make sure the different classes are balanced across train and test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=43, shuffle=True, stratify=y
)
# Create a set of models
estimators = {}
for class_number, class_name in classes.items():
    # Create a model
    estimator = SVC(
        gamma="auto", C=1000, kernel="rbf", class_weight="balanced", probability=True
    )
    # Fit the model; y is 1 if the class is the target for this estimator, otherwise (rest) 0
    estimator = estimator.fit(
        X_train, [1 if element == class_number else 0 for element in y_train]
    )
    # Store the trained model
    estimators[class_number] = estimator
# Make predictions
prediction_probabilities = {}
for class_number, estimator in estimators.items():
    # Every estimator predicts the probability for its target class
    prediction_probabilities[class_number] = estimator.predict_proba(X_test)[:, 1]
# Combine the probabilities into a dataframe
prediction_probabilities_df = pandas.DataFrame(prediction_probabilities)
# The prediction for each row is the column with the highest probability
y_pred = prediction_probabilities_df.idxmax(axis=1)
# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (custom OneVsRest): {accuracy}")
# Create the model
clf = OneVsRestClassifier(
    SVC(gamma="auto", C=1000, kernel="rbf", class_weight="balanced")
)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (Scikit-Learn OneVsRest): {accuracy}")
Output:
Test accuracy (custom OneVsRest): 47.22222222222222
Test accuracy (Scikit-Learn OneVsRest): 41.66666666666667
[Comments]: