Massively worse performance in Tensorflow compared to Scikit-Learn for Logistic Regression


Posted: 2019-01-16 06:04:52

Question:

I'm trying to implement a logistic regression classifier on a numerical dataset. The model I built in Tensorflow can't achieve good accuracy or loss, so to check whether the problem was in the data I tried scikit-learn's own LogisticRegression, and got much better results. The gap is so large that I suspect I'm making some very basic mistake on the tf side...

Data preprocessing:

import numpy as np
import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dt = pd.read_csv('data.csv', header=0)
npArray = np.array(dt)
xvals = npArray[:, 1:].astype(float)  # features
yvals = npArray[:, 0]                 # labels (scores 1-10)
x_proc = preprocessing.scale(xvals)   # zero mean, unit variance per column

XTrain, XTest, yTrain, yTest = train_test_split(x_proc, yvals, random_state=1)

If I now run logistic regression with sklearn:

log_reg = LogisticRegression(class_weight='balanced')
log_reg.fit(XTrain, yTrain)
yPred = log_reg.predict(XTest)
print (metrics.classification_report(yTest, yPred))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, yPred),2))

...I get the following classification report:

        precision    recall  f1-score   support
      1       1.00      0.98      0.99        52
      2       0.96      1.00      0.98        52
      3       0.98      0.96      0.97        51
      4       0.98      0.97      0.97        58
      5       1.00      0.95      0.97        37
      6       0.93      1.00      0.96        65
      7       1.00      0.95      0.97        41
      8       0.94      0.98      0.96        50
      9       1.00      0.98      0.99        45
     10       1.00      0.98      0.99        49
 avg/total    0.98      0.98      0.98       500

 Overall Accuracy: 0.98

Good stuff, right? Here is the Tensorflow code from the same point (after the split):

yTrain.resize(len(yTrain),10) #the labels are scores between 1 and 10
yTest.resize(len(yTest),10)

tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, 8], name="input") 
Y = tf.placeholder(tf.float32, [None, 10])

W = tf.Variable(tf.zeros([8, 10])) 
b = tf.Variable(tf.zeros([10])) 

out = (tf.matmul(X, W) + b)
pred = tf.nn.softmax(out, name="output")

learning_rate = 0.001
training_epochs = 100
batch_size = 200
display_step = 1

L2_LOSS = 0.01

l2 = L2_LOSS * \
    sum(tf.nn.l2_loss(tf_var) for tf_var in tf.trainable_variables())

# Minimize error using cross entropy
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits = out, labels = Y)) + l2
# Adam optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

train_count = len(XTrain)

#defining accuracy
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(Y, 1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

#----Training the model------------------------------------------
saver = tf.train.Saver()

history = dict(train_loss=[], 
                 train_acc=[], 
                 test_loss=[], 
                 test_acc=[])

sess=tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

for i in range(1, training_epochs + 1):
    for start, end in zip(range(0, train_count, batch_size),
                      range(batch_size, train_count + 1,batch_size)):
        sess.run(optimizer, feed_dict={X: XTrain[start:end],
                                       Y: yTrain[start:end]})

    _, acc_train, loss_train = sess.run([pred, accuracy, cost],
                                        feed_dict={X: XTrain, Y: yTrain})

    _, acc_test, loss_test = sess.run([pred, accuracy, cost],
                                      feed_dict={X: XTest, Y: yTest})

    history['train_loss'].append(loss_train)
    history['train_acc'].append(acc_train)
    history['test_loss'].append(loss_test)
    history['test_acc'].append(acc_test)

    if i != 1 and i % 10 != 0:
        continue

    print(f'epoch: {i} test accuracy: {acc_test} loss: {loss_test}')

predictions, acc_final, loss_final = sess.run([pred, accuracy, cost], feed_dict={X: XTest, Y: yTest})

print()
print(f'final results: accuracy: {acc_final} loss: {loss_final}')

Now I get the following:

epoch: 1 test accuracy: 0.41200000047683716 loss: 0.6921926140785217
epoch: 10 test accuracy: 0.5 loss: 0.6909801363945007
epoch: 20 test accuracy: 0.5180000066757202 loss: 0.6918861269950867
epoch: 30 test accuracy: 0.515999972820282 loss: 0.6927152872085571
epoch: 40 test accuracy: 0.5099999904632568 loss: 0.6933282613754272
epoch: 50 test accuracy: 0.5040000081062317 loss: 0.6937957406044006
epoch: 60 test accuracy: 0.5019999742507935 loss: 0.6941683292388916
epoch: 70 test accuracy: 0.5019999742507935 loss: 0.6944747567176819
epoch: 80 test accuracy: 0.4959999918937683 loss: 0.6947320103645325
epoch: 90 test accuracy: 0.46799999475479126 loss: 0.6949512958526611
epoch: 100 test accuracy: 0.4560000002384186 loss: 0.6951409578323364

final results: accuracy: 0.4560000002384186 loss: 0.6951409578323364

Thoughts? I've tried initializing the weights (second answer here: How to do Xavier initialization on TensorFlow), changing the learning rate, epochs, batch size, L2 loss, etc., with no real effect. Any help would be greatly appreciated...

Comments:

Hard to say without access to the data, but the fact that you use sigmoid cross entropy as the loss yet call a softmax on the logits looks suspicious.

@CoryNezin Cheers, I think I pulled that from a tutorial I saw (can't track it down atm..), and kept it only because when I switched the loss to tf.nn.softmax_cross_entropy_with_logits_v2 the results got even worse: accuracy 0.365... loss: 12.5... (!!). The dataset is a fake one, so there's a strong trend for the values in each column to rise as the "score" (between 1 and 10) rises... could it be an overfitting problem? The .csv file is here: mediafire.com/file/6z97ofawfa1yzbs/data.csv

The thing is, I tried the model on a 1998 x 10 retail dataset that I know is fine, and the results were still garbage..

Answer 1:

I think I found the source of the problem: yTrain.resize and yTest.resize were silly both logically and mathematically, and once I replaced them with one-hot encoded arrays (with the help of Convert array of indices to 1-hot encoded numpy array) it all started working much better. Ended up with the same accuracy as sk-learn (I think)!
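For reference, the replacement this answer describes can be sketched in a couple of lines with np.eye (the label values and array names below are illustrative, matching the question's 1-10 scores):

```python
import numpy as np

# Hypothetical labels: integer scores 1..10, as in the question's data.
yvals = np.array([3, 1, 10, 7])

# One-hot encode by indexing an identity matrix
# (scores start at 1, so shift to 0-based first).
one_hot = np.eye(10)[yvals - 1]

print(one_hot.shape)  # (4, 10) - one row per sample, one column per class
```

Unlike resize, which just pads the label array with zeros, each row here has exactly one 1 in the column of the true class, which is what tf.argmax(Y, 1) in the accuracy computation expects.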

Discussion:
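A side note on the loss mismatch flagged in the question's comments: a minimal numpy sketch, using made-up logits for a single 3-class sample, of how softmax cross-entropy (one distribution over mutually exclusive classes) differs from the sigmoid cross-entropy the question's cost uses (independent per-class binary problems):

```python
import numpy as np

# Hypothetical 3-class sample; the true class is 0 (one-hot label).
logits = np.array([2.0, -1.0, 0.5])
labels = np.array([1.0, 0.0, 0.0])

# Softmax cross-entropy: normalize once over all classes.
probs = np.exp(logits) / np.exp(logits).sum()
softmax_ce = -(labels * np.log(probs)).sum()

# Sigmoid cross-entropy (the question's cost): each class is scored as an
# independent yes/no problem, averaged over the class dimension here, using
# the numerically stable form max(x,0) - x*z + log(1 + exp(-|x|)).
sigmoid_ce = np.mean(
    np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
)

print(round(softmax_ce, 3))  # ~0.241
print(round(sigmoid_ce, 3))  # ~0.471
```

For single-label multi-class targets like these 1-10 scores, the softmax form is the conventional choice, which is why mixing a sigmoid loss with a softmax prediction drew attention in the comments.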
