Backpropagation for sigmoid activation and softmax output

Posted: 2018-10-24 12:40:01

Question:

I am trying to build an L-layer neural network for multi-class classification, with softmax activation in the output layer and sigmoid activation in the other layers.

The function used for training looks like this:

import numpy as np
import matplotlib.pyplot as plt


def L_layer_model(X, Y, layers_dims, learning_rate=0.01, num_iterations=5000, print_cost=True):
    """
    Implements an L-layer neural network: [LINEAR->SIGMOID]*(L-1)->LINEAR->SOFTMAX.

    Arguments:
    X -- data, numpy array of shape (number of features, number of examples)
    Y -- true "label" vector of shape (number of classes, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    np.random.seed(1)
    costs = []                         # keep track of cost

    # Parameters initialization.
    parameters = initialize_parameters_deep(layers_dims)
    L = len(parameters) // 2           # number of layers in the neural network
    forward_calculated = {}            # cache of Z and A values for backpropagation
    m = Y.shape[1]                     # number of examples

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> SIGMOID]*(L-1) -> LINEAR -> SOFTMAX.
        A = X
        forward_calculated["A0"] = X
        for l in range(1, L+1):
            A_prev = A
            #print(A_prev)

            W = parameters['W' + str(l)]
            b = parameters['b' + str(l)]
            #print("W.shape: "+str(W.shape))
            #print("A_prev.shape: "+str(A_prev.shape))
            #print("b.shape: "+str(b.shape))
            Z = np.dot(W, A_prev) + b
            #Z = np.matmul(W, A_prev) + b

            assert(Z.shape == (W.shape[0], A.shape[1]))
            forward_calculated["Z" + str(l)] = Z   # store for future use

            if l != L:  # except the last layer
                A = sigmoid(Z)
            else:
                A = softmax(Z)

            #print("A is a tuple: ", end='')
            #print(isinstance(A, tuple))

            forward_calculated["A" + str(l)] = A   # store for future use

        assert(forward_calculated["A" + str(L)].shape == (NUMBER_OF_CLASSES, X.shape[1]))


        # Compute cost.
        Y_hat = forward_calculated["A" + str(L)]
        cost = compute_multiclass_loss(Y, Y_hat)

        #cost = compute_cost(AL, Y)

        # Backpropagation.
        grads = {}
        grads['dZ' + str(L)] = forward_calculated["A" + str(L)] - Y
        grads['dW' + str(L)] = (1./m) * np.dot(grads['dZ' + str(L)], forward_calculated["A" + str(L-1)].T)
        grads['db' + str(L)] = (1./m) * np.sum(grads['dZ' + str(L)], axis=1, keepdims=True)

        for l in range(L-1, 0, -1):
            grads['dA' + str(l)] = np.dot(parameters["W" + str(l+1)].T, grads['dZ' + str(l+1)])
            #dA1 = np.matmul(W2.T, dZ2)
            grads['dZ' + str(l)] = grads['dA' + str(l)] * sigmoid(forward_calculated["Z" + str(l)]) * (1 - sigmoid(forward_calculated["Z" + str(l)]))
            #dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
            grads['dW' + str(l)] = (1./m) * np.dot(grads['dZ' + str(l)], forward_calculated["A" + str(l-1)].T)
            #dW1 = (1./m) * np.matmul(dZ1, X.T)
            grads['db' + str(l)] = (1./m) * np.sum(grads['dZ' + str(l)], axis=1, keepdims=True)
            #db1 = (1./m) * np.sum(dZ1, axis=1, keepdims=True)

        # Update parameters.
        for l in range(1, L+1):
            #print("grads[dW]: " + str(grads["dW" + str(l)]));
            parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
            #print("grads[db]: " + str(grads["db" + str(l)]));
            parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

        # Print and record the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    print(costs)
    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
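The question does not show the helpers this function depends on. For completeness, here is a minimal sketch of what sigmoid, softmax and initialize_parameters_deep are assumed to look like, consistent with the (features, examples) column layout used above; the asker's actual versions may differ. The code also assumes a global NUMBER_OF_CLASSES equal to the output layer size.

import numpy as np

def sigmoid(Z):
    # Element-wise logistic function.
    return 1.0 / (1.0 + np.exp(-Z))

def softmax(Z):
    # Column-wise softmax for Z of shape (number of classes, number of examples).
    e = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # shift for numerical stability
    return e / np.sum(e, axis=0, keepdims=True)

def initialize_parameters_deep(layers_dims):
    # layers_dims = [n_x, n_h1, ..., n_y]
    parameters = {}
    for l in range(1, len(layers_dims)):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters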

The code works fine when I have only one hidden layer and the model gradually converges. However, with more than one hidden layer the model does not seem to converge: it predicts the same class for every example. Is there a mistake in my backpropagation formulas? The cost function I am using is log loss:

def compute_multiclass_loss(Y, Y_hat):   # Y -> actual, Y_hat -> predicted

    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1/m) * L_sum

    L = np.squeeze(L)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(L.shape == ())
    return L
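As a quick sanity check of this loss on toy data (hypothetical values, chosen only to show the expected shapes and output):

import numpy as np

# 3 classes, 2 examples: columns of Y are one-hot, columns of Y_hat sum to 1.
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])
Y_hat = np.array([[0.7, 0.2],
                  [0.2, 0.5],
                  [0.1, 0.3]])

print(compute_multiclass_loss(Y, Y_hat))   # -(log(0.7) + log(0.5)) / 2 ≈ 0.525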

In short, my question is: are these formulas correct (log loss with softmax activation at the output and sigmoid activation in the other layers)?
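For reference, the standard chain-rule results for this architecture (softmax output trained with cross-entropy loss, sigmoid hidden layers), using the same notation as the code and \(\odot\) for element-wise multiplication, are:

\[
\begin{aligned}
dZ^{[L]} &= A^{[L]} - Y \\
dW^{[l]} &= \tfrac{1}{m}\, dZ^{[l]}\,(A^{[l-1]})^{T} \\
db^{[l]} &= \tfrac{1}{m} \sum_{i=1}^{m} dZ^{[l]}_{:,\,i} \\
dA^{[l]} &= (W^{[l+1]})^{T}\, dZ^{[l+1]} \\
dZ^{[l]} &= dA^{[l]} \odot \sigma(Z^{[l]}) \odot \bigl(1 - \sigma(Z^{[l]})\bigr) \qquad (l < L)
\end{aligned}
\]

which appears to be exactly what the code above computes.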


Answer 1:

The code looks fine; however, this is a conceptual problem known as vanishing gradients.

In a deep network, the closer a layer is to the input, the more sigmoid derivatives get multiplied together when its gradient is computed.

The derivative of the sigmoid has a maximum value of 0.25, and it is usually much smaller than that; values around 0.001 are common. As more of these small factors pile up in the product, the gradient shrinks drastically.
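To make the scale concrete, here is a toy calculation (not the actual gradients of the network above): even in the best case each sigmoid layer contributes a factor of at most 0.25 to the chain-rule product, and for pre-activations away from zero the factor is far smaller.

import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Best-case upper bound on the product of sigmoid derivatives across layers.
for depth in (1, 3, 5, 10):
    print(depth, 0.25 ** depth)        # 0.25, 0.0156, 0.00098, 9.5e-07

# Typical factors when pre-activations are away from zero.
print(sigmoid_derivative(4.0))         # ~0.0177
print(sigmoid_derivative(8.0))         # ~0.00034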

ReLU mitigates this problem to a large extent: its derivative is either 0 or 1, so if the gradient still vanishes it is due to the weights, not the activation.

So, use ReLU instead of sigmoid in the hidden layers.
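A minimal sketch of that change, reusing the structure of the code in the question (relu and relu_backward are illustrative helpers, not functions the question already defines):

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def relu_backward(dA, Z):
    # The derivative of ReLU is 1 where Z > 0 and 0 elsewhere.
    return dA * (Z > 0)

# Forward pass, hidden layers only (the output layer keeps softmax):
#     A = relu(Z)                      # instead of A = sigmoid(Z)
# Backward pass, hidden layers:
#     grads['dZ' + str(l)] = relu_backward(grads['dA' + str(l)],
#                                          forward_calculated["Z" + str(l)])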

This chapter of Michael Nielsen's book explains the issue in depth, with the calculus worked out:

http://neuralnetworksanddeeplearning.com/chap5.html#the_vanishing_gradient_problem

【讨论】:
