Threshold value for sigmoid nonlinearity for multilabel classification

Posted 2023-03-12

【Title】Threshold value for sigmoid nonlinearity for multilabel classification 【Posted】2018-11-13 03:31:28 【Question】:

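I am trying to classify X-ray images from https://www.kaggle.com/nih-chest-xrays/data using a DenseNet architecture. The model produces a binary label vector in which each label indicates the presence or absence of one of 14 possible pathologies: atelectasis, cardiomegaly, consolidation, edema, effusion, emphysema, fibrosis, hernia, infiltration, mass, nodule, pleural thickening, pneumonia, and pneumothorax. For example, a healthy patient has the label [0,0,0,0,0,0,0,0,0,0,0,0,0,0], while a patient with edema and effusion has the label [0,0,0,1,1,0,0,0,0,0,0,0,0,0]. I built the model in TensorFlow, and since this is a multilabel classification problem, the cost function I use is tf.reduce_mean(tf.losses.sigmoid_cross_entropy(labels, logits)), minimized with AdamOptimizer. However, when I check the sigmoid outputs, the values are all below 0.5, so tf.round(logits) produces zeros for every prediction. The actual logits differ across inputs and are nonzero after 10000 iterations, so I don't think vanishing gradients are the problem. I have two questions: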

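1. Is this problem caused by an incorrect implementation of the model?
2. If I lower the threshold on the sigmoid output from 0.5 to 0.25 to improve the model's accuracy, would I be "cheating"?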

Thanks.

Here is the code for the model:

import tensorflow as tf

# `layers`, `db_121` and `db_169` are the asker's own helpers and are not shown
# in the post; `db_*` are presumably the per-block layer counts.

def DenseNet(features, labels, mode, params):

    depth = params["depth"]
    k = params["growth"]

    if depth == 121:
        N = db_121
    else:
        N = db_169

    bottleneck_output = 4 * k

    # before entering the first dense block, a conv operation with 2 * k output
    # channels is performed on the input images

    with tf.variable_scope('input_layer'):
        #l = tf.reshape(features, [-1, 224, 224, 1])
        feature_maps = 2 * k
        l = layers.conv(features, filter_size = 7, stride = 2, out_chn = feature_maps)
        l = tf.nn.max_pool(l,
                           padding='SAME',
                           ksize=[1,3,3,1],
                           strides=[1,2,2,1],
                           name='max_pool')

    # each block is defined as a dense block + transition layer
    # (scope names need a '{}' placeholder, otherwise .format(i+1) has no effect)
    with tf.variable_scope('block1'):
        for i in range(N[0]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)
        l = layers.transition_layer('transition1', l)

    with tf.variable_scope('block2'):
        for i in range(N[1]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)
        l = layers.transition_layer('transition2', l)

    with tf.variable_scope('block3'):
        for i in range(N[2]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)
        l = layers.transition_layer('transition3', l)

    # the last block does not have a transition layer
    with tf.variable_scope('block4'):
        for i in range(N[3]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)

    # classification head (pooling, then two dense layers producing 14 logits)
    with tf.name_scope('classification'):
        l = layers.batch_norm('BN', l)
        l = tf.nn.relu(l, name='relu')
        l = layers.pooling(l, filter_size = 7)
        l_shape = l.get_shape().as_list()
        l = tf.reshape(l, [-1, l_shape[1] * l_shape[2] * l_shape[3]])
        l = tf.layers.dense(l, units = 1000, activation = tf.nn.relu, name='fc1', kernel_initializer=tf.contrib.layers.xavier_initializer())
        output = tf.layers.dense(l, units = 14, name='fc2', kernel_initializer=tf.contrib.layers.xavier_initializer()) # [batch_size, 14]

    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=output) # cost function
    cost = tf.reduce_mean(cross_entropy, name='cost_fn')

【Comments】:

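I think you're on the right track, but you may be thinking about the problem the wrong way. It's probably the case that positives (1s) are much rarer than negatives (0s). Given your loss function, think about what that might drive a softmax layer to do (intuitively, would a model do better guessing all 1s or all 0s?). Think about precision, recall, and what you really want the model to do. Happy to write a full answer if this doesn't steer you in the right direction.

@PeterBarrettBryan Thanks for your suggestion. You're absolutely right: "No Finding" makes up more than half of the dataset, so it makes sense that the model outputs 0 (or values close to 0) to minimize the cost function. Would it then be better to optimize a weighted cross-entropy loss? I don't have a statistics background, so I'm not sure what the best practice is...

I think the answer I posted should point you in the right direction for this kind of multilabel problem (which may be complicated by interrelationships among the predicted labels). Let me know if it doesn't help and I'll be happy to edit!

【Answer 1】: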

Theophylline! First, let me repeat the comment I left, in case this answer ends up being useful to you (and possibly others) in the long run:

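I think you're on the right track, but you may be thinking about the problem the wrong way. It's probably the case that positives (1s) are much rarer than negatives (0s). Given your loss function, think about what that might drive a softmax layer to do (intuitively, would a model do better guessing all 1s or all 0s?). Think about precision, recall, and what you really want the model to do. Happy to write a full answer if this doesn't steer you in the right direction.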
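To make the intuition in that comment concrete, here is a minimal sketch in plain Python (not the asker's TensorFlow code). It assumes a single label column with a hypothetical 10% positive rate and asks which constant prediction minimizes the mean sigmoid cross-entropy. The names `mean_bce` and `labels` are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_bce(logit, labels):
    """Mean sigmoid cross-entropy when every example receives the same logit."""
    p = sigmoid(logit)
    total = sum(z * math.log(p) + (1 - z) * math.log(1 - p) for z in labels)
    return -total / len(labels)

# Hypothetical label column: 1 positive out of 10 examples.
labels = [1] + [0] * 9

# Scan constant logits from -5.00 to 5.00 and keep the one with the lowest loss.
candidates = [i / 100.0 for i in range(-500, 501)]
best_logit = min(candidates, key=lambda x: mean_bce(x, labels))
best_prob = sigmoid(best_logit)

print(best_logit, best_prob)
```

The loss-minimizing constant sits near the class prevalence, far below 0.5, so a model with little signal for a rare label will tend to output probabilities under the default threshold without any implementation bug.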

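Your problem is a bit tricky because I don't know the full context of how the predicted values relate to one another (whether the predicted classes are independent, strongly dependent, etc.). In addition, you'll have to make the call on the relative value of precision and recall (do you think false positives are worse? False negatives? Are they equally bad?). For a first pass, I think it may be worth trying weighted_cross_entropy_with_logits. It lets you bias the model toward positive or negative judgments according to the heuristic guiding your precision-recall decision (on medical data, I would think false negatives are a very bad thing).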
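As a rough illustration of how a positive-class weight shifts the loss, the sketch below reimplements the weighted sigmoid cross-entropy for a single logit in plain Python. The negatives-to-positives ratio used to pick the weight is one common heuristic and an assumption here, not something the answer prescribes; `weighted_bce`, `pos_weights_from_labels`, and the toy labels are hypothetical.

```python
import math

def weighted_bce(logit, label, pos_weight):
    """Weighted sigmoid cross-entropy for one logit.
    pos_weight scales only the positive-label term."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))

def pos_weights_from_labels(label_rows):
    """Per-class negatives/positives ratio (an assumed heuristic)."""
    n = len(label_rows)
    weights = []
    for c in range(len(label_rows[0])):
        pos = sum(row[c] for row in label_rows)
        weights.append((n - pos) / pos if pos else 1.0)
    return weights

# Hypothetical toy labels: 4 examples, 2 classes.
toy = [[1, 0], [0, 0], [0, 1], [0, 1]]
weights = pos_weights_from_labels(toy)  # class 0: 3 neg / 1 pos, class 1: 2 / 2

# A confidently missed positive (logit -2.0, true label 1), with and without weighting.
missed_plain = weighted_bce(-2.0, 1, 1.0)
missed_weighted = weighted_bce(-2.0, 1, weights[0])

# A correctly rejected negative is not affected by pos_weight.
negative_loss = weighted_bce(-2.0, 0, weights[0])
```

With a weight above 1, false negatives cost more than false positives, which pushes the sigmoid outputs for rare labels upward during training.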

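This answer is based on a 1000-foot view of your problem, so I'm happy to revise it if it doesn't fit your situation! If you're after pure accuracy (at the expense of the precision/recall balance), it may be worth trying to show that the class frequencies in the training set approximate those in the test set (and then weighting the individual predictions to match). Your thresholding idea holds up as long as you implement it carefully (don't share frequency information between training and test, etc.).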
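One way to lower the threshold without leaking test information is to choose it on a held-out validation split and then freeze it before touching the test set. The following is a minimal sketch under that assumption; the recall target, the helper names, and the validation numbers are all hypothetical.

```python
def recall_at(threshold, probs, labels):
    """Recall when predicting positive for every prob >= threshold."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    hits = sum(1 for p, z in zip(probs, labels) if z == 1 and p >= threshold)
    return hits / positives

def pick_threshold(probs, labels, min_recall):
    """Highest candidate threshold whose validation recall meets min_recall."""
    for t in sorted(set(probs), reverse=True):
        if recall_at(t, probs, labels) >= min_recall:
            return t
    return 0.0

# Hypothetical validation-set sigmoid outputs and labels for one pathology.
val_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]
val_labels = [0, 0, 0, 1, 0, 1]

default_recall = recall_at(0.5, val_probs, val_labels)  # nothing reaches 0.5
threshold = pick_threshold(val_probs, val_labels, min_recall=1.0)

# The frozen threshold is then applied to test-set probabilities as-is.
test_probs = [0.12, 0.33, 0.41]
test_preds = [int(p >= threshold) for p in test_probs]
```

In a 14-label setting the same procedure would be repeated per label, since each pathology has its own prevalence.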

Edit: in case it isn't clear from the documentation, this section should help guide you in building a custom loss function if appropriate!

With x = logits, z = labels, and q = pos_weight, the loss is:

  qz * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
= qz * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
= qz * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
= qz * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
= (1 - z) * x + (qz +  1 - z) * log(1 + exp(-x))
= (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))

Setting l = (1 + (q - 1) * z), to ensure stability and avoid overflow, the implementation uses

(1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
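The two forms can be checked against each other numerically. This is a small plain-Python sketch, not TensorFlow's actual implementation; `naive_loss` and `stable_loss` are illustrative names that transcribe the first and last expressions above.

```python
import math

def naive_loss(x, z, q):
    """First line of the derivation: direct use of sigmoid and log."""
    s = 1.0 / (1.0 + math.exp(-x))
    return q * z * -math.log(s) + (1 - z) * -math.log(1 - s)

def stable_loss(x, z, q):
    """Final rewritten form, which never exponentiates a large positive number."""
    l = 1 + (q - 1) * z
    return (1 - z) * x + l * (math.log1p(math.exp(-abs(x))) + max(-x, 0.0))

# The two forms agree on moderate logits.
for x, z in [(0.7, 1), (-1.3, 0), (2.5, 0), (-0.4, 1)]:
    assert abs(naive_loss(x, z, 3.0) - stable_loss(x, z, 3.0)) < 1e-9

# On an extreme logit the naive form overflows in exp(1000), the stable one does not.
try:
    naive_loss(-1000.0, 1, 2.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

extreme = stable_loss(-1000.0, 1, 2.0)
```

A custom loss built along these lines would keep the stable form and only change how the weight l is derived from the labels.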
