Keras: class weights (class_weight) for one-hot encoding

Posted 2023-03-13

【Title】: Keras: class weights (class_weight) for one-hot encoding 【Posted】: 2017-09-14 20:13:27 【Question】:

I'd like to use the class_weight argument of Keras model.fit to handle imbalanced training data. From the documentation, I understand that we can pass a dictionary like this:

class_weight = {0: 1,
    1: 1,
    2: 5}

(In this example, class-2 gets a higher penalty in the loss function.)

The problem is that my network's output is one-hot encoded, i.e. class-0 = (1, 0, 0), class-1 = (0, 1, 0) and class-2 = (0, 0, 1).

How can we use class_weight with one-hot encoded output?

Looking at some code in Keras, it seems that _feed_output_names contains the list of output classes, but in my case model.output_names/model._feed_output_names returns ['dense_1'].

Related: How to set class weights for imbalanced classes in Keras?

【Comments on the question】:

【Answer 1】:

Here is a shorter, faster solution. If your one-hot encoded y is an np.array:

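import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Recover the integer class index of each one-hot encoded row
y_integers = np.argmax(y, axis=1)
# Recent scikit-learn versions require these arguments to be passed by keyword
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_integers), y=y_integers)
# Map each class index to its weight
d_class_weights = dict(enumerate(class_weights))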

d_class_weights can then be passed to class_weight in .fit.
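To make the 'balanced' heuristic concrete, here is a minimal NumPy-only sketch of the same computation, assuming every class appears at least once in y. The label array is hypothetical data made up for illustration, and the formula n_samples / (n_classes * count_per_class) is the one scikit-learn documents for 'balanced'.

```python
import numpy as np

# Hypothetical one-hot labels: 6 samples of class 0, 3 of class 1, 1 of class 2
y = np.array([[1, 0, 0]] * 6 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 1)

# Integer class index of each row
y_integers = np.argmax(y, axis=1)

# Number of samples per class (assumes no class is empty)
counts = np.bincount(y_integers, minlength=y.shape[1])

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
balanced = len(y_integers) / (y.shape[1] * counts)

# Dictionary keyed by column index, the shape class_weight expects
d_class_weights = dict(enumerate(balanced))
```

The rarest class (column 2) ends up with the largest weight, which is the effect the hand-written dictionary in the question was after.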

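【Comments】:

For me, passing class_weight as a numpy array, without converting it to a dictionary, worked in the model.fit() function.

@tsveti_iko Passing class weights does not work with one-hot encoding: elif isinstance(class_weight, dict): ... else: return np.ones((y.shape[0],), dtype=K.floatx()). You have to give it sample weights instead, e.g. class_weight.compute_sample_weight('balanced', y_train) using sklearn.

【Answer 2】: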

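I think we can use sample_weight instead. Inside Keras, class_weight is actually converted to sample_weight.

sample_weight: optional array of the same length as x, containing weights to apply to the model's loss for each sample. In the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().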

https://github.com/fchollet/keras/blob/d89afdfd82e6e27b850d910890f4a4059ddea331/keras/engine/training.py#L1392
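The conversion described above can be sketched in plain NumPy. This is an illustration of the idea rather than the Keras implementation; the helper name class_to_sample_weights and the label array are hypothetical.

```python
import numpy as np

def class_to_sample_weights(y_onehot, class_weight):
    """Hypothetical helper: look up one weight per sample from a class_weight dict."""
    # The class of each row is the index of its hot column
    y_classes = np.argmax(y_onehot, axis=1)
    return np.array([class_weight[int(c)] for c in y_classes], dtype=float)

# Hypothetical one-hot targets for four samples (classes 0, 2, 1, 2)
y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])
class_weight = {0: 1, 1: 1, 2: 5}

sample_weight = class_to_sample_weights(y, class_weight)
```

The resulting array has one entry per sample, so it has the shape that the sample_weight argument of .fit expects.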

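【Comments】:

How does sample_weight_mode="temporal" help with multi-class one-hot encoded targets?

Do you know how to handle the case where each sample can belong to multiple classes? Thanks

【Answer 3】:

A somewhat convoluted answer, but the best I have found so far. It assumes your data is one-hot encoded and multi-class, and it works only on the labels DataFrame df_y: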

import pandas as pd
import numpy as np

# Create a pd.series that represents the categorical class of each one-hot encoded row
y_classes = df_y.idxmax(1, skipna=False)

from sklearn.preprocessing import LabelEncoder

# Instantiate the label encoder
le = LabelEncoder()

# Fit the label encoder to our label series
le.fit(list(y_classes))

# Create integer based labels Series
y_integers = le.transform(list(y_classes))

# Create dict of labels : integer representation
labels_and_integers = dict(zip(y_classes, y_integers))

from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

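# Recent scikit-learn versions require these arguments to be passed by keyword
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_integers), y=y_integers)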
sample_weights = compute_sample_weight('balanced', y_integers)

class_weights_dict = dict(zip(le.transform(list(le.classes_)), class_weights))

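This produces a sample_weights vector that balances an imbalanced dataset and can be passed to the Keras sample_weight argument, and a class_weights_dict that can be fed to the Keras class_weight argument of the .fit method. You don't really want to use both, so just pick one. I'm using class_weight right now because it's complicated to get sample_weight working with fit_generator.

【Comments】:

For me, passing class_weight as a numpy array, without converting it to a dictionary, worked in the model.fit() function.

【Answer 4】:

In _standardize_weights, Keras does: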

if y.shape[1] > 1:
    y_classes = y.argmax(axis=1)

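So basically, if you choose to use one-hot encoding, the classes are the column indices.

You may also ask yourself how to map the column indices back to the original classes of your data. Well, if you use scikit-learn's LabelBinarizer class to perform the one-hot encoding, the column index follows the order of the unique labels computed by the .fit function. The doc says:

Extract an ordered array of unique labels

Example: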

from sklearn.preprocessing import LabelBinarizer
y = [4, 1, 2, 8]
l = LabelBinarizer()
y_transformed = l.fit_transform(y)
y_transformed
> array([[0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1]])
l.classes_
> array([1, 2, 4, 8])

In conclusion, the keys of the class_weight dictionary should follow the order of the encoder's classes_ attribute.
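As a minimal sketch of that mapping, the snippet below turns weights keyed by original label into a dictionary keyed by column index. It uses np.unique as a stand-in for the encoder's classes_ (both return the sorted unique labels for the example above), and the weight values in weights_by_label are hypothetical.

```python
import numpy as np

# Original labels from the example above
labels = [4, 1, 2, 8]

# Hypothetical weights keyed by original label value
weights_by_label = {1: 1.0, 2: 1.0, 4: 2.0, 8: 5.0}

# Sorted unique labels: array([1, 2, 4, 8]), matching classes_ in the example
classes = np.unique(labels)

# Re-key the weights by column index of the one-hot encoding
class_weight = {idx: weights_by_label[int(label)] for idx, label in enumerate(classes)}
```

With a fitted encoder available, iterating over its classes_ attribute instead of np.unique(labels) would avoid relying on the two orderings agreeing.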

【Comments】:
