Python - Pandas, Resample dataset to have balanced classes

Posted 2023-03-12

【Title】 Python - Pandas, Resample dataset to have balanced classes 【Posted】 2019-03-15 01:50:11 【Question】:

Given the following DataFrame, which has only 2 possible labels:

   name  f1  f2  label
0     A   8   9      1
1     A   5   3      1
2     B   8   9      0
3     C   9   2      0
4     C   8   1      0
5     C   9   1      0
6     D   2   1      0
7     D   9   7      0
8     D   3   1      0
9     E   5   1      1
10    E   3   6      1
11    E   7   1      1

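I wrote code that groups the data by the 'name' column and converts the result into a numpy array, so that each row holds all samples of one group (padded with zeros up to the size of the largest group), while the labels go into a separate numpy array:

Data:

[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
[[2 1] [9 7] [3 1]] # D label = 0
[[5 1] [3 6] [7 1]] # E label = 1

Labels: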

[[1]
 [0]
 [0]
 [0]
 [1]]

Code:

import pandas as pd
import numpy as np


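def prepare_data(group_name):
    df = pd.read_csv("../data/tmp.csv")

    # Number the rows inside each group (0, 1, 2, ...) so that unstack can pad
    # shorter groups with zeros up to the size of the largest group.
    group_index = df.groupby(group_name).cumcount()
    data = (df.set_index([group_name, group_index])
            .unstack(fill_value=0).stack())

    # One label per group: take the label of the group's first row.
    target = np.array(data['label'].groupby(level=0).apply(lambda x: [x.values[0]]).tolist())
    # Everything except the label column is feature data, one 2-D block per group.
    data = data.loc[:, data.columns != 'label']
    data = np.array(data.groupby(level=0).apply(lambda x: x.values.tolist()).tolist())
    print(data)
    print(target)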


prepare_data('name')

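I want to resample by deleting instances from the over-represented class.

That is,

[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
# group D was deleted randomly from the '0' labels
[[5 1] [3 6] [7 1]] # E label = 1

would be an acceptable solution, because removing D (labeled '0') yields a balanced dataset with 2 groups of label '1' and 2 groups of label '0'.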
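
The balance described here is counted in groups (names), not in rows. As a minimal sketch, assuming the table above is re-typed as a DataFrame literal (the variable name groups_per_label is only illustrative), the number of distinct names per label can be checked like this:

```python
import pandas as pd

# The question's table, re-typed by hand for illustration.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Number of distinct groups (names) that carry each label.
groups_per_label = df.groupby("label")["name"].nunique()
print(groups_per_label)
```

For this data it reports three names with label 0 (B, C, D) and two with label 1 (A, E), which is why dropping one label-0 group balances the set.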

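【Comments】:

The imbalanced-learn package has good over-sampling and under-sampling utilities.

I don't understand why D was deleted. Can you define "over-represented"?

D was chosen at random, only because it has class '0'; deleting one '0' sample gives a balanced dataset of 2 (1) and 2 (0).

Do name and label always match? For example, could there be one row with name = A and label = 0 and another row with name = A and label = 1?

No, each name has a single label; for example, A can only be 1 or 0, never both.

【Answer 1】: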

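A very simple approach, taken from the sklearn documentation and Kaggle. Note that it upsamples the minority class rather than removing rows from the majority class.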

import pandas as pd
from sklearn.utils import resample

# Label 0 is the majority class in the question's data, label 1 the minority.
df_majority = df[df.label==0]
df_minority = df[df.label==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.label.value_counts()

【Answer 2】:

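If each name is labeled by exactly one label (for example, every A is 1), you can group the names by label, randomly pick as many names from the over-represented label as it has in excess, and select the part of the DataFrame that does not contain those names.

The code is as follows: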

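import numpy as np

# For each label, the array of unique names that carry it.
labels = df.groupby('label')['name'].unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
# Number of surplus names in the over-represented class.
excess = len(labels.iloc[0]) - len(labels.iloc[1])
# Pick that many names at random and drop all of their rows.
remove = np.random.choice(labels.iloc[0], excess, replace=False)
df2 = df[~df['name'].isin(remove)]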
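
The same group-level idea can be sketched as a seeded, self-contained function. This is only an illustration: drop_excess_groups and its parameters are hypothetical names, the DataFrame literal re-types the question's table, and the function assumes every name has exactly one label. Unlike the snippet above, it keeps an equal number of names per label, so it is not limited to two labels.

```python
import numpy as np
import pandas as pd


def drop_excess_groups(df, group_col="name", label_col="label", seed=0):
    """Keep the same number of randomly chosen groups for every label."""
    rng = np.random.default_rng(seed)
    # One entry per group: group name -> its label (assumes one label per group).
    group_labels = df.groupby(group_col)[label_col].first()
    # The rarest label decides how many groups each label may keep.
    target = group_labels.value_counts().min()
    keep = []
    for _, names in group_labels.groupby(group_labels):
        chosen = rng.choice(names.index.to_numpy(), size=target, replace=False)
        keep.extend(chosen)
    # Keep every row of the chosen groups, so no group is split.
    return df[df[group_col].isin(keep)]


# The question's table, re-typed by hand for illustration.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

balanced = drop_excess_groups(df, seed=0)
print(balanced.groupby("label")["name"].nunique())
```

On the question's data this keeps A and E (the only label-1 names) plus two of B, C and D; which two depends on the seed.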


【Answer 3】:

With imbalanced-learn (pip install imbalanced-learn), this is straightforward:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy='not minority', random_state=1)
df_balanced, balanced_labels = rus.fit_resample(df, df['label'])

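There are many methods besides RandomUnderSampler; reading the documentation is recommended.

【Comments】:

Note: install with pip install imbalanced-learn (the imblearn package is an abandoned project and itself advises: "use imbalanced-learn instead").

【Answer 4】:

You can also sample from the majority class based on the size of the minority class: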


import pandas as pd

### Separate the majority and minority classes
df_minority = df[df['label']==1]
df_majority = df[df['label']==0]

### Now, downsample the majority class to the number of samples in the minority class
df_majority = df_majority.sample(len(df_minority), random_state=0)

### Concatenate the majority and minority dataframes
df = pd.concat([df_majority,df_minority])

## Shuffle the dataset to prevent the model from getting biased by similar samples
df = df.sample(frac=1, random_state=0)
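
One caveat for the question's data: this samples individual rows, so it balances row counts but can keep only part of a group. The sketch below (the DataFrame literal re-types the question's table, and the sizes and partial variables are illustrative names) compares the rows kept per name with the rows originally present.

```python
import pandas as pd

# The question's table, re-typed by hand for illustration.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Row-level downsampling of the majority class (label 0).
minority = df[df["label"] == 1]
majority = df[df["label"] == 0].sample(len(minority), random_state=0)
balanced = pd.concat([majority, minority])

# Rows per name before and after; names that vanished entirely get 0.
sizes = pd.DataFrame({
    "before": df.groupby("name").size(),
    "after": balanced.groupby("name").size(),
}).fillna(0).astype(int)

# Groups that lost some, but not all, of their rows.
partial = sizes[(sizes["after"] < sizes["before"]) & (sizes["after"] > 0)]
print(sizes)
print(partial)
```

Here five of the seven label-0 rows are kept, and since B has one row while C and D have three each, at least one of C or D always ends up with only some of its rows. If whole groups must stay intact, as in the question, sampling names instead of rows avoids this.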


【Answer 5】:

You can resample using the grouped representation.

import pandas as pd


def balance_df(frame: pd.DataFrame, col: str, upsample_minority: bool):
    grouped = frame.groupby(col)
    # Upsampling targets the largest class size, downsampling the smallest.
    n_samp = {
        True: grouped.size().max(),
        False: grouped.size().min(),
    }[upsample_minority]

    # Sampling with replacement is only needed when upsampling.
    fun = lambda x: x.sample(n_samp, replace=upsample_minority)
    balanced = grouped.apply(fun)
    balanced = balanced.reset_index(drop=True)
    return balanced
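
A variant of the same idea, sketched under the assumption that pandas 1.1 or newer is installed: DataFrameGroupBy.sample draws the rows per group directly, so no apply with a lambda is needed. The function name balance_with_groupby_sample and the seed parameter are illustrative additions, not part of the answer above, and the DataFrame literal re-types the question's table.

```python
import pandas as pd


def balance_with_groupby_sample(frame, col, upsample_minority, seed=None):
    """Give every class in `col` the same number of rows."""
    sizes = frame.groupby(col).size()
    # Upsampling targets the largest class size, downsampling the smallest.
    n_samp = int(sizes.max() if upsample_minority else sizes.min())
    balanced = frame.groupby(col).sample(
        n=n_samp, replace=upsample_minority, random_state=seed
    )
    return balanced.reset_index(drop=True)


# The question's table, re-typed by hand for illustration.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

down = balance_with_groupby_sample(df, "label", upsample_minority=False, seed=0)
up = balance_with_groupby_sample(df, "label", upsample_minority=True, seed=0)
print(down["label"].value_counts())
print(up["label"].value_counts())
```

With the question's data, downsampling leaves five rows per label and upsampling gives seven per label. Like the function above, this works on rows, not on whole name groups.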
