Using Scikit-Learn OneHotEncoder with a Pandas DataFrame

【Title】Using Scikit-Learn OneHotEncoder with a Pandas DataFrame 【Posted】2020-01-25 19:22:30 【Question】:

I am trying to use Scikit-Learn's OneHotEncoder to replace a column containing strings in a Pandas DataFrame with its one-hot encoded equivalent. My code below does not work:

from sklearn.preprocessing import OneHotEncoder
# data is a Pandas DataFrame

jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

It produces the following error (the strings in the list are omitted):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-3a1f568322f5> in <module>()
      3 jobs_encoder = OneHotEncoder()
      4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    733 
    734     def inverse_transform(self, X):

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    681 
    682         n_samples, n_features = X_int.shape

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    120                     msg = ("Found unknown categories {0} in column {1}"
    121                            " during transform".format(diff, i))
--> 122                     raise ValueError(msg)
    123                 else:
    124                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform

Here is some sample data:

data['Profession'] =

0         unkn
1         safe
2         rece
3         unkn
4         lead
          ... 
111988    indu
111989    seni
111990    mess
111991    seni
111992    proj
Name: Profession, Length: 111993, dtype: object

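What exactly am I doing wrong?

【Comments】:

Please include the full error traceback, along with a sample of your data['Profession'].

A one-hot encoder returns a 2-D array of size data_length x num_categories. You cannot assign that to the single column df['Profession'].

A follow-up to dd's answer: OneHotEncoder can be used on multi-column data, whereas LabelBinarizer and LabelEncoder cannot. ***.com/a/54119850/1582366

【Answer 1】: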

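OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True, otherwise a 2-D array.

You cannot put a 2-D array (or a sparse matrix) into a single Pandas Series. You have to create one Pandas Series (one column in the Pandas DataFrame) per category.

I would recommend pandas.get_dummies: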

data = pd.get_dummies(data,prefix=['Profession'], columns = ['Profession'], drop_first=True)
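To make the effect of that call concrete, here is a minimal sketch on a toy frame. The `Age` column and the sample values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Toy stand-in for the asker's DataFrame (hypothetical values).
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn"],
    "Age": [31, 45, 27, 52],
})

# One indicator column per category; drop_first=True drops the first
# category in sorted order ("rece" here) to avoid a redundant column.
encoded = pd.get_dummies(
    data, prefix=["Profession"], columns=["Profession"], drop_first=True
)

print(list(encoded.columns))
```

Note that the set of output columns depends entirely on the values present in the frame passed in, so a later frame with different professions would get different columns.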

Edit:

Using Sklearn OneHotEncoder:

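import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on a 2-D, single-column selection: one feature, one row per sample
jobs_encoder = OneHotEncoder()
transformed = jobs_encoder.fit_transform(data[['Profession']])
# Create a Pandas DataFrame of the one-hot encoded column (the result is sparse by default, so densify it; on scikit-learn < 1.0 use get_feature_names())
ohe_df = pd.DataFrame(transformed.toarray(), columns=jobs_encoder.get_feature_names_out(), index=data.index)
# Concatenate with the original data and drop the original column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)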
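It may also help to see why the fit call in the question fails. The sketch below, on made-up toy values, suggests the likely cause: `reshape(1, -1)` produces a single row in which every unique profession is its own feature, whereas the encoder expects one column with one row per sample, i.e. `reshape(-1, 1)`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the asker's column (hypothetical values).
data = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "unkn", "lead"]})

# The question's fit input: ONE sample with one feature per unique value.
wrong_shape = data["Profession"].unique().reshape(1, -1).shape

# What the encoder needs: ONE feature with one row per sample.
column = data["Profession"].to_numpy().reshape(-1, 1)

jobs_encoder = OneHotEncoder()
jobs_encoder.fit(column)
dense = jobs_encoder.transform(column).toarray()  # sparse by default

categories = list(jobs_encoder.categories_[0])
print(wrong_shape, column.shape, dense.shape)
print(categories)
```

With the column-shaped input, the encoder learns all professions as categories of a single feature, and each row of the result contains exactly one 1.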

Another option: if you are tuning hyperparameters with GridSearch, it is advisable to use ColumnTransformer and FeatureUnion together with Pipeline, or to use make_column_transformer directly.
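As a rough illustration of the `make_column_transformer` route, the sketch below encodes one column and passes the rest through. The `Age` column, the sample values, and the unseen category `"pilot"` are all assumptions made for the example. It also round-trips the fitted transformer through `pickle`, which is one way a fitted encoder could be reused on new data later.

```python
import pickle

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and later "new" data.
train = pd.DataFrame({"Profession": ["unkn", "safe", "rece"], "Age": [31, 45, 27]})
new = pd.DataFrame({"Profession": ["safe", "pilot"], "Age": [38, 50]})

# One-hot encode "Profession"; leave every other column untouched.
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)
transformer.fit(train)

# Serialize and restore the fitted transformer, then apply it to new data.
restored = pickle.loads(pickle.dumps(transformer))
result = restored.transform(new)
if hasattr(result, "toarray"):  # the output may be sparse, depending on density
    result = result.toarray()

print(result)
```

Because `handle_unknown="ignore"` is set, the unseen profession is encoded as all zeros instead of raising an error.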

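【Comments】:

I want to be able to pickle the instance so I can use it on new data in the future, which is why I want to use OneHotEncoder. That can't be done with get_dummies, right?

That's right. If you want to use it on new data, you can't use get_dummies.

【Answer 2】: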

It turns out I had better luck converting the data to one-hot encoded format with Scikit-Learn's LabelBinarizer, with help from Amnie's solution. My final code is below:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Profession'])
transformed = jobs_encoder.transform(data['Profession'])
ohe_df = pd.DataFrame(transformed)
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)
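The code above leaves the new columns named 0, 1, 2, and so on. A small variation, sketched here on made-up toy data, takes the column names from the binarizer's `classes_` attribute and reuses the original index so the concat aligns row by row. One caveat: when there are exactly two classes, LabelBinarizer returns a single column, so this naming only works with three or more classes.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the asker's DataFrame (hypothetical values).
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn"],
    "Age": [31, 45, 27, 52],
})

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data["Profession"])

# Name each indicator column after its class and keep the original index.
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.classes_, index=data.index)
data = pd.concat([data.drop(columns=["Profession"]), ohe_df], axis=1)

print(data)
```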

【Comments】:

【Answer 3】:

Here is an approach suggested by Kaggle Learn. I don't think there is currently a simpler way to go from a raw pandas DataFrame to a one-hot encoded DataFrame.

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print(OH_X_train)

【Comments】:

【Answer 4】:

This should do it. Remove the plot part if you are not interested in it. Also change printmd to print if you don't need markdown.

def fn_cat_onehot(df):

    """Generate one-hot encoded features for all categorical columns in df"""

    printmd(f"df shape: {df.shape}")

    # NaN handling
    nan_count = df.isna().sum().sum()
    if nan_count > 0:
        printmd(f"NaN = **{nan_count}** will be categorized under feature_nan columns")

    # generation
    from sklearn.preprocessing import OneHotEncoder

    model_oh = OneHotEncoder(handle_unknown="ignore", sparse=False)
    for c in df.select_dtypes("category").columns:
        printmd(f"Encoding **{c}**")  # which column
        matrix = model_oh.fit_transform(
            df[[c]]
        )  # get a matrix of new features and values
        names = model_oh.get_feature_names_out()  # get names for these features
        df_oh = pd.DataFrame(
            data=matrix, columns=names, index=df.index
        )  # create df of these new features
        display(df_oh.plot.hist())
        df = pd.concat([df, df_oh], axis=1)  # concat with existing df
        df.drop(
            c, axis=1, inplace=True
        )  # drop categorical column so that it is all numerical for modelling

    printmd(f"#### New df shape: **{df.shape}**")
    return df

【Comments】:
