How to consistently one-hot encode dataframes with changing values?

Posted 2023-03-12

【Original title】: How to consistently hot encode dataframes with changing values? 【Posted】: 2018-06-10 13:39:46 【Question】:

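I am receiving a stream of content as dataframes, and each batch has different values in its columns. For example, one batch might look like this:

day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
             'city': ['C', 'B', 'G', 'Z', 'F'],
             'age': [27, 19, 63, 40, 93]}

And another like this:

day2_data = {'state': ['AL', 'WY', 'VA'],
             'city': ['A', 'B', 'E'],
             'age': [42, 52, 73]}

How can I one-hot encode the columns so that every batch returns the same number of columns?

If I call pandas' get_dummies() on each batch, it returns a different number of columns: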

df1 = pd.get_dummies(pd.DataFrame(day1_data))
df2 = pd.get_dummies(pd.DataFrame(day2_data))

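len(df1.columns) == len(df2.columns)  # False: 11 columns vs. 7

I can obtain all possible values for each column. The question is: even with that information, what is the simplest way to one-hot encode each daily batch so that the number of columns stays consistent?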
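For illustration only, here is one pandas-only sketch of that idea. It is not part of the original question: the `ALL_VALUES` mapping and the `encode_batch` helper are hypothetical names, and the value lists are assumed to be complete. Declaring the known values as categories makes `get_dummies()` emit a column for every category, whether or not it occurs in the batch.

```python
import pandas as pd

# Assumption: the complete list of values each column can take is known.
ALL_VALUES = {
    "state": ["AL", "CD", "MS", "NJ", "NM", "OK", "VA", "WY"],
    "city": ["A", "B", "C", "D", "E", "F", "G", "Z"],
}


def encode_batch(df):
    """One-hot encode a batch so that every batch yields the same columns."""
    df = df.copy()
    for col, values in ALL_VALUES.items():
        # Fixed categories: get_dummies creates one column per category,
        # including categories that do not occur in this batch.
        df[col] = df[col].astype(pd.CategoricalDtype(categories=values))
    return pd.get_dummies(df, columns=list(ALL_VALUES))


day1 = pd.DataFrame({"state": ["MS", "OK", "VA", "NJ", "NM"],
                     "city": ["C", "B", "G", "Z", "F"],
                     "age": [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({"state": ["AL", "WY", "VA"],
                     "city": ["A", "B", "E"],
                     "age": [42, 52, 73]})

enc1 = encode_batch(day1)
enc2 = encode_batch(day2)
print(list(enc1.columns) == list(enc2.columns))  # True
```

A value that is missing from the declared lists becomes NaN after the cast and is encoded as an all-zero row, so the lists need to be kept complete for this to be reliable.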

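【Comments】:

Both of your data sources have the same columns: age, city, and state. Is that always the case? If not, please provide a more realistic example with differing columns.

Interesting question. Do you know in advance all the values a particular column might contain?

Why not just concatenate them and then call get_dummies?

@akilat90 Yes, all values are known in advance. At the moment I am using something very hacky: I look at the columns of a synthetic dataframe that I generate with all the values, and combine its get_dummies output with that of any new dataframe. It is very ugly, and I am wondering what a better way to do this would be.

@cᴏʟᴅsᴘᴇᴇᴅ Because there is no guarantee that the dataframes received so far contain every value that may show up in the future, and the chance of missing values is high during the first period (roughly the first month).

【Answer 1】: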

OK, since all possible values are known in advance, here is a somewhat hackish approach.

import numpy as np
import pandas as pd

# This is a one time process
# Keep all the possible data here in lists
# Can add other categorical variables too which have this type of data
all_possible_states = ['AL', 'MS', 'OK', 'VA', 'NJ', 'NM', 'CD', 'WY']
all_possible_cities = ['A', 'B', 'C', 'D', 'E', 'G', 'Z', 'F']

# Declare our transformer class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

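class MyOneHotEncoder(BaseEstimator, TransformerMixin):
    """One-hot encoder fitted up front on every possible value of one column."""

    def __init__(self, all_possible_values):
        # Stored so that BaseEstimator.get_params() and repr() work
        self.all_possible_values = all_possible_values
        # LabelEncoder maps strings to integers; OneHotEncoder expands them
        self.le = LabelEncoder()
        self.ohe = OneHotEncoder()
        self.ohe.fit(self.le.fit_transform(all_possible_values).reshape(-1, 1))

    def fit(self, X, y=None):
        # Already fitted in __init__; defined so that fit_transform() works
        return self

    def transform(self, X, y=None):
        # Returns a dense array with one column per known value
        return self.ohe.transform(self.le.transform(X).reshape(-1, 1)).toarray()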

# Allow the transformer to see all the data here
encoders = {}
encoders['state'] = MyOneHotEncoder(all_possible_states)
encoders['city'] = MyOneHotEncoder(all_possible_cities)
# Do this for all categorical columns

# Now this is our method which will be used on the incoming data 
def encode(df):

    tup = (encoders['state'].transform(df['state']), 
           encoders['city'].transform(df['city']),
           # Add all other columns which are not to be transformed
           df[['age']])

    return np.hstack(tup)

# Testing:
day1_data = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
                          'city': ['C', 'B', 'G', 'Z', 'F'],
                          'age': [27, 19, 63, 40, 93]})

print(encode(day1_data))
[[  0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.
    0.   0.  27.]
 [  0.   0.   0.   0.   0.   1.   0.   0.   0.   1.   0.   0.   0.   0.
    0.   0.  19.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    1.   0.  63.]
 [  0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   1.  40.]
 [  0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   1.
    0.   0.  93.]]


day2_data = pd.DataFrame({'state': ['AL', 'WY', 'VA'],
                          'city': ['A', 'B', 'E'],
                          'age': [42, 52, 73]})

print(encode(day2_data))
[[  1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.
    0.   0.  42.]
 [  0.   0.   0.   0.   0.   0.   0.   1.   0.   1.   0.   0.   0.   0.
    0.   0.  52.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   1.   0.
    0.   0.  73.]]

Please check the comments in the code, and ask me if anything is still unclear.
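As a possible variant of the wrapper class above (a sketch, not the answerer's code): scikit-learn 0.20 and later let `OneHotEncoder` take the known values directly through its `categories` parameter, so the `LabelEncoder` step may not be needed. The names `ohe`, `cat_cols`, and `encode_batch` are illustrative, and `handle_unknown="ignore"` is an assumption about how unexpected values should be treated (they are encoded as all zeros).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Assumption: complete, duplicate-free lists of known values per column.
all_possible_states = ["AL", "MS", "OK", "VA", "NJ", "NM", "CD", "WY"]
all_possible_cities = ["A", "B", "C", "D", "E", "G", "Z", "F"]
cat_cols = ["state", "city"]

# One category list per input column, in the same order as cat_cols.
# Sorting reproduces the column order that LabelEncoder would give.
ohe = OneHotEncoder(
    categories=[sorted(set(all_possible_states)),
                sorted(set(all_possible_cities))],
    handle_unknown="ignore",
)
# fit() still has to be called once, but with explicit categories the
# data passed here does not determine the output columns.
ohe.fit(np.array([[all_possible_states[0], all_possible_cities[0]]],
                 dtype=object))


def encode_batch(df):
    """Return the one-hot columns followed by the untouched 'age' column."""
    onehot = ohe.transform(df[cat_cols].to_numpy(dtype=object)).toarray()
    return np.hstack([onehot, df[["age"]].to_numpy()])


day1 = pd.DataFrame({"state": ["MS", "OK", "VA", "NJ", "NM"],
                     "city": ["C", "B", "G", "Z", "F"],
                     "age": [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({"state": ["AL", "WY", "VA"],
                     "city": ["A", "B", "E"],
                     "age": [42, 52, 73]})

out1 = encode_batch(day1)
out2 = encode_batch(day2)
print(out1.shape, out2.shape)  # (5, 17) (3, 17)
```

Both batches come out 17 columns wide (8 states, 8 cities, plus age), in the same column order as the printed arrays above.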
