How to encode a pandas.DataFrame column containing lists using sklearn.preprocessing

Posted 2023-03-12

【Title】How to encode a pandas.DataFrame column containing lists using sklearn.preprocessing 【Posted】2019-11-05 08:33:03 【Question】:

I have a pandas DataFrame in which some columns hold lists of data, and I want to encode the labels inside those lists.

I get this error: ValueError: Expected 2D array, got 1D array instead:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

mins = pd.read_csv('recipes.csv')

enc = OneHotEncoder(handle_unknown='ignore')

# X is a single column (a 1D Series) whose cells are lists
X = mins['Ingredients']

'''
[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
...
[flour, tomatoes, vodka, vodka, mustard]]
'''

# This call raises the ValueError quoted above
enc.fit(X)

I expect to get a column of lists holding the correctly encoded values:

[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
...
[flour, tomatoes, vodka, vodka, mustard]]

[[0, 1, 2, 3, 1]
[0, 1, 4, 3, 1]
...
[4, 1, 3, 3, 9]]
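
A minimal sketch of where the error comes from, using a hypothetical five-item Series rather than the actual recipes.csv. OneHotEncoder expects 2D input of shape (n_samples, n_features), so a single 1D column is rejected with a ValueError (the exact wording depends on the scikit-learn version). LabelEncoder, by contrast, accepts 1D input and returns one integer per label.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical stand-in for one flat column of labels.
labels = pd.Series(
    ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"], name="ingredient"
)

# A Series is 1D, so OneHotEncoder refuses it.
try:
    OneHotEncoder(handle_unknown="ignore").fit(labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A one-column DataFrame is 2D and is accepted, but the result is a
# binary indicator matrix with one column per distinct label.
one_hot = (
    OneHotEncoder(handle_unknown="ignore")
    .fit_transform(labels.to_frame())
    .toarray()
)

# LabelEncoder takes 1D input and maps each label to an integer
# (labels are numbered in sorted order).
label_codes = LabelEncoder().fit_transform(labels)
```

Here one_hot has one row per item and one column per distinct label, while label_codes has a single integer per item, which is the shape of output asked for above.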

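【Comments】:

What you want to achieve is label encoding. One-hot encoding returns a binary vector. 【Answer 1】:

To label-encode the lists in a DataFrame column, first fit the encoder on the unique text labels, then use apply to transform each text label in every list into the integer the encoder learned for it. Here is an example: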

In [2]: import pandas as pd

In [3]: from sklearn import preprocessing

In [4]: df = pd.DataFrame({"Day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
   ...:                    "Veggies&Drinks": [["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
   ...:                                       ["flour", "vodka", "mustard", "lettuce", "ginger"],
   ...:                                       ["mustard", "tomatoes", "ginger", "vodka", "tomatoes"],
   ...:                                       ["ginger", "vodka", "lettuce", "tomatoes", "flour"],
   ...:                                       ["mustard", "lettuce", "ginger", "flour", "tomatoes"]]})

In [5]: df
Out[5]:
         Day                                Veggies&Drinks
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]

In [9]: label_encoder = preprocessing.LabelEncoder()

In [19]: list_of_veggies_drinks = ["lettuce","tomatoes","ginger","vodka","flour","mustard"]

In [20]: label_encoder.fit(list_of_veggies_drinks)
Out[20]: LabelEncoder()

In [21]: integer_encoded = df["Veggies&Drinks"].apply(lambda x:label_encoder.transform(x))

In [22]: integer_encoded
Out[22]:
0    [2, 4, 1, 5, 4]
1    [0, 5, 3, 2, 1]
2    [3, 4, 1, 5, 4]
3    [1, 5, 2, 4, 0]
4    [3, 2, 1, 0, 4]
Name: Veggies&Drinks, dtype: object

In [23]: df["Encoded"] = integer_encoded

In [24]: df
Out[24]:
         Day                                Veggies&Drinks          Encoded
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]  [2, 4, 1, 5, 4]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]  [0, 5, 3, 2, 1]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]  [3, 4, 1, 5, 4]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]  [1, 5, 2, 4, 0]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]  [3, 2, 1, 0, 4]
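
The session above types the list of labels by hand. As a minimal sketch of the same approach, the labels can instead be collected from the column itself. The two-row frame here is a shortened, illustrative copy of the example data, and `unique_labels` and `decoded` are names introduced only for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Day": ["Monday", "Tuesday"],
    "Veggies&Drinks": [
        ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
        ["flour", "vodka", "mustard", "lettuce", "ginger"],
    ],
})

# Collect every distinct label that appears in any row.
unique_labels = sorted({item for row in df["Veggies&Drinks"] for item in row})

label_encoder = LabelEncoder().fit(unique_labels)

# Encode each list; tolist() stores plain Python lists instead of NumPy arrays.
df["Encoded"] = df["Veggies&Drinks"].apply(
    lambda row: label_encoder.transform(row).tolist()
)

# inverse_transform maps the integers back to the original labels.
decoded = df["Encoded"].apply(
    lambda codes: label_encoder.inverse_transform(codes).tolist()
)
```

Round-tripping through inverse_transform is a cheap check that every label in the column was seen during fit; transform raises a ValueError on an unseen label.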

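【Comments】:

Thanks for the answer; I have edited the question. What if every row of the pandas column is a list? How can I label-encode the contents of each list while taking into account all the unique labels across all the other lists? 【Answer 2】:

Since you want to apply this directly to a pandas.DataFrame: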

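import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Get a flat array with all the ingredients (mins is the DataFrame from the question)
all_ingr = mins.Ingredients.apply(pd.Series).stack().values

# Fit the encoder on every label that occurs in any list
enc = LabelEncoder()
enc.fit(all_ingr)

# Encode each row's list with the fitted encoder
mins['Ingredients_enc'] = mins.Ingredients.apply(enc.transform)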
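
One caveat, sketched under assumptions: pd.read_csv returns list-like cells as plain strings, so they may need parsing before the code above sees real lists. The in-memory CSV below is a hypothetical stand-in for recipes.csv, and it assumes each cell is a valid Python list literal; the real file may be formatted differently. The sketch also uses Series.explode (pandas 0.25 or later) as an alternative way to flatten the lists.

```python
import ast
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for recipes.csv: each cell is the text of a Python list.
csv_text = '''Ingredients
"['lettuce', 'tomatoes', 'ginger', 'vodka', 'tomatoes']"
"['flour', 'tomatoes', 'vodka', 'vodka', 'mustard']"
'''
mins = pd.read_csv(io.StringIO(csv_text))

# Parse each string into a real list (assumes valid Python list literals).
mins["Ingredients"] = mins["Ingredients"].apply(ast.literal_eval)

# explode() puts every list element on its own row, giving a flat Series.
all_ingr = mins["Ingredients"].explode()

enc = LabelEncoder().fit(all_ingr)

# Encode each row's list and store it as a plain Python list.
mins["Ingredients_enc"] = mins["Ingredients"].apply(
    lambda row: enc.transform(row).tolist()
)
```

If the cells are not valid Python literals (for example, unquoted items), ast.literal_eval will fail and a custom split on the delimiter would be needed instead.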
