pandas.get_dummies

Posted 已删除ddd

Convert categorical variable into dummy/indicator variables

Parameters

data : array-like, Series, or DataFrame

prefix : string, list of strings, or dict of strings, default None
String to prepend to the generated column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep : string, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs; if False, NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
sparse : bool, default False
Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.
New in version 0.18.0.
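dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed. In pandas 2.0 and later the default is bool, so the 0/1 values in the outputs of this post appear as False/True unless an integer dtype such as dtype=int is passed.
New in version 0.23.0.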
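
The columns, drop_first, and dtype parameters are described above but not exercised in the rest of this post. The following is a minimal sketch of how they might be combined; the frame and its column names (city, size, price) are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris"],
    "size": ["S", "M", "L"],
    "price": [10, 20, 30],
})

# Encode only the "city" column; "size" and "price" pass through unchanged.
# dtype=int asks for integer 0/1 indicators regardless of the pandas default.
only_city = pd.get_dummies(df, columns=["city"], dtype=int)
print(only_city.columns.tolist())

# drop_first=True keeps k-1 indicator columns for each encoded column,
# dropping the first level in sorted order ("Paris" and "L" here).
reduced = pd.get_dummies(df, columns=["city", "size"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Dropping one level per feature is a common way to avoid perfectly collinear indicator columns in linear models.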

Create a DataFrame

import pandas as pd

# The column data must be passed as a dict, so the braces are required.
dataframe = pd.DataFrame({"A": ["China", "Japan", "India"], "B": ["UK", "French", "Germany"]})
print(dataframe)
       A        B
0  China       UK
1  Japan   French
2  India  Germany

Basic usage: encode nominal features with one-hot encoding

dataframe = pd.get_dummies(dataframe)
print(dataframe)
Output
   A_China  A_India  A_Japan  B_French  B_Germany  B_UK
0        1        0        0         0          0     1
1        0        0        1         1          0     0
2        0        1        0         0          1     0

Handling missing values: dummy_na

1. Ignore missing values

import numpy as np
# Column "A" contains one missing value (NaN).
dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan], "B": ["UK", "French", "Germany"]})
print(dataframe)
df_nan_ignore = pd.get_dummies(dataframe)
print(df_nan_ignore)
Output
       A        B
0  China       UK
1  Japan   French
2    NaN  Germany
   A_China  A_Japan  B_French  B_Germany  B_UK
0        1        0         0          0     1
1        0        1         1          0     0
2        0        0         0          1     0

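By default, get_dummies() does not encode missing values: a row whose value is NaN gets 0 in every indicator column generated from that column.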
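
To make that behaviour concrete, here is a small sketch that sums the indicator columns derived from column A for each row. It rebuilds the same frame as above; the helper names (encoded, a_cols, row_sums) are only for this illustration, and dtype=int is passed so the sums work the same way on newer pandas versions.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan],
                          "B": ["UK", "French", "Germany"]})
encoded = pd.get_dummies(dataframe, dtype=int)

# Collect the indicator columns that came from column "A".
a_cols = [c for c in encoded.columns if c.startswith("A_")]

# Each non-missing row has exactly one indicator set; the NaN row has none.
row_sums = encoded[a_cols].sum(axis=1)
print(row_sums.tolist())  # [1, 1, 0]
```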

2. Treat missing values as a category

df_nan_as_type = pd.get_dummies(dataframe, dummy_na=True)
print(df_nan_as_type)
Output
   A_China  A_Japan  A_nan  B_French  B_Germany  B_UK  B_nan
0        1        0      0         0          0     1      0
1        0        1      0         1          0     0      0
2        0        0      1         0          1     0      0
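
Note that dummy_na=True adds a _nan column for every encoded column, including B, which has no missing values. One possible way to tidy this up, sketched below under the assumption that all-zero _nan columns are not wanted, is to drop them afterwards; the names empty_nan_cols and trimmed are illustrative only.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan],
                          "B": ["UK", "French", "Germany"]})
df_nan_as_type = pd.get_dummies(dataframe, dummy_na=True, dtype=int)

# Find the "_nan" indicator columns that never fire (no NaN in the source column).
empty_nan_cols = [c for c in df_nan_as_type.columns
                  if c.endswith("_nan") and df_nan_as_type[c].sum() == 0]

# Drop them, keeping only the NaN indicators that carry information.
trimmed = df_nan_as_type.drop(columns=empty_nan_cols)
print(trimmed.columns.tolist())
```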

Prefix: prefix

dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan], "B": ["UK", "French", "Germany"]})
# One prefix per encoded column, in column order.
df_prefix = pd.get_dummies(dataframe, prefix=['Asia', 'Europe'])
print(df_prefix)
Output
   Asia_China  Asia_Japan  Europe_French  Europe_Germany  Europe_UK
0           1           0              0               0          1
1           0           1              1               0          0
2           0           0              0               1          0

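Compared with the earlier results, the new features are named with the values passed to the prefix parameter; by default, the original column names are used as prefixes.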
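
As the parameter list notes, prefix can also be a dictionary mapping column names to prefixes, which avoids relying on column order. A minimal sketch with the same data (dtype=int is passed only to keep the indicators as 0/1 on newer pandas versions):

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan],
                          "B": ["UK", "French", "Germany"]})

# Map each source column to the prefix its indicator columns should carry.
df_prefix_dict = pd.get_dummies(dataframe,
                                prefix={"A": "Asia", "B": "Europe"},
                                dtype=int)
print(df_prefix_dict.columns.tolist())
```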

Prefix separator: prefix_sep

df_prefix_sep = pd.get_dummies(dataframe, prefix=['Asia', 'Europe'], prefix_sep='.')
print(df_prefix_sep)
Output
   Asia.China  Asia.Japan  Europe.French  Europe.Germany  Europe.UK
0           1           0              0               0          1
1           0           1              1               0          0
2           0           0              0               1          0
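
The parameter list also says prefix_sep may be a list or dictionary, like prefix. The sketch below uses a dictionary to give each column its own separator; the separators chosen here ("." and "-") are arbitrary.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan],
                          "B": ["UK", "French", "Germany"]})

# A different separator for each encoded column, keyed by column name.
df_sep_dict = pd.get_dummies(dataframe,
                             prefix_sep={"A": ".", "B": "-"},
                             dtype=int)
print(df_sep_dict.columns.tolist())
```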
