pandas.get_dummies
Convert categorical variable into dummy/indicator variables
Parameters
data : array-like, Series, or DataFrame | Data of which to get dummy indicators. |
prefix : string, list of strings, or dict of strings, default None | String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes. |
prefix_sep : string, default '_' | If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix. |
dummy_na : bool, default False | Add a column to indicate NaNs, if False NaNs are ignored. |
columns : list-like, default None | Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted. |
sparse : bool, default False | Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks. |
drop_first : bool, default False | Whether to get k-1 dummies out of k categorical levels by removing the first level. New in version 0.18.0. |
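dtype : dtype, default np.uint8 | Data type for new columns. Only a single dtype is allowed. New in version 0.23.0. In pandas 2.0 and later the default is bool, so the outputs below print True/False instead of 1/0 unless an integer dtype is passed. |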
Create a DataFrame
import pandas as pd
dataframe = pd.DataFrame({"A": ["China", "Japan", "India"], "B": ["UK", "French", "Germany"]})
print(dataframe)
       A        B
0  China       UK
1  Japan   French
2  India  Germany
Basic usage: one-hot encode the nominal (categorical) features
dataframe = pd.get_dummies(dataframe)
print(dataframe)
Output
   A_China  A_India  A_Japan  B_French  B_Germany  B_UK
0        1        0        0         0          0     1
1        0        0        1         1          0     0
2        0        1        0         0          1     0
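The 0/1 output above reflects the np.uint8 default listed in the parameter table. As a minimal sketch, assuming a newer pandas release where the default indicator dtype is bool, passing dtype=int should reproduce integer indicators. The variable names df and encoded are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", "India"],
                   "B": ["UK", "French", "Germany"]})

# dtype=int requests integer 0/1 indicators regardless of the library default
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
```

The dummy columns within each feature are ordered by sorted category value, which is why A_India comes before A_Japan.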
Handling missing values: dummy_na
1. Ignore missing values
import numpy as np
dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan], "B": ["UK", "French", "Germany"]})
print(dataframe)
df_nan_ignore = pd.get_dummies(dataframe)
print(df_nan_ignore)
Output
       A        B
0  China       UK
1  Japan   French
2    NaN  Germany
   A_China  A_Japan  B_French  B_Germany  B_UK
0        1        0         0          0     1
1        0        1         1          0     0
2        0        0         0          1     0
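By default, get_dummies() does not encode missing values: the row whose A value is NaN gets 0 in every A_ column.
2. Treat missing values as their own category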
df_nan_as_type = pd.get_dummies(dataframe, dummy_na=True)
print(df_nan_as_type)
Output
   A_China  A_Japan  A_nan  B_French  B_Germany  B_UK  B_nan
0        1        0      0         0          0     1      0
1        0        1      0         1          0     0      0
2        0        0      1         0          1     0      0
Prefix: prefix
dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan], "B": ["UK", "French", "Germany"]})
df_prefix = pd.get_dummies(dataframe, prefix=['Asia', 'Europe'])
print(df_prefix)
Output
   Asia_China  Asia_Japan  Europe_French  Europe_Germany  Europe_UK
0           1           0              0               0          1
1           0           1              1               0          0
2           0           0              0               1          0
Compared with the results above, the new feature columns are named with the values given in the prefix parameter. By default, the original column names are used.
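The parameter table also says that prefix can be a dictionary mapping column names to prefixes. A minimal sketch of that form follows. The names df and encoded are illustrative, and dtype=int is passed only to keep the indicators as integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", np.nan],
                   "B": ["UK", "French", "Germany"]})

# Map each source column name to the prefix used for its dummy columns
encoded = pd.get_dummies(df, prefix={"A": "Asia", "B": "Europe"}, dtype=int)
print(list(encoded.columns))
```

The dictionary form ties each prefix to a column name, so it does not depend on column order the way a list does.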
Prefix separator: prefix_sep
df_prefix_sep = pd.get_dummies(dataframe, prefix=['Asia', 'Europe'], prefix_sep='.')
print(df_prefix_sep)
Output
   Asia.China  Asia.Japan  Europe.French  Europe.Germany  Europe.UK
0           1           0              0               0          1
1           0           1              1               0          0
2           0           0              0               1          0
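The columns and drop_first parameters from the table are not demonstrated above. A minimal sketch of both, using the same kind of sample data, is given here. The variable names only_a and reduced are illustrative, and dtype=int is passed only to keep the indicators as integers.

```python
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", "India"],
                   "B": ["UK", "French", "Germany"]})

# columns=["A"] encodes only column A; column B is passed through unchanged
only_a = pd.get_dummies(df, columns=["A"], dtype=int)
print(only_a)

# drop_first=True drops the first level of each feature (here A_China and
# B_French), leaving k-1 dummy columns for k levels
reduced = pd.get_dummies(df, drop_first=True, dtype=int)
print(reduced)
```

Dropping one level per feature is a common way to avoid perfectly collinear indicator columns when the result feeds a linear model.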