Python Programming: An In-Depth Look at How X, y Differ from X_display, y_display in the shap.datasets.adult() Source

Posted by 一个处女座的程序猿



Contents

An in-depth look at X, y and X_display, y_display in the shap.datasets.adult() source

Reading the source

Understanding the source code

Comparing data and raw_data

X.shape

X_display.shape


An in-depth look at X, y and X_display, y_display in the shap.datasets.adult() source

X,y = shap.datasets.adult()
X_display,y_display = shap.datasets.adult(display=True)

Reading the source

import numpy as np
import pandas as pd

# cache() and github_data_url are helpers defined elsewhere in the shap.datasets module.

def adult(display=False):
    """ Return the Adult census data in a nice package. """
    dtypes = [
        ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
        ("Education", "category"), ("Education-Num", "float32"), ("Marital Status", "category"),
        ("Occupation", "category"), ("Relationship", "category"), ("Race", "category"),
        ("Sex", "category"), ("Capital Gain", "float32"), ("Capital Loss", "float32"),
        ("Hours per week", "float32"), ("Country", "category"), ("Target", "category")
    ]
    raw_data = pd.read_csv(
        cache(github_data_url + "adult.data"),
        names=[d[0] for d in dtypes],
        na_values="?",
        dtype=dict(dtypes)
    )
    data = raw_data.drop(["Education"], axis=1)  # redundant with Education-Num
    filt_dtypes = list(filter(lambda x: not (x[0] in ["Target", "Education"]), dtypes))
    # Binarize the target: True where income is >50K (note the leading space in the raw value).
    data["Target"] = data["Target"] == " >50K"
    # Hand-written integer codes for the Relationship column.
    rcode = {
        "Not-in-family": 0,
        "Unmarried": 1,
        "Other-relative": 2,
        "Own-child": 3,
        "Husband": 4,
        "Wife": 5
    }
    # Replace every categorical feature with integer codes.
    for k, dtype in filt_dtypes:
        if dtype == "category":
            if k == "Relationship":
                data[k] = np.array([rcode[v.strip()] for v in data[k]])
            else:
                data[k] = data[k].cat.codes

    # display=True returns the untouched string labels; the target array is the same either way.
    if display:
        return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
    return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values
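The loop in the source encodes most categorical columns with `.cat.codes` but gives Relationship a hand-written mapping. The sketch below contrasts the two on toy columns; the sample values are made up for illustration and only mimic the leading-space style of the raw file.

```python
import numpy as np
import pandas as pd

# Toy column; the leading spaces mimic the raw adult.data values.
relationship = pd.Series(
    [" Husband", " Wife", " Own-child", " Husband"], dtype="category"
)

# Default encoding: codes follow the sorted order of the categories.
auto_codes = relationship.cat.codes
print(list(relationship.cat.categories))  # [' Husband', ' Own-child', ' Wife']
print(auto_codes.tolist())                # [0, 2, 1, 0]

# Hand-written mapping copied from the source; strip() removes the leading space.
rcode = {
    "Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
    "Own-child": 3, "Husband": 4, "Wife": 5,
}
manual_codes = np.array([rcode[v.strip()] for v in relationship])
print(manual_codes.tolist())              # [4, 5, 3, 4]

# A missing value in a categorical column is encoded as -1 by .cat.codes.
workclass = pd.Series([" Private", None, " State-gov"], dtype="category")
missing_codes = workclass.cat.codes
print(missing_codes.tolist())             # [0, -1, 1]
```

With `.cat.codes` the integer assigned to a label depends on which labels happen to be present, whereas the fixed dictionary keeps the Relationship codes stable.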

Understanding the source code

(Figure: two annotated screenshots of the source code, with three red boxes marking the steps discussed below.)

Comparing data and raw_data

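Conclusion
data: a new DataFrame derived from raw_data (the CSV as read in). By the time it is returned as X, three columns have been dropped (Education, Target, fnlwgt; first red box), the target has been binarized to a boolean via Target == " >50K" (second red box), and the categorical features have been converted to integer codes (third red box). The result is an all-numeric DataFrame with a two-class target.
raw_data: the original data as read from the CSV. It only has the same three columns dropped and is otherwise returned unchanged, so X_display keeps the original string labels.
y and y_display: both calls return data["Target"].values, so the two label arrays are the same boolean array.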
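The three steps summarized above can be reproduced in miniature without downloading anything. This is only a sketch under stated assumptions: `mini_adult` is a hypothetical helper, the three rows in `SAMPLE` are made up, and the column list is cut down to four entries of the real dtypes list.

```python
import io

import pandas as pd

# Made-up rows in the adult.data layout (comma plus space), reduced to four columns.
SAMPLE = """39, State-gov, 1000, <=50K
50, Private, 2000, >50K
38, Private, 3000, <=50K
"""


def mini_adult(display=False):
    """Hypothetical cut-down version of adult() working on SAMPLE."""
    dtypes = [("Age", "float32"), ("Workclass", "category"),
              ("fnlwgt", "float32"), ("Target", "category")]
    raw_data = pd.read_csv(
        io.StringIO(SAMPLE),
        names=[d[0] for d in dtypes],
        dtype=dict(dtypes),
    )
    data = raw_data.copy()
    data["Target"] = data["Target"] == " >50K"         # binarize the target
    data["Workclass"] = data["Workclass"].cat.codes    # encode the category
    labels = data["Target"].values
    if display:
        return raw_data.drop(["Target", "fnlwgt"], axis=1), labels
    return data.drop(["Target", "fnlwgt"], axis=1), labels


X, y = mini_adult()
X_display, y_display = mini_adult(display=True)

print(X["Workclass"].tolist())          # [1, 0, 0]
print(X_display["Workclass"].tolist())  # [' State-gov', ' Private', ' Private']
print(y.tolist(), y_display.tolist())   # [False, True, False] twice
```

Only the feature frame differs between the two calls; the label array is computed once from the encoded copy and returned in both cases.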

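X.shape

X.shape: (32561, 12)

Per the source above, X holds integer codes in its categorical columns and uses the column names from dtypes (Age, Workclass, ...). The preview captured here shows string labels and lowercase column names instead, identical to the X_display preview further down.

       age         workclass  ...  hours-per-week  native-country
0       39         State-gov  ...              40   United-States
1       50  Self-emp-not-inc  ...              13   United-States
2       38           Private  ...              40   United-States
3       53           Private  ...              40   United-States
4       28           Private  ...              40            Cuba
...    ...               ...  ...             ...             ...
32556   27           Private  ...              38   United-States
32557   40           Private  ...              40   United-States
32558   58           Private  ...              40   United-States
32559   22           Private  ...              20   United-States
32560   52      Self-emp-inc  ...              40   United-States

[32561 rows x 12 columns]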

   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

X_display.shape

X_display.shape: (32561, 12)

       age         workclass  ...  hours-per-week  native-country
0       39         State-gov  ...              40   United-States
1       50  Self-emp-not-inc  ...              13   United-States
2       38           Private  ...              40   United-States
3       53           Private  ...              40   United-States
4       28           Private  ...              40            Cuba
...    ...               ...  ...             ...             ...
32556   27           Private  ...              38   United-States
32557   40           Private  ...              40   United-States
32558   58           Private  ...              40   United-States
32559   22           Private  ...              20   United-States
32560   52      Self-emp-inc  ...              40   United-States

[32561 rows x 12 columns]

   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13
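Because X and X_display share the same row order, the integer codes in X can be translated back to labels by pairing the two frames row by row. The sketch below does this on toy data; `code_to_label` is a hypothetical helper, not part of shap, and the sample labels are made up.

```python
import pandas as pd


def code_to_label(encoded: pd.Series, display: pd.Series) -> dict:
    """Pair each integer code with the label found on the same row."""
    pairs = pd.DataFrame({
        "code": encoded.to_numpy(),
        # Strip the leading space that the raw file puts in front of each label.
        "label": display.astype(str).str.strip().to_numpy(),
    })
    pairs = pairs.drop_duplicates().sort_values("code")
    return dict(zip(pairs["code"].tolist(), pairs["label"].tolist()))


# Toy stand-ins for X_display["Workclass"] and X["Workclass"].
display_col = pd.Series(
    [" State-gov", " Private", " Private", " Self-emp-inc"], dtype="category"
)
encoded_col = display_col.cat.codes

mapping = code_to_label(encoded_col, display_col)
print(mapping)  # {0: 'Private', 1: 'Self-emp-inc', 2: 'State-gov'}
```

Assuming shap is installed and the dataset can be downloaded, the same helper could be called as code_to_label(X["Workclass"], X_display["Workclass"]).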
