Kaggle Titanic, Part 2

Posted Freeman耀


Next, let's look at survival by sex within each passenger class.

import matplotlib.pyplot as plt

fig = plt.figure()
fig.set(alpha=0.5)
fig.suptitle("Survival by passenger class and sex")

# One panel per sex and class group: "high" is Pclass 1 or 2, "low" is Pclass 3.
# sort_index() puts Survived == 0 first, so the tick labels below match the bars.
labels = ["unsurvived", "survived"]

ax1 = fig.add_subplot(141)
female_high = (data_train.Sex == 'female') & (data_train.Pclass != 3)
data_train.Survived[female_high].value_counts().sort_index().plot(kind='bar', ax=ax1, color='#FA2479')
ax1.set_xticklabels(labels, rotation=0)
ax1.legend(["female/high_level"], loc='best')

ax2 = fig.add_subplot(142, sharey=ax1)
female_low = (data_train.Sex == 'female') & (data_train.Pclass == 3)
data_train.Survived[female_low].value_counts().sort_index().plot(kind='bar', ax=ax2, color='pink')
ax2.set_xticklabels(labels, rotation=0)
ax2.legend(["female/low_level"], loc='best')

ax3 = fig.add_subplot(143, sharey=ax1)
male_high = (data_train.Sex == 'male') & (data_train.Pclass != 3)
data_train.Survived[male_high].value_counts().sort_index().plot(kind='bar', ax=ax3, color='lightblue')
ax3.set_xticklabels(labels, rotation=0)
ax3.legend(["male/high_level"], loc='best')

ax4 = fig.add_subplot(144, sharey=ax1)
male_low = (data_train.Sex == 'male') & (data_train.Pclass == 3)
data_train.Survived[male_low].value_counts().sort_index().plot(kind='bar', ax=ax4, color='steelblue')
ax4.set_xticklabels(labels, rotation=0)
ax4.legend(["male/low_level"], loc='best')

plt.show()

This gives a figure with four bar-chart panels, one per sex and class group.
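The four panels can also be summarised as survival rates. The following is only a minimal sketch on a few made-up rows: the `toy` frame and the `ClassGroup` column are illustrative names, not part of the original notebook, and the numbers are not the real Kaggle figures.

```python
import pandas as pd

# Toy stand-in for data_train (made-up rows, NOT the real Kaggle data).
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 2, 3, 1, 3, 3, 2, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# Same split as the four panels: classes 1 and 2 ("high") versus class 3 ("low").
toy["ClassGroup"] = toy["Pclass"].map(lambda c: "low" if c == 3 else "high")

# The mean of a 0/1 column is the share of survivors in each group.
rates = toy.groupby(["Sex", "ClassGroup"])["Survived"].mean()
print(rates)
```

Running the same `groupby` on `data_train` would give one number per panel, which is easier to compare than bar heights.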

Next, let's see whether belonging to a large family affects the outcome.

# Count passengers for each (SibSp, Survived) combination.
g = data_train.groupby(['SibSp', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])

print(df)

 

 

                PassengerId
SibSp Survived
0     0                 398
      1                 210
1     0                  97
      1                 112
2     0                  15
      1                  13
3     0                  12
      1                   4
4     0                  15
      1                   3
5     0                   5
8     0                   7

# Count passengers for each (Parch, Survived) combination.
g = data_train.groupby(['Parch', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
print(df)

                PassengerId
Parch Survived
0     0                 445
      1                 233
1     0                  53
      1                  65
2     0                  40
      1                  40
3     0                   2
      1                   3
4     0                   4
5     0                   4
      1                   1
6     0                   1


No obvious relationship stands out here, so keep both as candidate features for now.
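One way to double-check that impression is to turn the SibSp counts printed above into survival rates. This is a small sketch, not part of the original analysis: the counts are copied from the SibSp table, while the column names `died`, `survived`, `total` and `survival_rate` are chosen here for illustration.

```python
import pandas as pd

# (SibSp, died, survived) copied from the SibSp table above; 0 where no row was printed.
rows = [(0, 398, 210), (1, 97, 112), (2, 15, 13), (3, 12, 4),
        (4, 15, 3), (5, 5, 0), (8, 7, 0)]
sibsp = pd.DataFrame(rows, columns=["SibSp", "died", "survived"]).set_index("SibSp")

sibsp["total"] = sibsp["died"] + sibsp["survived"]
sibsp["survival_rate"] = (sibsp["survived"] / sibsp["total"]).round(3)
print(sibsp)
```

The groups with three or more siblings or spouses aboard are small (between 5 and 18 passengers each), so their rates should be read with caution. That is consistent with keeping SibSp as a candidate rather than a confirmed feature.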

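Ticket is the ticket number. It is essentially an identifier (though not strictly unique: the two Palsson rows in the output further down share ticket 349909), so it should have little bearing on the outcome and is left out of the feature set.
Cabin has a value for only 204 passengers. Let's first look at how it is distributed.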
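The distribution can be inspected with `value_counts()`. The sketch below runs on a handful of made-up cabin values (the `toy` frame is hypothetical); on the real data the same calls would be applied to `data_train.Cabin`.

```python
import numpy as np
import pandas as pd

# Made-up cabin values for illustration only; NaN marks a missing record.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "C123", np.nan, "E46", "C123", np.nan, "G6"]})

n_known = int(toy["Cabin"].notnull().sum())  # how many passengers have a value at all
counts = toy["Cabin"].value_counts()         # NaN is excluded by default
n_once = int((counts == 1).sum())            # how many distinct values occur exactly once

print(n_known, "known values;", n_once, "of", len(counts), "distinct values occur once")
print(counts)
```

A large share of values that occur only once is what makes Cabin awkward as a plain categorical feature.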

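The distribution is uneven, and Cabin should be treated as a categorical attribute. It already has many missing values, and the values that do exist are widely scattered, so it is bound to be tricky. Handled directly as a categorical feature it is too sparse: once factorized into dummy variables, each resulting feature would probably receive hardly any weight. With this many missing values, one option is to use the presence or absence of a Cabin value as the feature (although a missing value does not necessarily mean that no cabin was registered; the record may simply have been lost, so this is not guaranteed to be sound). Let's first look at Survived at this coarse granularity: Cabin information present versus absent.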

# Cabin values are very scattered: the vast majority appear only once, so adding
# Cabin as a categorical feature is unlikely to help.
# Instead, look at how the presence or absence of a value relates to survival.
fig, ax = plt.subplots()
fig.set(alpha=0.2)  # set the figure's alpha

Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df = pd.DataFrame({'Notnull': Survived_cabin, 'null': Survived_nocabin}).transpose()
df.plot(kind='bar', stacked=True, ax=ax)  # draw on the figure created above
ax.set_title("Survival by presence of a Cabin value")
ax.set_xlabel("Cabin value present or missing")
ax.set_ylabel("Number of passengers")
plt.show()

# Passengers with a Cabin record seem to survive at a slightly higher rate, so split
# this attribute into two classes (Cabin present / Cabin missing) and add it to the
# categorical features later.

Passengers with a Cabin value do appear to have a somewhat higher survival rate.
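The same comparison can be expressed as two rates rather than stacked bars. A minimal sketch on made-up rows (the `toy` frame and the `has_cabin` name are assumptions for illustration, and the resulting numbers say nothing about the real data):

```python
import pandas as pd

# Made-up rows, NOT the real Kaggle data; None marks a missing Cabin record.
toy = pd.DataFrame({
    "Cabin":    ["C85", None, "C123", None, None, "E46", None, None],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Collapse Cabin to "Yes"/"No" depending on whether a value is present.
has_cabin = toy["Cabin"].notnull().map({True: "Yes", False: "No"})

# Share of survivors within each group.
rate = toy.groupby(has_cabin)["Survived"].mean()
print(rate)
```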

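So we start with the two attributes that stand out most, Cabin and Age, because their missing values have a large effect on any further analysis.

Cabin: following the analysis above, for now turn this attribute into two categories, Yes and No, according to whether a Cabin value is present.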

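Age: missing values like these are usually handled in one of the following ways.

1. If the samples with a missing value make up a very high share of the total, the attribute may have to be dropped outright; adding it as a feature could introduce noise and hurt the final result.

2. If a moderate number of samples are missing the value and the attribute is not continuous, treat NaN as a new category and add it to the categorical features.

3. If a moderate number of samples are missing the value and the attribute is continuous, one option is to pick a step (for age here, a step of 2 or 3 years, say), discretize the attribute, and then add NaN as one more category.

4. In some cases only a few values are missing, and we can try to fit them from the existing values and fill them in.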
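Option 3 can be sketched as follows. This is only an illustration under stated assumptions: the ages are made up, and the step of 3 years, the upper edge of 81 and the `"Unknown"` label are arbitrary choices made here, not values from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; NaN marks a missing value.
ages = pd.Series([22.0, np.nan, 4.0, 38.0, np.nan, 61.0], name="Age")

step = 3                                   # assumed bin width in years
edges = np.arange(0, 81 + step, step)      # 0, 3, 6, ..., 81
labels = [f"{lo}-{lo + step - 1}" for lo in edges[:-1]]  # "0-2", "3-5", ...

# Discretize into left-closed bins [0, 3), [3, 6), ...; missing ages stay NaN.
age_bin = pd.cut(ages, bins=edges, right=False, labels=labels)

# Treat "missing" as one more category instead of dropping or imputing it.
age_bin = age_bin.cat.add_categories("Unknown").fillna("Unknown")
print(age_bin.astype(str).tolist())
```

The binned column can then be handled like any other categorical feature.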

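In this example the last two approaches should both be workable. We will try filling in the values first.

We use the RandomForest from scikit-learn to fit the missing age data.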

from sklearn.ensemble import RandomForestRegressor


def set_missing_ages(df):
    '''
    Fill in missing ages with a RandomForestRegressor.
    :param df: passenger DataFrame with Age, Fare, Parch, SibSp and Pclass columns
    :return: the DataFrame with Age filled in, and the fitted regressor
    '''
    # Take the existing numeric features and feed them to the Random Forest Regressor
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # Split passengers into those with a known age and those without.
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
    known_age = age_df[age_df.Age.notnull()].to_numpy()
    unknown_age = age_df[age_df.Age.isnull()].to_numpy()

    y = known_age[:, 0]   # y is the target, the age
    X = known_age[:, 1:]  # X holds the feature values

    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)

    predictedAges = rfr.predict(unknown_age[:, 1:])
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    return df, rfr


def set_Cabin_type(df):
    # "Yes" if the passenger has cabin information, "No" otherwise
    df.loc[(df.Cabin.notnull()), 'Cabin'] = "Yes"
    df.loc[(df.Cabin.isnull()), 'Cabin'] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
print(data_train)

 

    PassengerId  Survived  Pclass  Name                                               Sex           Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
 0            1         0       3  Braund, Mr. Owen Harris                            male    22.000000      1      0  A/5 21171          7.2500  No     S
 1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.000000      1      0  PC 17599          71.2833  Yes    C
 2            3         1       3  Heikkinen, Miss. Laina                             female  26.000000      0      0  STON/O2. 3101282   7.9250  No     S
 3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.000000      1      0  113803            53.1000  Yes    S
 4            5         0       3  Allen, Mr. William Henry                           male    35.000000      0      0  373450             8.0500  No     S
 5            6         0       3  Moran, Mr. James                                   male    23.828953      0      0  330877             8.4583  No     Q
 6            7         0       1  McCarthy, Mr. Timothy J                            male    54.000000      0      0  17463             51.8625  Yes    S
 7            8         0       3  Palsson, Master. Gosta Leonard                     male     2.000000      3      1  349909            21.0750  No     S
 8            9         1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.000000      0      2  347742            11.1333  No     S
 9           10         1       2  Nasser, Mrs. Nicholas (Adele Achem)                female  14.000000      1      0  237736            30.0708  No     C
10           11         1       3  Sandstrom, Miss. Marguerite Rut                    female   4.000000      1      1  PP 9549           16.7000  Yes    S
11           12         1       1  Bonnell, Miss. Elizabeth                           female  58.000000      0      0  113783            26.5500  Yes    S
12           13         0       3  Saundercock, Mr. William Henry                     male    20.000000      0      0  A/5. 2151          8.0500  No     S
13           14         0       3  Andersson, Mr. Anders Johan                        male    39.000000      1      5  347082            31.2750  No     S
14           15         0       3  Vestrom, Miss. Hulda Amanda Adolfina               female  14.000000      0      0  350406             7.8542  No     S
15           16         1       2  Hewlett, Mrs. (Mary D Kingcome)                    female  55.000000      0      0  248706            16.0000  No     S
16           17         0       3  Rice, Master. Eugene                               male     2.000000      4      1  382652            29.1250  No     Q
17           18         1       2  Williams, Mr. Charles Eugene                       male    32.066493      0      0  244373            13.0000  No     S
18           19         0       3  Vander Planke, Mrs. Julius (Emelia Maria Vande...  female  31.000000      1      0  345763            18.0000  No     S
19           20         1       3  Masselmani, Mrs. Fatima                            female  29.518205      0      0  2649               7.2250  No     C
20           21         0       2  Fynney, Mr. Joseph J                               male    35.000000      0      0  239865            26.0000  No     S
21           22         1       2  Beesley, Mr. Lawrence                              male    34.000000      0      0  248698            13.0000  Yes    S
22           23         1       3  McGowan, Miss. Anna "Annie"                        female  15.000000      0      0  330923             8.0292  No     Q
23           24         1       1  Sloper, Mr. William Thompson                       male    28.000000      0      0  113788            35.5000  Yes    S
24           25         0       3  Palsson, Miss. Torborg Danira                      female   8.000000      3      1  349909            21.0750  No     S
25           26         1       3  Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...  female  38.000000      1      5  347077            31.3875  No     S
26           27         0       3  Emir, Mr. Farred Chehab                            male    29.518205      0      0  2631               7.2250  No     C
27           28  ...
