Kaggle Titanic, Part 2
Posted by Freeman耀
Next, let's look at how survival breaks down by sex within each cabin class.
    import matplotlib.pyplot as plt

    # data_train is the training DataFrame loaded in the previous part.
    fig = plt.figure()
    fig.set(alpha=0.5)
    fig.suptitle("Survival by cabin class and sex")

    # sort_index() fixes the bar order as 0 (unsurvived), 1 (survived) in every
    # panel, so the tick labels always match the bars.
    ax1 = fig.add_subplot(141)
    data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass != 3)].value_counts().sort_index().plot(kind='bar', ax=ax1, color='#FA2479')
    ax1.set_xticklabels(["unsurvived", "survived"], rotation=0)
    ax1.legend(["female/high_level"], loc='best')

    ax2 = fig.add_subplot(142, sharey=ax1)
    data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass == 3)].value_counts().sort_index().plot(kind='bar', ax=ax2, color='pink')
    ax2.set_xticklabels(["unsurvived", "survived"], rotation=0)
    ax2.legend(["female/low_level"], loc='best')

    ax3 = fig.add_subplot(143, sharey=ax1)
    data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass != 3)].value_counts().sort_index().plot(kind='bar', ax=ax3, color='lightblue')
    ax3.set_xticklabels(["unsurvived", "survived"], rotation=0)
    ax3.legend(["male/high_level"], loc='best')

    ax4 = fig.add_subplot(144, sharey=ax1)
    data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass == 3)].value_counts().sort_index().plot(kind='bar', ax=ax4, color='steelblue')
    ax4.set_xticklabels(["unsurvived", "survived"], rotation=0)
    ax4.legend(["male/low_level"], loc='best')

    plt.show()
The result is a row of four bar charts, one for each combination of sex and class level.
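Bar charts show raw counts, and the four groups differ in size, so it can help to compute survival rates as well. The following is a minimal sketch on a small made-up sample; the `sample` frame and the `Level` name are illustrative assumptions and not part of the original data.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the Titanic training set.
sample = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 2, 3, 1, 3, 3, 2, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# Same split as the plot: classes 1 and 2 count as "high", class 3 as "low".
level = sample["Pclass"].map(lambda c: "low" if c == 3 else "high").rename("Level")

# Rows: (Sex, Level); columns: Survived (0 or 1); values: passenger counts.
counts = pd.crosstab([sample["Sex"], level], sample["Survived"])

# Share of survivors within each group.
survival_rate = counts[1] / counts.sum(axis=1)

print(counts)
print(survival_rate)
```

On the real data, replacing `sample` with `data_train` should give the rates behind the four panels.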
Next, let's see whether belonging to a large family affects the outcome.
    import pandas as pd

    # Count passengers for each (SibSp, Survived) combination.
    g = data_train.groupby(['SibSp', 'Survived'])
    df = pd.DataFrame(g.count()['PassengerId'])

    print(df)
| SibSp | Survived | PassengerId |
| 0 | 0 | 398 |
| 0 | 1 | 210 |
| 1 | 0 | 97 |
| 1 | 1 | 112 |
| 2 | 0 | 15 |
| 2 | 1 | 13 |
| 3 | 0 | 12 |
| 3 | 1 | 4 |
| 4 | 0 | 15 |
| 4 | 1 | 3 |
| 5 | 0 | 5 |
| 8 | 0 | 7 |
    # Count passengers for each (Parch, Survived) combination.
    g = data_train.groupby(['Parch', 'Survived'])
    df = pd.DataFrame(g.count()['PassengerId'])
    print(df)
| Parch | Survived | PassengerId |
| 0 | 0 | 445 |
| 0 | 1 | 233 |
| 1 | 0 | 53 |
| 1 | 1 | 65 |
| 2 | 0 | 40 |
| 2 | 1 | 40 |
| 3 | 0 | 2 |
| 3 | 1 | 3 |
| 4 | 0 | 4 |
| 5 | 0 | 4 |
| 5 | 1 | 1 |
| 6 | 0 | 1 |
No clear relationship stands out, so for now we keep both as candidate features.
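Raw counts are hard to compare across groups of very different sizes, so one option is to convert them to survival rates. The sketch below uses the counts printed in the two groupby outputs above; the helper name `survival_rates` and the column names `died`, `survived`, `total` and `rate` are illustrative choices, not part of the original code.

```python
import pandas as pd

# Counts copied from the two groupby outputs: {value: (died, survived)}.
sibsp_counts = {0: (398, 210), 1: (97, 112), 2: (15, 13), 3: (12, 4),
                4: (15, 3), 5: (5, 0), 8: (7, 0)}
parch_counts = {0: (445, 233), 1: (53, 65), 2: (40, 40), 3: (2, 3),
                4: (4, 0), 5: (4, 1), 6: (1, 0)}


def survival_rates(counts):
    """Return group sizes and survival rates for each key."""
    table = pd.DataFrame.from_dict(counts, orient="index", columns=["died", "survived"])
    table["total"] = table["died"] + table["survived"]
    table["rate"] = (table["survived"] / table["total"]).round(3)
    return table


sibsp_rates = survival_rates(sibsp_counts)
parch_rates = survival_rates(parch_counts)

print(sibsp_rates)
print(parch_rates)
```

For example, SibSp = 0 gives 210 / 608, roughly 0.345. Groups with only a handful of passengers (SibSp of 5 or 8, Parch of 3 or more) are too small for their rates to be read with much confidence.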
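Ticket is the ticket number. It is essentially an identifier and is unlikely to have much bearing on the outcome, so we leave it out of the feature set.
Cabin has a value for only 204 passengers. Let's first look at its distribution.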
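The output for this step is not reproduced here. As a minimal sketch of how such a distribution can be inspected, the example below runs `value_counts()` on a small made-up series; the `cabin` values are hypothetical stand-ins for `data_train.Cabin`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample; in the article this would be data_train.Cabin.
cabin = pd.Series(["C85", np.nan, "C123", np.nan, "E46", "C123", np.nan, "G6"],
                  name="Cabin")

# value_counts() skips NaN by default, so only recorded cabins are counted.
cabin_counts = cabin.value_counts()
print(cabin_counts)

# Number of passengers with any cabin recorded.
n_known = int(cabin.notnull().sum())
print(n_known)
```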
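The values are unevenly distributed, and Cabin is best treated as a categorical attribute. It has many missing values and the recorded ones are widely scattered, which makes it hard to use. Treated directly as a categorical feature it is too sparse, and each one-hot encoded column would probably receive very little weight. Given how many values are missing, a first step is to use whether Cabin is missing as the feature (a missing value may not mean that no cabin was registered; the record may simply have been lost, so this is not necessarily appropriate). Let's first look at Survived at this coarse level: Cabin present versus Cabin absent.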
    # Cabin values are too scattered (most appear only once), so as a plain
    # categorical feature Cabin may not be useful.
    # Instead, check how the presence or absence of a Cabin value relates to survival.
    Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
    Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
    df = pd.DataFrame({'Notnull': Survived_cabin, 'null': Survived_nocabin}).transpose()

    # df.plot() creates its own figure, so no separate plt.figure() call is needed.
    ax = df.plot(kind='bar', stacked=True)
    ax.set_title("Survival by presence of a Cabin value")
    ax.set_xlabel("Cabin recorded or not")
    ax.set_ylabel("Number of passengers")
    plt.show()

    # Passengers with a Cabin record seem to survive at a slightly higher rate, so
    # for now split this column into two classes (has Cabin / no Cabin) and add it
    # to the categorical features later.
Passengers with a Cabin record appear to have a somewhat higher survival rate.
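The stacked bars show counts; the same comparison can be expressed as a rate per group. A minimal sketch on a made-up sample follows (the `sample` frame is hypothetical, and the `Notnull` / `null` labels simply mirror the plot above):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with the same column names as the training set.
sample = pd.DataFrame({
    "Cabin":    ["C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Label each passenger by whether a Cabin value is recorded.
has_cabin = sample["Cabin"].notnull().map({True: "Notnull", False: "null"})

# Mean of a 0/1 column is the share of survivors in each group.
rate_by_cabin = sample.groupby(has_cabin)["Survived"].mean()
print(rate_by_cabin)
```

Applied to `data_train`, the same two lines would quantify how much higher the survival rate is for passengers with a Cabin record.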
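So we start with the two attributes that stand out most, Cabin and Age, because their missing values have the greatest effect on the rest of the analysis.
Cabin: for now, following the analysis above, convert this attribute into two classes, Yes and No, according to whether a Cabin value is present.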
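Age: there are several common ways to handle the missing values.
1. If a very large share of the samples is missing the value, we may simply have to drop the attribute; adding it as a feature could introduce noise and hurt the final result.
2. If a moderate share is missing and the attribute is not continuous, treat NaN as a new category and add it to the categorical feature.
3. If a moderate share is missing and the attribute is continuous, one option is to pick a step (for Age, say every 2 or 3 years), discretize the values, and then add NaN as its own category.
4. If only a few values are missing, try fitting a model on the known values and filling in the predictions.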
In this case the last two approaches should both be workable; we start by filling in the missing values.
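For comparison, the third approach (binning a continuous attribute and giving NaN its own category) can be sketched as follows. This is an illustration under assumptions: the 3-year step, the upper bound of 81, the string labels and the `Unknown` category name are all choices made here, not taken from the original.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
ages = pd.Series([22.0, 38.0, np.nan, 2.0, 54.0, np.nan, 14.0], name="Age")

step = 3                                   # assumed bin width in years
bins = np.arange(0, 84, step)              # edges 0, 3, 6, ..., 81
labels = [f"{lo}-{lo + step}" for lo in bins[:-1]]

# Left-closed bins: 22 falls in [21, 24), labelled "21-24".
age_band = pd.cut(ages, bins=bins, right=False, labels=labels)

# Missing ages stay NaN after cut; give them their own category.
age_band = age_band.cat.add_categories("Unknown").fillna("Unknown")
print(age_band)
```

The resulting column is categorical, so it can be one-hot encoded together with the other categorical features.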
We use a random forest from scikit-learn to fit the missing ages.
    from sklearn.ensemble import RandomForestRegressor

    def set_missing_ages(df):
        '''
        Fill in missing ages using a RandomForestRegressor.
        :param df: DataFrame with Age, Fare, Parch, SibSp and Pclass columns
        :return: the DataFrame with Age filled in, and the fitted regressor
        '''
        # Take the existing numeric features and feed them to the Random Forest Regressor
        age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
        # Split passengers into those with a known age and those without
        # (to_numpy() replaces as_matrix(), which was removed from pandas)
        known_age = age_df[age_df.Age.notnull()].to_numpy()
        unknown_age = age_df[age_df.Age.isnull()].to_numpy()

        y = known_age[:, 0]   # y is the target: age
        X = known_age[:, 1:]  # X holds the feature values

        rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
        rfr.fit(X, y)

        # Predict the missing ages and write them back into the original frame
        predictedAges = rfr.predict(unknown_age[:, 1:])
        df.loc[(df.Age.isnull()), 'Age'] = predictedAges
        return df, rfr


    def set_Cabin_type(df):
        # "Yes" if cabin information is present, "No" if it is missing
        df.loc[(df.Cabin.notnull()), 'Cabin'] = "Yes"
        df.loc[(df.Cabin.isnull()), 'Cabin'] = "No"
        return df

    data_train, rfr = set_missing_ages(data_train)
    data_train = set_Cabin_type(data_train)
    print(data_train)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | No | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | Yes | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | No | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1 | 0 | 113803 | 53.1000 | Yes | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0 | 0 | 373450 | 8.0500 | No | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 23.828953 | 0 | 0 | 330877 | 8.4583 | No | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.000000 | 0 | 0 | 17463 | 51.8625 | Yes | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.000000 | 3 | 1 | 349909 | 21.0750 | No | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.000000 | 0 | 2 | 347742 | 11.1333 | No | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.000000 | 1 | 0 | 237736 | 30.0708 | No | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.000000 | 1 | 1 | PP 9549 | 16.7000 | Yes | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.000000 | 0 | 0 | 113783 | 26.5500 | Yes | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.000000 | 0 | 0 | A/5. 2151 | 8.0500 | No | S |
| 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.000000 | 1 | 5 | 347082 | 31.2750 | No | S |
| 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.000000 | 0 | 0 | 350406 | 7.8542 | No | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.000000 | 0 | 0 | 248706 | 16.0000 | No | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.000000 | 4 | 1 | 382652 | 29.1250 | No | Q |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | 32.066493 | 0 | 0 | 244373 | 13.0000 | No | S |
| 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.000000 | 1 | 0 | 345763 | 18.0000 | No | S |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | 29.518205 | 0 | 0 | 2649 | 7.2250 | No | C |
| 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.000000 | 0 | 0 | 239865 | 26.0000 | No | S |
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.000000 | 0 | 0 | 248698 | 13.0000 | Yes | S |
| 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.000000 | 0 | 0 | 330923 | 8.0292 | No | Q |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.000000 | 0 | 0 | 113788 | 35.5000 | Yes | S |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.000000 | 3 | 1 | 349909 | 21.0750 | No | S |
| 25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.000000 | 1 | 5 | 347077 | 31.3875 | No | S |
| 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | 29.518205 | 0 | 0 | 2631 | 7.2250 | No | C |
| 27 | 28 |