Titanic Survival Prediction: Source Code and Analysis

Posted smartcat994
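
As a compact preview of the preprocessing steps the notebook below walks through (mean and mode imputation, label mapping, one-hot encoding, and extracting a title from a name), here is a minimal sketch on a hypothetical four-row toy frame. The toy rows and the helper `get_title` are illustrative assumptions and are not part of the original notebook; only the column names mirror the Titanic data.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns (not the real dataset).
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Peter, Master. Michael J",
        "Allen, Mr. William Henry",
    ],
    "Age": [22.0, None, 4.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Sex": ["male", "female", "male", "male"],
})

# Numeric column: fill missing values with the column mean.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill missing values with the most frequent category.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Binary category: map the labels to integers.
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Multi-valued category: one-hot encode, then drop the original column.
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
toy = pd.concat([toy, embarked_dummies], axis=1).drop("Embarked", axis=1)


def get_title(name: str) -> str:
    """Hypothetical helper: 'Surname, Title. Given names' -> 'Title'."""
    after_comma = name.split(",")[1]          # ' Mr. Owen Harris'
    return after_comma.split(".")[0].strip()  # 'Mr'


toy["Title"] = toy["Name"].map(get_title)
print(toy)
```

The sketch computes the mode instead of hard-coding a category, so the same lines would still work if the most frequent value changed; the notebook itself fills `Embarked` with the literal `'S'` after inspecting `value_counts()`.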


{
"cells": [
{
"cell_type": "markdown",
"source": [
"# fk ",
" "
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Contents ",
"1. Define the problem (Business Understanding) ",
"2. Understand the data (Data Understanding) ",
" * Collect the data ",
" * Import the data ",
" * Inspect the dataset ",
"3. Clean the data (Data Preparation) ",
" * Data preprocessing ",
" * Feature engineering ",
"4. Build the model (Modeling) ",
"5. Evaluate the model (Evaluation) ",
"6. Deploy the solution (Deployment) ",
" * Submit the results to Kaggle ",
" * Write the report ",
" ",
" ",
" "
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"# 2. Understanding the Data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 2.1 Collecting the Data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Download the data from the Kaggle Titanic competition page: https://www.kaggle.com/c/titanic"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 2.2 Importing the Data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 776,
"outputs": [],
"source": [
"# Suppress warning messages ",
"import warnings ",
"warnings.filterwarnings('ignore') ",
" ",
"# Import the data-handling and modeling packages ",
"import numpy as np ",
"import pandas as pd ",
"from sklearn.model_selection import train_test_split ",
"from sklearn.linear_model import LogisticRegression"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 777,
"outputs": [
{
"name": "stdout",
"text": [
"Training set: (891, 12) Test set: (418, 11) "
],
"output_type": "stream"
}
],
"source": [
"# Load the data ",
"# Training set ",
"train = pd.read_csv('C:/Users/Administrator/Desktop/ml/file/train.csv', sep=',', encoding='gbk') ",
"# Test set ",
"test = pd.read_csv('C:/Users/Administrator/Desktop/ml/file/test.csv', sep=',', encoding='gbk') ",
"# Remember that the training set has 891 rows: this makes it easy to split the Kaggle test rows back out of the merged data later for the submission ",
"print('Training set:', train.shape, 'Test set:', test.shape)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 778,
"outputs": [
{
"name": "stdout",
"text": [
"Rows in the Kaggle training set: 891 , rows in the Kaggle test set: 418 "
],
"output_type": "stream"
}
],
"source": [
"rowNum_train = train.shape[0] ",
"rowNum_test = test.shape[0] ",
"print('Rows in the Kaggle training set:', rowNum_train, ",
" ', rows in the Kaggle test set:', rowNum_test)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 779,
"outputs": [
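{
"name": "stdout",
"text": [
"Merged dataset: (1309, 12) "
],
"output_type": "stream"
}
],
"source": [
"# Merge the two datasets so both can be cleaned at the same time ",
"# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([train, test], ignore_index=True) ",
"full = train.append(test, ignore_index=True) ",
" ",
"print('Merged dataset:', full.shape)"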
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 2.3 Inspecting the Dataset"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 780,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\ 0 22.0 NaN S 7.2500 1 38.0 C85 C 71.2833 2 26.0 NaN S 7.9250 3 35.0 C123 S 53.1000 4 35.0 NaN S 8.0500 Name Parch PassengerId \\ 0 Braund, Mr. Owen Harris 0 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 2 Heikkinen, Miss. Laina 0 3 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 4 Allen, Mr. William Henry 0 5 Pclass Sex SibSp Survived Ticket 0 3 male 1 0.0 A/5 21171 1 1 female 1 1.0 PC 17599 2 3 female 0 1.0 STON/O2. 3101282 3 1 female 1 1.0 113803 4 3 male 0 0.0 373450 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Cabin</th> <th>Embarked</th> <th>Fare</th> <th>Name</th> <th>Parch</th> <th>PassengerId</th> <th>Pclass</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Ticket</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>22.0</td> <td>NaN</td> <td>S</td> <td>7.2500</td> <td>Braund, Mr. Owen Harris</td> <td>0</td> <td>1</td> <td>3</td> <td>male</td> <td>1</td> <td>0.0</td> <td>A/5 21171</td> </tr> <tr> <td>1</td> <td>38.0</td> <td>C85</td> <td>C</td> <td>71.2833</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>0</td> <td>2</td> <td>1</td> <td>female</td> <td>1</td> <td>1.0</td> <td>PC 17599</td> </tr> <tr> <td>2</td> <td>26.0</td> <td>NaN</td> <td>S</td> <td>7.9250</td> <td>Heikkinen, Miss. Laina</td> <td>0</td> <td>3</td> <td>3</td> <td>female</td> <td>0</td> <td>1.0</td> <td>STON/O2. 3101282</td> </tr> <tr> <td>3</td> <td>35.0</td> <td>C123</td> <td>S</td> <td>53.1000</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>0</td> <td>4</td> <td>1</td> <td>female</td> <td>1</td> <td>1.0</td> <td>113803</td> </tr> <tr> <td>4</td> <td>35.0</td> <td>NaN</td> <td>S</td> <td>8.0500</td> <td>Allen, Mr. William Henry</td> <td>0</td> <td>5</td> <td>3</td> <td>male</td> <td>0</td> <td>0.0</td> <td>373450</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 780
}
],
"source": [
"# Preview the data ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 781,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Pclass \\ count 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 mean 29.881138 33.295479 0.385027 655.000000 2.294882 std 14.413493 51.758668 0.865560 378.020061 0.837836 min 0.170000 0.000000 0.000000 1.000000 1.000000 25% 21.000000 7.895800 0.000000 328.000000 2.000000 50% 28.000000 14.454200 0.000000 655.000000 3.000000 75% 39.000000 31.275000 0.000000 982.000000 3.000000 max 80.000000 512.329200 9.000000 1309.000000 3.000000 SibSp Survived count 1309.000000 891.000000 mean 0.498854 0.383838 std 1.041658 0.486592 min 0.000000 0.000000 25% 0.000000 0.000000 50% 0.000000 0.000000 75% 1.000000 1.000000 max 8.000000 1.000000 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Fare</th> <th>Parch</th> <th>PassengerId</th> <th>Pclass</th> <th>SibSp</th> <th>Survived</th> </tr> </thead> <tbody> <tr> <td>count</td> <td>1046.000000</td> <td>1308.000000</td> <td>1309.000000</td> <td>1309.000000</td> <td>1309.000000</td> <td>1309.000000</td> <td>891.000000</td> </tr> <tr> <td>mean</td> <td>29.881138</td> <td>33.295479</td> <td>0.385027</td> <td>655.000000</td> <td>2.294882</td> <td>0.498854</td> <td>0.383838</td> </tr> <tr> <td>std</td> <td>14.413493</td> <td>51.758668</td> <td>0.865560</td> <td>378.020061</td> <td>0.837836</td> <td>1.041658</td> <td>0.486592</td> </tr> <tr> <td>min</td> <td>0.170000</td> <td>0.000000</td> <td>0.000000</td> <td>1.000000</td> <td>1.000000</td> <td>0.000000</td> <td>0.000000</td> </tr> <tr> <td>25%</td> <td>21.000000</td> <td>7.895800</td> <td>0.000000</td> <td>328.000000</td> <td>2.000000</td> <td>0.000000</td> <td>0.000000</td> </tr> <tr> <td>50%</td> <td>28.000000</td> <td>14.454200</td> <td>0.000000</td> <td>655.000000</td> <td>3.000000</td> <td>0.000000</td> <td>0.000000</td> </tr> <tr> <td>75%</td> <td>39.000000</td> <td>31.275000</td> <td>0.000000</td> <td>982.000000</td> <td>3.000000</td> <td>1.000000</td> <td>1.000000</td> </tr> <tr> <td>max</td> <td>80.000000</td> <td>512.329200</td> <td>9.000000</td> <td>1309.000000</td> <td>3.000000</td> <td>8.000000</td> <td>1.000000</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 781
}
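],
"source": [
"''' ",
"describe() only reports summary statistics for numeric columns; other types, such as the string columns Name and Cabin, are not shown. ",
"This makes sense: descriptive statistics are computed from numbers, so the column must have a numeric dtype. ",
"''' ",
"# Summary statistics for the numeric columns ",
"full.describe()"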
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 782,
"outputs": [
{
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'> ",
"RangeIndex: 1309 entries, 0 to 1308 ",
"Data columns (total 12 columns): ",
"Age 1046 non-null float64 ",
"Cabin 295 non-null object ",
"Embarked 1307 non-null object ",
"Fare 1308 non-null float64 ",
"Name 1309 non-null object ",
"Parch 1309 non-null int64 ",
"PassengerId 1309 non-null int64 ",
"Pclass 1309 non-null int64 ",
"Sex 1309 non-null object ",
"SibSp 1309 non-null int64 ",
"Survived 891 non-null float64 ",
"Ticket 1309 non-null object ",
"dtypes: float64(3), int64(4), object(5) ",
"memory usage: 122.8+ KB "
],
"output_type": "stream"
},
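{
"data": {
"text/plain": "'\\nThe data has 1309 rows in total.\\nNumeric columns: Age and Fare have missing values:\\n1) Age has 1046 values; 1309-1046=263 are missing, a missing rate of 263/1309=20%\\n2) Fare has 1308 values; 1 is missing\\n\\nString columns:\\n1) Embarked has 1307 values; only 2 are missing, which is few\\n2) Cabin has 295 values; 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is high\\nThis points the way for the next step, data cleaning: only by knowing which columns have missing values can we handle them in a targeted way.\\n'"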
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 782
}
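],
"source": [
"# Check each column's dtype and non-null count ",
"full.info() ",
"''' ",
"The data has 1309 rows in total. ",
"Numeric columns: Age and Fare have missing values: ",
"1) Age has 1046 values; 1309-1046=263 are missing, a missing rate of 263/1309=20% ",
"2) Fare has 1308 values; 1 is missing ",
" ",
"String columns: ",
"1) Embarked has 1307 values; only 2 are missing, which is few ",
"2) Cabin has 295 values; 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is high ",
"This points the way for the next step, data cleaning: only by knowing which columns have missing values can we handle them in a targeted way. ",
"'''"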
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
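{
"cell_type": "markdown",
"source": [
"# 3. Data Cleaning (Data Preparation)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 3.1 Data Preprocessing"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Handling Missing Values"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"In the data-understanding stage above, we found that the data has 1309 rows in total. ",
"Numeric columns: Age and Fare have missing values. ",
"String columns: Embarked and Cabin have missing values. ",
" ",
"This points the way for the next step, data cleaning: only by knowing which columns have missing values can we handle them in a targeted way. ",
" ",
"Many machine learning algorithms require that the input features contain no null values in order to train a model. ",
" ",
" ",
"1. For numeric data, replace missing values with the mean ",
"2. For categorical data, replace missing values with the most frequent category ",
"3. Predict the missing values with a model, for example K-NN"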
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 783,
"outputs": [
{
"name": "stdout",
"text": [
"Before: ",
"<class 'pandas.core.frame.DataFrame'> ",
"RangeIndex: 1309 entries, 0 to 1308 ",
"Data columns (total 12 columns): ",
"Age 1046 non-null float64 ",
"Cabin 295 non-null object ",
"Embarked 1307 non-null object ",
"Fare 1308 non-null float64 ",
"Name 1309 non-null object ",
"Parch 1309 non-null int64 ",
"PassengerId 1309 non-null int64 ",
"Pclass 1309 non-null int64 ",
"Sex 1309 non-null object ",
"SibSp 1309 non-null int64 ",
"Survived 891 non-null float64 ",
"Ticket 1309 non-null object ",
"dtypes: float64(3), int64(4), object(5) ",
"memory usage: 122.8+ KB ",
"After: ",
"<class 'pandas.core.frame.DataFrame'> ",
"RangeIndex: 1309 entries, 0 to 1308 ",
"Data columns (total 12 columns): ",
"Age 1309 non-null float64 ",
"Cabin 295 non-null object ",
"Embarked 1307 non-null object ",
"Fare 1309 non-null float64 ",
"Name 1309 non-null object ",
"Parch 1309 non-null int64 ",
"PassengerId 1309 non-null int64 ",
"Pclass 1309 non-null int64 ",
"Sex 1309 non-null object ",
"SibSp 1309 non-null int64 ",
"Survived 891 non-null float64 ",
"Ticket 1309 non-null object ",
"dtypes: float64(3), int64(4), object(5) ",
"memory usage: 122.8+ KB "
],
"output_type": "stream"
}
],
"source": [
"''' ",
"The data has 1309 rows in total. ",
"Numeric columns: Age and Fare have missing values: ",
"1) Age has 1046 values; 1309-1046=263 are missing, a missing rate of 263/1309=20% ",
"2) Fare has 1308 values; 1 is missing ",
" ",
"For numeric columns, the simplest way to handle missing values is to fill them with the mean ",
"''' ",
"print('Before:') ",
"full.info() ",
"# Age ",
"full['Age'] = full['Age'].fillna(full['Age'].mean()) ",
"# Fare ",
"full['Fare'] = full['Fare'].fillna(full['Fare'].mean()) ",
"print('After:') ",
"full.info()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 784,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\ 0 22.0 NaN S 7.2500 1 38.0 C85 C 71.2833 2 26.0 NaN S 7.9250 3 35.0 C123 S 53.1000 4 35.0 NaN S 8.0500 Name Parch PassengerId \\ 0 Braund, Mr. Owen Harris 0 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 2 Heikkinen, Miss. Laina 0 3 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 4 Allen, Mr. William Henry 0 5 Pclass Sex SibSp Survived Ticket 0 3 male 1 0.0 A/5 21171 1 1 female 1 1.0 PC 17599 2 3 female 0 1.0 STON/O2. 3101282 3 1 female 1 1.0 113803 4 3 male 0 0.0 373450 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Cabin</th> <th>Embarked</th> <th>Fare</th> <th>Name</th> <th>Parch</th> <th>PassengerId</th> <th>Pclass</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Ticket</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>22.0</td> <td>NaN</td> <td>S</td> <td>7.2500</td> <td>Braund, Mr. Owen Harris</td> <td>0</td> <td>1</td> <td>3</td> <td>male</td> <td>1</td> <td>0.0</td> <td>A/5 21171</td> </tr> <tr> <td>1</td> <td>38.0</td> <td>C85</td> <td>C</td> <td>71.2833</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>0</td> <td>2</td> <td>1</td> <td>female</td> <td>1</td> <td>1.0</td> <td>PC 17599</td> </tr> <tr> <td>2</td> <td>26.0</td> <td>NaN</td> <td>S</td> <td>7.9250</td> <td>Heikkinen, Miss. Laina</td> <td>0</td> <td>3</td> <td>3</td> <td>female</td> <td>0</td> <td>1.0</td> <td>STON/O2. 3101282</td> </tr> <tr> <td>3</td> <td>35.0</td> <td>C123</td> <td>S</td> <td>53.1000</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>0</td> <td>4</td> <td>1</td> <td>female</td> <td>1</td> <td>1.0</td> <td>113803</td> </tr> <tr> <td>4</td> <td>35.0</td> <td>NaN</td> <td>S</td> <td>8.0500</td> <td>Allen, Mr. William Henry</td> <td>0</td> <td>5</td> <td>3</td> <td>male</td> <td>0</td> <td>0.0</td> <td>373450</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 784
}
],
"source": [
"# Check that the data was processed correctly ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 785,
"outputs": [
{
"data": {
"text/plain": "0 S 1 C 2 S 3 S 4 S Name: Embarked, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 785
}
],
"source": [
"''' ",
"Total rows: 1309 ",
"String columns: ",
"1) Embarked has 1307 values; only 2 are missing, which is few ",
"2) Cabin has 295 values; 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is high ",
"''' ",
"# Embarked: take a look at the values ",
"''' ",
"Departure port: S = Southampton, England ",
"Port of call 1: C = Cherbourg, France ",
"Port of call 2: Q = Queenstown, Ireland ",
"''' ",
"full['Embarked'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 786,
"outputs": [
{
"data": {
"text/plain": "S 914 C 270 Q 123 Name: Embarked, dtype: int64"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 786
}
],
"source": [
"''' ",
"Embarked is a categorical variable: find its most frequent category and fill the missing values with it ",
"''' ",
"full['Embarked'].value_counts()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 787,
"outputs": [],
"source": [
"''' ",
"The counts show that S is the most common category, so we fill the missing values with it: ",
"S = Southampton, England ",
"''' ",
"full['Embarked'] = full['Embarked'].fillna('S')"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 788,
"outputs": [
{
"data": {
"text/plain": "0 NaN 1 C85 2 NaN 3 C123 4 NaN Name: Cabin, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 788
}
],
"source": [
"# Cabin: take a look at the values ",
"full['Cabin'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 789,
"outputs": [],
"source": [
"# Cabin has many missing values, so fill them with U, meaning Unknown ",
"full['Cabin'] = full['Cabin'].fillna('U')"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 790,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\ 0 22.0 U S 7.2500 1 38.0 C85 C 71.2833 2 26.0 U S 7.9250 3 35.0 C123 S 53.1000 4 35.0 U S 8.0500 Name Parch PassengerId \\ 0 Braund, Mr. Owen Harris 0 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 2 Heikkinen, Miss. Laina 0 3 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 4 Allen, Mr. William Henry 0 5 Pclass Sex SibSp Survived Ticket 0 3 male 1 0.0 A/5 21171 1 1 female 1 1.0 PC 17599 2 3 female 0 1.0 STON/O2. 3101282 3 1 female 1 1.0 113803 4 3 male 0 0.0 373450 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Cabin</th> <th>Embarked</th> <th>Fare</th> <th>Name</th> <th>Parch</th> <th>PassengerId</th> <th>Pclass</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Ticket</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>22.0</td> <td>U</td> <td>S</td> <td>7.2500</td> <td>Braund, Mr. Owen Harris</td> <td>0</td> <td>1</td> <td>3</td> <td>male</td> <td>1</td> <td>0.0</td> <td>A/5 21171</td> </tr> <tr> <td>1</td> <td>38.0</td> <td>C85</td> <td>C</td> <td>71.2833</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>0</td> <td>2</td> <td>1</td> <td>female</td> <td>1</td> <td>1.0</td> <td>PC 17599</td> </tr> <tr> <td>2</td> <td>26.0</td> <td>U</td> <td>S</td> <td>7.9250</td> <td>Heikkinen, Miss. Laina</td> <td>0</td> <td>3</td> <td>3</td> <td>female</td> <td>0</td> <td>1.0</td> <td>STON/O2. 3101282</td> </tr> <tr> <td>3</td> <td>35.0</td> <td>C123</td> <td>S</td> <td>53.1000</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>0</td> <td>4</td> <td>1</td> <td>female</td> <td>1</td> <td>1.0</td> <td>113803</td> </tr> <tr> <td>4</td> <td>35.0</td> <td>U</td> <td>S</td> <td>8.0500</td> <td>Allen, Mr. William Henry</td> <td>0</td> <td>5</td> <td>3</td> <td>male</td> <td>0</td> <td>0.0</td> <td>373450</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 790
}
],
"source": [
"# Check that the data was processed correctly ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 791,
"outputs": [
{
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'> ",
"RangeIndex: 1309 entries, 0 to 1308 ",
"Data columns (total 12 columns): ",
"Age 1309 non-null float64 ",
"Cabin 1309 non-null object ",
"Embarked 1309 non-null object ",
"Fare 1309 non-null float64 ",
"Name 1309 non-null object ",
"Parch 1309 non-null int64 ",
"PassengerId 1309 non-null int64 ",
"Pclass 1309 non-null int64 ",
"Sex 1309 non-null object ",
"SibSp 1309 non-null int64 ",
"Survived 891 non-null float64 ",
"Ticket 1309 non-null object ",
"dtypes: float64(3), int64(4), object(5) ",
"memory usage: 122.8+ KB "
],
"output_type": "stream"
}
],
"source": [
"# Check the final result of missing-value handling. Note that Survived is our label, the value the model will predict, so this column needs no handling ",
"full.info()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 3.2 Feature Extraction"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### 3.2.1 Data Types"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Inspect the data types, which fall into three kinds. Then process the categorical data: replace categories with numeric values and apply one-hot encoding."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 792,
"outputs": [
{
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'> ",
"RangeIndex: 1309 entries, 0 to 1308 ",
"Data columns (total 12 columns): ",
"Age 1309 non-null float64 ",
"Cabin 1309 non-null object ",
"Embarked 1309 non-null object ",
"Fare 1309 non-null float64 ",
"Name 1309 non-null object ",
"Parch 1309 non-null int64 ",
"PassengerId 1309 non-null int64 ",
"Pclass 1309 non-null int64 ",
"Sex 1309 non-null object ",
"SibSp 1309 non-null int64 ",
"Survived 891 non-null float64 ",
"Ticket 1309 non-null object ",
"dtypes: float64(3), int64(4), object(5) ",
"memory usage: 122.8+ KB "
],
"output_type": "stream"
}
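],
"source": [
"''' ",
"1. Numeric data: ",
"PassengerId, Age, Fare, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch) ",
"2. Time series: none ",
"3. Categorical data: ",
"1) Columns with explicit categories ",
"Sex: male, female ",
"Embarked: departure port S = Southampton, England; port of call 1: C = Cherbourg, France; port of call 2: Q = Queenstown, Ireland ",
"Pclass: 1 = first class, 2 = second class, 3 = third class ",
"2) String columns: features may be extracted from these, so they are also treated as categorical data ",
"Name ",
"Cabin ",
"Ticket ",
"''' ",
"full.info()"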
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### 3.2.2 Categorical Data: Explicit Categories"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"1. Sex: ",
"male, female ",
"2. Embarked: departure port S = Southampton, England; port of call 1: C = Cherbourg, France; port of call 2: Q = Queenstown, Ireland ",
"3. Pclass: 1 = first class, 2 = second class, 3 = third class"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Sex"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 793,
"outputs": [
{
"data": {
"text/plain": "0 male 1 female 2 female 3 female 4 male Name: Sex, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 793
}
],
"source": [
"# Look at the Sex column ",
"full['Sex'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 794,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\ 0 22.0 U S 7.2500 1 38.0 C85 C 71.2833 2 26.0 U S 7.9250 3 35.0 C123 S 53.1000 4 35.0 U S 8.0500 Name Parch PassengerId \\ 0 Braund, Mr. Owen Harris 0 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 2 Heikkinen, Miss. Laina 0 3 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 4 Allen, Mr. William Henry 0 5 Pclass Sex SibSp Survived Ticket 0 3 1 1 0.0 A/5 21171 1 1 0 1 1.0 PC 17599 2 3 0 0 1.0 STON/O2. 3101282 3 1 0 1 1.0 113803 4 3 1 0 0.0 373450 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Cabin</th> <th>Embarked</th> <th>Fare</th> <th>Name</th> <th>Parch</th> <th>PassengerId</th> <th>Pclass</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Ticket</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>22.0</td> <td>U</td> <td>S</td> <td>7.2500</td> <td>Braund, Mr. Owen Harris</td> <td>0</td> <td>1</td> <td>3</td> <td>1</td> <td>1</td> <td>0.0</td> <td>A/5 21171</td> </tr> <tr> <td>1</td> <td>38.0</td> <td>C85</td> <td>C</td> <td>71.2833</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>0</td> <td>2</td> <td>1</td> <td>0</td> <td>1</td> <td>1.0</td> <td>PC 17599</td> </tr> <tr> <td>2</td> <td>26.0</td> <td>U</td> <td>S</td> <td>7.9250</td> <td>Heikkinen, Miss. Laina</td> <td>0</td> <td>3</td> <td>3</td> <td>0</td> <td>0</td> <td>1.0</td> <td>STON/O2. 3101282</td> </tr> <tr> <td>3</td> <td>35.0</td> <td>C123</td> <td>S</td> <td>53.1000</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>0</td> <td>4</td> <td>1</td> <td>0</td> <td>1</td> <td>1.0</td> <td>113803</td> </tr> <tr> <td>4</td> <td>35.0</td> <td>U</td> <td>S</td> <td>8.0500</td> <td>Allen, Mr. William Henry</td> <td>0</td> <td>5</td> <td>3</td> <td>1</td> <td>0</td> <td>0.0</td> <td>373450</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 794
}
],
"source": [
"''' ",
"Map the Sex values to numbers: ",
"male maps to 1, female maps to 0 ",
"''' ",
"sex_mapDict = {'male': 1, ",
" 'female': 0} ",
"# Series.map applies the mapping (here a dict) to every value in the Series ",
"full['Sex'] = full['Sex'].map(sex_mapDict) ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"#### Port of Embarkation (Embarked)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 795,
"outputs": [
{
"data": {
"text/plain": "0 S 1 C 2 S 3 S 4 S Name: Embarked, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 795
}
],
"source": [
"''' ",
"The values of Embarked are: ",
"Departure port: S = Southampton, England ",
"Port of call 1: C = Cherbourg, France ",
"Port of call 2: Q = Queenstown, Ireland ",
"''' ",
"# Look at the values in this column ",
"full['Embarked'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 796,
"outputs": [
{
"data": {
"text/plain": " Embarked_C Embarked_Q Embarked_S 0 0 0 1 1 1 0 0 2 0 0 1 3 0 0 1 4 0 0 1",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Embarked_C</th> <th>Embarked_Q</th> <th>Embarked_S</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>1</td> <td>1</td> <td>0</td> <td>0</td> </tr> <tr> <td>2</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>3</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>4</td> <td>0</td> <td>0</td> <td>1</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 796
}
],
"source": [
"# DataFrame to hold the extracted features ",
"embarkedDf = pd.DataFrame() ",
" ",
"''' ",
"Use get_dummies for one-hot encoding, producing dummy variables whose column names have the prefix Embarked ",
"''' ",
"embarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked') ",
"embarkedDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 797,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Fare Name \\ 0 22.0 U 7.2500 Braund, Mr. Owen Harris 1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 26.0 U 7.9250 Heikkinen, Miss. Laina 3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 35.0 U 8.0500 Allen, Mr. William Henry Parch PassengerId Pclass Sex SibSp Survived Ticket \\ 0 0 1 3 1 1 0.0 A/5 21171 1 0 2 1 0 1 1.0 PC 17599 2 0 3 3 0 0 1.0 STON/O2. 3101282 3 0 4 1 0 1 1.0 113803 4 0 5 3 1 0 0.0 373450 Embarked_C Embarked_Q Embarked_S 0 0 0 1 1 1 0 0 2 0 0 1 3 0 0 1 4 0 0 1 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Cabin</th> <th>Fare</th> <th>Name</th> <th>Parch</th> <th>PassengerId</th> <th>Pclass</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Ticket</th> <th>Embarked_C</th> <th>Embarked_Q</th> <th>Embarked_S</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>22.0</td> <td>U</td> <td>7.2500</td> <td>Braund, Mr. Owen Harris</td> <td>0</td> <td>1</td> <td>3</td> <td>1</td> <td>1</td> <td>0.0</td> <td>A/5 21171</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>1</td> <td>38.0</td> <td>C85</td> <td>71.2833</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>0</td> <td>2</td> <td>1</td> <td>0</td> <td>1</td> <td>1.0</td> <td>PC 17599</td> <td>1</td> <td>0</td> <td>0</td> </tr> <tr> <td>2</td> <td>26.0</td> <td>U</td> <td>7.9250</td> <td>Heikkinen, Miss. Laina</td> <td>0</td> <td>3</td> <td>3</td> <td>0</td> <td>0</td> <td>1.0</td> <td>STON/O2. 3101282</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>3</td> <td>35.0</td> <td>C123</td> <td>53.1000</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>0</td> <td>4</td> <td>1</td> <td>0</td> <td>1</td> <td>1.0</td> <td>113803</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>4</td> <td>35.0</td> <td>U</td> <td>8.0500</td> <td>Allen, Mr. William Henry</td> <td>0</td> <td>5</td> <td>3</td> <td>1</td> <td>0</td> <td>0.0</td> <td>373450</td> <td>0</td> <td>0</td> <td>1</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 797
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset full ",
"full = pd.concat([full, embarkedDf], axis=1) ",
" ",
"''' ",
"Embarked has already been one-hot encoded into its dummy variables, ",
"so the original Embarked column is dropped here ",
"''' ",
"full.drop('Embarked', axis=1, inplace=True) ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"#### Passenger Class (Pclass)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 798,
"outputs": [
{
"data": {
"text/plain": " Pclass_1 Pclass_2 Pclass_3 0 0 0 1 1 1 0 0 2 0 0 1 3 1 0 0 4 0 0 1",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Pclass_1</th> <th>Pclass_2</th> <th>Pclass_3</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>1</td> <td>1</td> <td>0</td> <td>0</td> </tr> <tr> <td>2</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>3</td> <td>1</td> <td>0</td> <td>0</td> </tr> <tr> <td>4</td> <td>0</td> <td>0</td> <td>1</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 798
}
],
"source": [
"''' ",
"Passenger class (Pclass): ",
"1 = first class, 2 = second class, 3 = third class ",
"''' ",
"# DataFrame to hold the extracted features ",
"pclassDf = pd.DataFrame() ",
" ",
"# Use get_dummies for one-hot encoding; the column-name prefix is Pclass ",
"pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass') ",
"pclassDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 799,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Fare Name \\ 0 22.0 U 7.2500 Braund, Mr. Owen Harris 1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 26.0 U 7.9250 Heikkinen, Miss. Laina 3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 35.0 U 8.0500 Allen, Mr. William Henry Parch PassengerId Sex SibSp Survived Ticket Embarked_C \\ 0 0 1 1 1 0.0 A/5 21171 0 1 0 2 0 1 1.0 PC 17599 1 2 0 3 0 0 1.0 STON/O2. 3101282 0 3 0 4 0 1 1.0 113803 0 4 0 5 1 0 0.0 373450 0 Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3 0 0 1 0 0 1 1 0 0 1 0 0 2 0 1 0 0 1 3 0 1 1 0 0 4 0 1 0 0 1 ",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Cabin</th> <th>Fare</th> <th>Name</th> <th>Parch</th> <th>PassengerId</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Ticket</th> <th>Embarked_C</th> <th>Embarked_Q</th> <th>Embarked_S</th> <th>Pclass_1</th> <th>Pclass_2</th> <th>Pclass_3</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>22.0</td> <td>U</td> <td>7.2500</td> <td>Braund, Mr. Owen Harris</td> <td>0</td> <td>1</td> <td>1</td> <td>1</td> <td>0.0</td> <td>A/5 21171</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>1</td> <td>38.0</td> <td>C85</td> <td>71.2833</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>0</td> <td>2</td> <td>0</td> <td>1</td> <td>1.0</td> <td>PC 17599</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> </tr> <tr> <td>2</td> <td>26.0</td> <td>U</td> <td>7.9250</td> <td>Heikkinen, Miss. Laina</td> <td>0</td> <td>3</td> <td>0</td> <td>0</td> <td>1.0</td> <td>STON/O2. 3101282</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> </tr> <tr> <td>3</td> <td>35.0</td> <td>C123</td> <td>53.1000</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>0</td> <td>4</td> <td>0</td> <td>1</td> <td>1.0</td> <td>113803</td> <td>0</td> <td>0</td> <td>1</td> <td>1</td> <td>0</td> <td>0</td> </tr> <tr> <td>4</td> <td>35.0</td> <td>U</td> <td>8.0500</td> <td>Allen, Mr. William Henry</td> <td>0</td> <td>5</td> <td>1</td> <td>0</td> <td>0.0</td> <td>373450</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> </tr> </tbody> </table> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 799
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset full ",
"full = pd.concat([full,pclassDf],axis=1) ",
" ",
"# Drop the passenger class (Pclass) column ",
"full.drop('Pclass',axis=1,inplace=True) ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### 3.2.1 Categorical data: string type"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
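"String type: features can often be extracted from these columns, so they are also treated as categorical data. The string columns here are: ",
" ",
"1. Passenger name (Name) ",
"2. Cabin number (Cabin) ",
"3. Ticket number (Ticket)"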
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Extracting the title from the name"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 800,
"outputs": [
{
"data": {
"text/plain": "0 Braund, Mr. Owen Harris 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 Heikkinen, Miss. Laina 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 Allen, Mr. William Henry Name: Name, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 800
}
],
"source": [
"''' ",
"Take a look at the Name column. ",
"Passenger names have one very noticeable feature: ",
"every name contains a specific form of address, or title. Once extracted, the title becomes a useful new variable that can help with prediction. ",
"For example: ",
"Braund, Mr. Owen Harris ",
"Heikkinen, Miss. Laina ",
"Oliva y Ocana, Dona. Fermina ",
"Peter, Master. Michael J ",
"''' ",
"full[ 'Name' ].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 801,
"outputs": [],
"source": [
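"# Practice extracting the title (e.g. Mr) from a string ",
"# split divides a string and returns a list ",
"# In 'Braund, Mr. Owen Harris' the surname comes before the comma and 'title. given names' comes after it ",
"name1='Braund, Mr. Owen Harris' ",
"''' ",
"split divides a string on a separator and returns a list. Here we split on the comma, ",
"so 'Braund, Mr. Owen Harris' is split on ',' into two parts: ['Braund', ' Mr. Owen Harris'] ",
"You can print the returned list to check. We take the element at index 1, the part that contains the title: ' Mr. Owen Harris' ",
"''' ",
"# ' Mr. Owen Harris' (note the leading space) ",
"str1=name1.split( ',' )[1] ",
"''' ",
"Next, split ' Mr. Owen Harris' on '.', which gives the list [' Mr', ' Owen Harris'] ",
"We take the element at index 0, the part that holds the title: ' Mr' ",
"''' ",
"# ' Mr' (the period is gone, the leading space remains) ",
"str2=str1.split( '.' )[0] ",
"# strip() removes the given characters from both ends of a string (whitespace by default) ",
"str3=str2.strip()"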
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 802,
"outputs": [],
"source": [
"''' ",
"Define a function that extracts the title from a name ",
"''' ",
"def getTitle(name): ",
"    str1=name.split( ',' )[1] # ' Mr. Owen Harris' ",
"    str2=str1.split( '.' )[0] # ' Mr' ",
"    # strip() removes the given characters from both ends of a string (whitespace by default) ",
"    str3=str2.strip() ",
"    return str3"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
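The same extraction can also be written without a Python-level function, using the pandas string accessor. The sketch below is an alternative to `getTitle`, not the notebook's own method; the sample names and the regular expression are illustrative assumptions about the "Surname, Title. Given names" layout.

```python
import pandas as pd

# Illustrative sample names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the first comma and the next period,
# then strip any surrounding whitespace
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Multi-word titles such as "the Countess" are kept whole, which matters because the mapping dictionary used later contains that exact key.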
{
"cell_type": "code",
"execution_count": 803,
"outputs": [
{
"data": {
"text/plain": " Title 0 Mr 1 Mrs 2 Miss 3 Mrs 4 Mr"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 803
}
],
"source": [
"# Holds the extracted feature ",
"titleDf = pd.DataFrame() ",
"# Series.map applies the given function to every value in the Series ",
"titleDf['Title'] = full['Name'].map(getTitle) ",
"titleDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 804,
"outputs": [
{
"data": {
"text/plain": " Master Miss Mr Mrs Officer Royalty 0 0 0 1 0 0 0 1 0 0 0 1 0 0 2 0 1 0 0 0 0 3 0 0 0 1 0 0 4 0 0 1 0 0 0"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 804
}
],
"source": [
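"''' ",
"Define the following title categories: ",
"Officer: military officers and professionals (Capt, Col, Major, Dr, Rev) ",
"Royalty: royalty and nobility ",
"Mr: adult men ",
"Mrs: married women ",
"Miss: unmarried women and girls ",
"Master: boys ",
"''' ",
"# Mapping from the title string found in a name to the title category defined above ",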
"title_mapDict = { ",
" "Capt": "Officer", ",
" "Col": "Officer", ",
" "Major": "Officer", ",
" "Jonkheer": "Royalty", ",
" "Don": "Royalty", ",
" "Sir" : "Royalty", ",
" "Dr": "Officer", ",
" "Rev": "Officer", ",
" "the Countess":"Royalty", ",
" "Dona": "Royalty", ",
" "Mme": "Mrs", ",
" "Mlle": "Miss", ",
" "Ms": "Mrs", ",
" "Mr" : "Mr", ",
" "Mrs" : "Mrs", ",
" "Miss" : "Miss", ",
" "Master" : "Master", ",
" "Lady" : "Royalty" ",
" } ",
" ",
"# Series.map also accepts a dict: each value is looked up in the dict, and values with no matching key become NaN ",
"titleDf['Title'] = titleDf['Title'].map(title_mapDict) ",
" ",
"# One-hot encode with get_dummies ",
"titleDf = pd.get_dummies(titleDf['Title']) ",
"titleDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
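One thing to watch with a dictionary mapping: a title that is missing from the dictionary silently becomes NaN, and `get_dummies` then gives that row all zeros. The sketch below shows one way to surface such titles and collect them in a catch-all category. The shortened mapping, the made-up title "Mx", and the "Other" category are assumptions for illustration; they are not part of the notebook.

```python
import pandas as pd

# Shortened, illustrative mapping (the notebook's title_mapDict is longer)
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Capt": "Officer"}

# "Mx" is a made-up title that is deliberately absent from the mapping
raw = pd.Series(["Mr", "Capt", "Mx"])

mapped = raw.map(title_map)

# List the raw titles that the mapping did not cover
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)

# Put uncovered titles in a hypothetical "Other" category before encoding
dummies = pd.get_dummies(mapped.fillna("Other"))
print(sorted(dummies.columns))
```

If `unmapped` comes back empty for the real data, the mapping covers every title and the extra category is unnecessary.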
{
"cell_type": "code",
"execution_count": 805,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Fare Parch PassengerId Sex SibSp Survived \\ 0 22.0 U 7.2500 0 1 1 1 0.0 1 38.0 C85 71.2833 0 2 0 1 1.0 2 26.0 U 7.9250 0 3 0 0 1.0 3 35.0 C123 53.1000 0 4 0 1 1.0 4 35.0 U 8.0500 0 5 1 0 0.0 Ticket Embarked_C ... Embarked_S Pclass_1 Pclass_2 \\ 0 A/5 21171 0 ... 1 0 0 1 PC 17599 1 ... 0 1 0 2 STON/O2. 3101282 0 ... 1 0 0 3 113803 0 ... 1 1 0 4 373450 0 ... 1 0 0 Pclass_3 Master Miss Mr Mrs Officer Royalty 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 2 1 0 1 0 0 0 0 3 0 0 0 0 1 0 0 4 1 0 0 1 0 0 0 [5 rows x 21 columns]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 805
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset full ",
"full = pd.concat([full,titleDf],axis=1) ",
" ",
"# Drop the Name column ",
"full.drop('Name',axis=1,inplace=True) ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### Extracting the cabin category from the cabin number"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 806,
"outputs": [
{
"name": "stdout",
"text": [
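"Sum of the two values: 30 "
],
"output_type": "stream"
}
],
"source": [
"# Background: anonymous functions ",
"''' ",
"Python uses lambda to create anonymous functions. ",
"Anonymous means the function is not defined with the standard def statement. The syntax is: ",
"lambda arg1, arg2: expression ",
"''' ",
"# Define an anonymous function that adds two numbers ",
"# (bound to the name add rather than sum, so the built-in sum() is not shadowed) ",
"add = lambda a,b: a + b ",
" ",
"# Call the function ",
"print('Sum of the two values:', add(10,20))"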
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 807,
"outputs": [
{
"data": {
"text/plain": "0 U 1 C85 2 U 3 C123 4 U Name: Cabin, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 807
}
],
"source": [
"''' ",
"The first letter of the cabin number is the cabin category ",
"''' ",
"# Look at the contents of the Cabin column ",
"full['Cabin'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 808,
"outputs": [
{
"data": {
"text/plain": " Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T \\ 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 4 0 0 0 0 0 0 0 0 Cabin_U 0 1 1 0 2 1 3 0 4 1 "
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 808
}
],
"source": [
"# Holds the cabin features ",
"cabinDf = pd.DataFrame() ",
" ",
"''' ",
"The category of a cabin number is its first letter, for example: ",
"C85 maps to the category C ",
"''' ",
"full[ 'Cabin' ] = full[ 'Cabin' ].map( lambda c : c[0] ) ",
" ",
"# One-hot encode with get_dummies; the column-name prefix is Cabin ",
"cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' ) ",
" ",
"cabinDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
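The lambda `c[0]` relies on every Cabin value being a string, which holds here because the output above shows missing cabins already filled with `U`. As a sketch of the same step written with the pandas string accessor, with the fill made explicit (the sample values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Illustrative cabin values; None stands for a missing cabin number
cabin = pd.Series(["C85", None, "C123", "E46"])

# Fill missing values with "U" (unknown), then keep only the first letter
deck = cabin.fillna("U").str[0]

# One-hot encode with the Cabin_ prefix, as in the cell above
cabin_dummies = pd.get_dummies(deck, prefix="Cabin")
print(list(cabin_dummies.columns))
```

Applying `c[0]` directly to a NaN value would raise a `TypeError`, so the order of the fill and the slice matters.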
{
"cell_type": "code",
"execution_count": 809,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Sex SibSp Survived Ticket \\ 0 22.0 7.2500 0 1 1 1 0.0 A/5 21171 1 38.0 71.2833 0 2 0 1 1.0 PC 17599 2 26.0 7.9250 0 3 0 0 1.0 STON/O2. 3101282 3 35.0 53.1000 0 4 0 1 1.0 113803 4 35.0 8.0500 0 5 1 0 0.0 373450 Embarked_C Embarked_Q ... Royalty Cabin_A Cabin_B Cabin_C Cabin_D \\ 0 0 0 ... 0 0 0 0 0 1 1 0 ... 0 0 0 1 0 2 0 0 ... 0 0 0 0 0 3 0 0 ... 0 0 0 1 0 4 0 0 ... 0 0 0 0 0 Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U 0 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 1 3 0 0 0 0 0 4 0 0 0 0 1 [5 rows x 29 columns]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 809
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset full ",
"full = pd.concat([full,cabinDf],axis=1) ",
" ",
"# Drop the Cabin column ",
"full.drop('Cabin',axis=1,inplace=True) ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### Building family size and family category"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 810,
"outputs": [
{
"data": {
"text/plain": " FamilySize Family_Single Family_Small Family_Large 0 2 0 1 0 1 2 0 1 0 2 1 1 0 0 3 2 0 1 0 4 1 1 0 0"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 810
}
],
"source": [
"# Holds the family features ",
"familyDf = pd.DataFrame() ",
" ",
"''' ",
"Family size = parents and children aboard (Parch) + siblings and spouses aboard (SibSp) + the passenger ",
"(the passenger is also a member of the family, hence the + 1) ",
"''' ",
"familyDf[ 'FamilySize' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1 ",
" ",
"''' ",
"Family categories: ",
"Family_Single (travelling alone): family size = 1 ",
"Family_Small (small family): 2 <= family size <= 4 ",
"Family_Large (large family): family size >= 5 ",
"''' ",
"# A conditional expression returns the value before `if` when the condition is true, otherwise the value after `else` (0 here) ",
"familyDf[ 'Family_Single' ] = familyDf[ 'FamilySize' ].map( lambda s : 1 if s == 1 else 0 ) ",
"familyDf[ 'Family_Small' ] = familyDf[ 'FamilySize' ].map( lambda s : 1 if 2 <= s <= 4 else 0 ) ",
"familyDf[ 'Family_Large' ] = familyDf[ 'FamilySize' ].map( lambda s : 1 if 5 <= s else 0 ) ",
" ",
"familyDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
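The three lambdas above amount to binning family size into the ranges 1, 2 to 4, and 5 or more. As a sketch of an alternative, assuming the same thresholds, `pd.cut` can do the binning in one call and `get_dummies` can produce the indicator columns. The sample sizes are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative family sizes
family_size = pd.Series([1, 2, 4, 5, 11])

# Right-closed bins: (0, 1] -> Single, (1, 4] -> Small, (4, inf) -> Large
category = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Single", "Small", "Large"],
)

# One indicator column per category, named with the Family_ prefix
family_dummies = pd.get_dummies(category, prefix="Family")
print(category.astype(str).tolist())
```

Keeping the thresholds in a single `bins` list makes it harder for the three categories to overlap or leave a gap if the boundaries are changed later.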
{
"cell_type": "code",
"execution_count": 811,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Sex SibSp Survived Ticket \\ 0 22.0 7.2500 0 1 1 1 0.0 A/5 21171 1 38.0 71.2833 0 2 0 1 1.0 PC 17599 2 26.0 7.9250 0 3 0 0 1.0 STON/O2. 3101282 3 35.0 53.1000 0 4 0 1 1.0 113803 4 35.0 8.0500 0 5 1 0 0.0 373450 Embarked_C Embarked_Q ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T \\ 0 0 0 ... 0 0 0 0 0 1 1 0 ... 0 0 0 0 0 2 0 0 ... 0 0 0 0 0 3 0 0 ... 0 0 0 0 0 4 0 0 ... 0 0 0 0 0 Cabin_U FamilySize Family_Single Family_Small Family_Large 0 1 2 0 1 0 1 0 2 0 1 0 2 1 1 1 0 0 3 0 2 0 1 0 4 1 1 1 0 0 [5 rows x 33 columns]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 811
}
],
"source": [
"# Add the family features (family size and the family-category dummy variables) to the Titanic dataset full ",
"full = pd.concat([full,familyDf],axis=1) ",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 812,
"outputs": [
{
"data": {
"text/plain": "(1309, 33)"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 812
}
],
"source": [
"# Check how many rows and columns (features) full has at this point ",
"full.shape"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 3.3 Feature selection"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
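"You can come back to feature selection methods after the later lessons. If you already know several machine learning algorithms and want to read ahead, these resources may help: ",
" ",
"* [How to do feature engineering?](http://www.csuldw.com/2015/10/24/2015-10-24%20feature%20engineering/) ",
"* [How to do feature engineering with sklearn?](http://www.cnblogs.com/jasonfreak/p/5448385.html) ",
" ",
"* [How to select features for the Titanic dataset?](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html)"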
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Correlation coefficient method: compute the correlation coefficients between the features"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 813,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Sex SibSp \\ Age 1.000000 0.171521 -0.130872 0.025731 0.057397 -0.190747 Fare 0.171521 1.000000 0.221522 0.031416 -0.185484 0.160224 Parch -0.130872 0.221522 1.000000 0.008942 -0.213125 0.373587 PassengerId 0.025731 0.031416 0.008942 1.000000 0.013406 -0.055224 Sex 0.057397 -0.185484 -0.213125 0.013406 1.000000 -0.109609 SibSp -0.190747 0.160224 0.373587 -0.055224 -0.109609 1.000000 Survived -0.070323 0.257307 0.081629 -0.005007 -0.543351 -0.035322 Embarked_C 0.076179 0.286241 -0.008635 0.048101 -0.066564 -0.048396 Embarked_Q -0.012718 -0.130054 -0.100943 0.011585 -0.088651 -0.048678 Embarked_S -0.059153 -0.169894 0.071881 -0.049836 0.115193 0.073709 Pclass_1 0.362587 0.599956 -0.013033 0.026495 -0.107371 -0.034256 Pclass_2 -0.014193 -0.121372 -0.010057 0.022714 -0.028862 -0.052419 Pclass_3 -0.302093 -0.419616 0.019521 -0.041544 0.116562 0.072610 Master -0.363923 0.011596 0.253482 0.002254 0.164375 0.329171 Miss -0.254146 0.092051 0.066473 -0.050027 -0.672819 0.077564 Mr 0.165476 -0.192192 -0.304780 0.014116 0.870678 -0.243104 Mrs 0.198091 0.139235 0.213491 0.033299 -0.571176 0.061643 Officer 0.162818 0.028696 -0.032631 0.002231 0.087288 -0.013813 Royalty 0.059466 0.026214 -0.030197 0.004400 -0.020408 -0.010787 Cabin_A 0.125177 0.020094 -0.030707 -0.002831 0.047561 -0.039808 Cabin_B 0.113458 0.393743 0.073051 0.015895 -0.094453 -0.011569 Cabin_C 0.167993 0.401370 0.009601 0.006092 -0.077473 0.048616 Cabin_D 0.132886 0.072737 -0.027385 0.000549 -0.057396 -0.015727 Cabin_E 0.106600 0.073949 0.001084 -0.008136 -0.040340 -0.027180 Cabin_F -0.072644 -0.037567 0.020481 0.000306 -0.006655 -0.008619 Cabin_G -0.085977 -0.022857 0.058325 -0.045949 -0.083285 0.006015 Cabin_T 0.032461 0.001179 -0.012304 -0.023049 0.020558 -0.013247 Cabin_U -0.271918 -0.507197 -0.036806 0.000208 0.137396 0.009064 FamilySize -0.196996 0.226465 0.792296 -0.031437 -0.188583 0.861952 Family_Single 0.116675 -0.274826 -0.549022 0.028546 0.284537 -0.591077 
Family_Small -0.038189 0.197281 0.248532 0.002975 -0.255196 0.253590 Family_Large -0.161210 0.170853 0.624627 -0.063415 -0.077748 0.699681 Survived Embarked_C Embarked_Q Embarked_S ... Cabin_D \\ Age -0.070323 0.076179 -0.012718 -0.059153 ... 0.132886 Fare 0.257307 0.286241 -0.130054 -0.169894 ... 0.072737 Parch 0.081629 -0.008635 -0.100943 0.071881 ... -0.027385 PassengerId -0.005007 0.048101 0.011585 -0.049836 ... 0.000549 Sex -0.543351 -0.066564 -0.088651 0.115193 ... -0.057396 SibSp -0.035322 -0.048396 -0.048678 0.073709 ... -0.015727 Survived 1.000000 0.168240 0.003650 -0.149683 ... 0.150716 Embarked_C 0.168240 1.000000 -0.164166 -0.778262 ... 0.107782 Embarked_Q 0.003650 -0.164166 1.000000 -0.491656 ... -0.061459 Embarked_S -0.149683 -0.778262 -0.491656 1.000000 ... -0.056023 Pclass_1 0.285904 0.325722 -0.166101 -0.181800 ... 0.275698 Pclass_2 0.093349 -0.134675 -0.121973 0.196532 ... -0.037929 Pclass_3 -0.322308 -0.171430 0.243706 -0.003805 ... -0.207455 Master 0.085221 -0.014172 -0.009091 0.018297 ... -0.042192 Miss 0.332795 -0.014351 0.198804 -0.113886 ... -0.012516 Mr -0.549199 -0.065538 -0.080224 0.108924 ... -0.030261 Mrs 0.344935 0.098379 -0.100374 -0.022950 ... 0.080393 Officer -0.031316 0.003678 -0.003212 -0.001202 ... 0.006055 Royalty 0.033391 0.077213 -0.021853 -0.054250 ... -0.012950 Cabin_A 0.022287 0.094914 -0.042105 -0.056984 ... -0.024952 Cabin_B 0.175095 0.161595 -0.073613 -0.095790 ... -0.043624 Cabin_C 0.114652 0.158043 -0.059151 -0.101861 ... -0.053083 Cabin_D 0.150716 0.107782 -0.061459 -0.056023 ... 1.000000 Cabin_E 0.145321 0.027566 -0.042877 0.002960 ... -0.034317 Cabin_F 0.057935 -0.020010 -0.020282 0.030575 ... -0.024369 Cabin_G 0.016040 -0.031566 -0.019941 0.040560 ... -0.011817 Cabin_T -0.026456 -0.014095 -0.008904 0.018111 ... -0.005277 Cabin_U -0.316912 -0.258257 0.142369 0.137351 ... -0.353822 FamilySize 0.016639 -0.036553 -0.087190 0.087771 ... -0.025313 Family_Single -0.203367 -0.107874 0.127214 0.014246 ... 
-0.074310 Family_Small 0.279855 0.159594 -0.122491 -0.062909 ... 0.102432 Family_Large -0.125147 -0.092825 -0.018423 0.093671 ... -0.049336 Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize \\ Age 0.106600 -0.072644 -0.085977 0.032461 -0.271918 -0.196996 Fare 0.073949 -0.037567 -0.022857 0.001179 -0.507197 0.226465 Parch 0.001084 0.020481 0.058325 -0.012304 -0.036806 0.792296 PassengerId -0.008136 0.000306 -0.045949 -0.023049 0.000208 -0.031437 Sex -0.040340 -0.006655 -0.083285 0.020558 0.137396 -0.188583 SibSp -0.027180 -0.008619 0.006015 -0.013247 0.009064 0.861952 Survived 0.145321 0.057935 0.016040 -0.026456 -0.316912 0.016639 Embarked_C 0.027566 -0.020010 -0.031566 -0.014095 -0.258257 -0.036553 Embarked_Q -0.042877 -0.020282 -0.019941 -0.008904 0.142369 -0.087190 Embarked_S 0.002960 0.030575 0.040560 0.018111 0.137351 0.087771 Pclass_1 0.242963 -0.073083 -0.035441 0.048310 -0.776987 -0.029656 Pclass_2 -0.050210 0.127371 -0.032081 -0.014325 0.176485 -0.039976 Pclass_3 -0.169063 -0.041178 0.056964 -0.030057 0.527614 0.058430 Master 0.001860 0.058311 -0.013690 -0.006113 0.041178 0.355061 Miss 0.008700 -0.003088 0.061881 -0.013832 -0.004364 0.087350 Mr -0.032953 -0.026403 -0.072514 0.023611 0.131807 -0.326487 Mrs 0.045538 0.013376 0.042547 -0.011742 -0.162253 0.157233 Officer -0.024048 -0.017076 -0.008281 -0.003698 -0.067030 -0.026921 Royalty -0.012202 -0.008665 -0.004202 -0.001876 -0.071672 -0.023600 Cabin_A -0.023510 -0.016695 -0.008096 -0.003615 -0.242399 -0.042967 Cabin_B -0.041103 -0.029188 -0.014154 -0.006320 -0.423794 0.032318 Cabin_C -0.050016 -0.035516 -0.017224 -0.007691 -0.515684 0.037226 Cabin_D -0.034317 -0.024369 -0.011817 -0.005277 -0.353822 -0.025313 Cabin_E 1.000000 -0.022961 -0.011135 -0.004972 -0.333381 -0.017285 Cabin_F -0.022961 1.000000 -0.007907 -0.003531 -0.236733 0.005525 Cabin_G -0.011135 -0.007907 1.000000 -0.001712 -0.114803 0.035835 Cabin_T -0.004972 -0.003531 -0.001712 1.000000 -0.051263 -0.015438 Cabin_U -0.333381 -0.236733 
-0.114803 -0.051263 1.000000 -0.014155 FamilySize -0.017285 0.005525 0.035835 -0.015438 -0.014155 1.000000 Family_Single -0.042535 0.004055 -0.076397 0.022411 0.175812 -0.688864 Family_Small 0.068007 0.012756 0.087471 -0.019574 -0.211367 0.302640 Family_Large -0.046485 -0.033009 -0.016008 -0.007148 0.056438 0.801623 Family_Single Family_Small Family_Large Age 0.116675 -0.038189 -0.161210 Fare -0.274826 0.197281 0.170853 Parch -0.549022 0.248532 0.624627 PassengerId 0.028546 0.002975 -0.063415 Sex 0.284537 -0.255196 -0.077748 SibSp -0.591077 0.253590 0.699681 Survived -0.203367 0.279855 -0.125147 Embarked_C -0.107874 0.159594 -0.092825 Embarked_Q 0.127214 -0.122491 -0.018423 Embarked_S 0.014246 -0.062909 0.093671 Pclass_1 -0.126551 0.165965 -0.067523 Pclass_2 -0.035075 0.097270 -0.118495 Pclass_3 0.138250 -0.223338 0.155560 Master -0.265355 0.120166 0.301809 Miss -0.023890 -0.018085 0.083422 Mr 0.386262 -0.300872 -0.194207 Mrs -0.354649 0.361247 0.012893 Officer 0.013303 0.003966 -0.034572 Royalty 0.008761 -0.000073 -0.017542 Cabin_A 0.045227 -0.029546 -0.033799 Cabin_B -0.087912 0.084268 0.013470 Cabin_C -0.137498 0.141925 0.001362 Cabin_D -0.074310 0.102432 -0.049336 Cabin_E -0.042535 0.068007 -0.046485 Cabin_F 0.004055 0.012756 -0.033009 Cabin_G -0.076397 0.087471 -0.016008 Cabin_T 0.022411 -0.019574 -0.007148 Cabin_U 0.175812 -0.211367 0.056438 FamilySize -0.688864 0.302640 0.801623 Family_Single 1.000000 -0.873398 -0.318944 Family_Small -0.873398 1.000000 -0.183007 Family_Large -0.318944 -0.183007 1.000000 [32 rows x 32 columns]",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>Fare</th> <th>Parch</th> <th>PassengerId</th> <th>Sex</th> <th>SibSp</th> <th>Survived</th> <th>Embarked_C</th> <th>Embarked_Q</th> <th>Embarked_S</th> <th>...</th> <th>Cabin_D</th> <th>Cabin_E</th> <th>Cabin_F</th> <th>Cabin_G</th> <th>Cabin_T</th> <th>Cabin_U</th> <th>FamilySize</th> <th>Family_Single</th> <th>Family_Small</th> <th>Family_Large</th> </tr> </thead> <tbody> <tr> <td>Age</td> <td>1.000000</td> <td>0.171521</td> <td>-0.130872</td> <td>0.025731</td> <td>0.057397</td> <td>-0.190747</td> <td>-0.070323</td> <td>0.076179</td> <td>-0.012718</td> <td>-0.059153</td> <td>...</td> <td>0.132886</td> <td>0.106600</td> <td>-0.072644</td> <td>-0.085977</td> <td>0.032461</td> <td>-0.271918</td> <td>-0.196996</td> <td>0.116675</td> <td>-0.038189</td> <td>-0.161210</td> </tr> <tr> <td>Fare</td> <td>0.171521</td> <td>1.000000</td> <td>0.221522</td> <td>0.031416</td> <td>-0.185484</td> <td>0.160224</td> <td>0.257307</td> <td>0.286241</td> <td>-0.130054</td> <td>-0.169894</td> <td>...</td> <td>0.072737</td> <td>0.073949</td> <td>-0.037567</td> <td>-0.022857</td> <td>0.001179</td> <td>-0.507197</td> <td>0.226465</td> <td>-0.274826</td> <td>0.197281</td> <td>0.170853</td> </tr> <tr> <td>Parch</td> <td>-0.130872</td> <td>0.221522</td> <td>1.000000</td> <td>0.008942</td> <td>-0.213125</td> <td>0.373587</td> <td>0.081629</td> <td>-0.008635</td> <td>-0.100943</td> <td>0.071881</td> <td>...</td> <td>-0.027385</td> <td>0.001084</td> <td>0.020481</td> <td>0.058325</td> <td>-0.012304</td> <td>-0.036806</td> <td>0.792296</td> <td>-0.549022</td> <td>0.248532</td> <td>0.624627</td> </tr> <tr> <td>PassengerId</td> <td>0.025731</td> <td>0.031416</td> <td>0.008942</td> <td>1.000000</td> 
<td>0.013406</td> <td>-0.055224</td> <td>-0.005007</td> <td>0.048101</td> <td>0.011585</td> <td>-0.049836</td> <td>...</td> <td>0.000549</td> <td>-0.008136</td> <td>0.000306</td> <td>-0.045949</td> <td>-0.023049</td> <td>0.000208</td> <td>-0.031437</td> <td>0.028546</td> <td>0.002975</td> <td>-0.063415</td> </tr> <tr> <td>Sex</td> <td>0.057397</td> <td>-0.185484</td> <td>-0.213125</td> <td>0.013406</td> <td>1.000000</td> <td>-0.109609</td> <td>-0.543351</td> <td>-0.066564</td> <td>-0.088651</td> <td>0.115193</td> <td>...</td> <td>-0.057396</td> <td>-0.040340</td> <td>-0.006655</td> <td>-0.083285</td> <td>0.020558</td> <td>0.137396</td> <td>-0.188583</td> <td>0.284537</td> <td>-0.255196</td> <td>-0.077748</td> </tr> <tr> <td>SibSp</td> <td>-0.190747</td> <td>0.160224</td> <td>0.373587</td> <td>-0.055224</td> <td>-0.109609</td> <td>1.000000</td> <td>-0.035322</td> <td>-0.048396</td> <td>-0.048678</td> <td>0.073709</td> <td>...</td> <td>-0.015727</td> <td>-0.027180</td> <td>-0.008619</td> <td>0.006015</td> <td>-0.013247</td> <td>0.009064</td> <td>0.861952</td> <td>-0.591077</td> <td>0.253590</td> <td>0.699681</td> </tr> <tr> <td>Survived</td> <td>-0.070323</td> <td>0.257307</td> <td>0.081629</td> <td>-0.005007</td> <td>-0.543351</td> <td>-0.035322</td> <td>1.000000</td> <td>0.168240</td> <td>0.003650</td> <td>-0.149683</td> <td>...</td> <td>0.150716</td> <td>0.145321</td> <td>0.057935</td> <td>0.016040</td> <td>-0.026456</td> <td>-0.316912</td> <td>0.016639</td> <td>-0.203367</td> <td>0.279855</td> <td>-0.125147</td> </tr> <tr> <td>Embarked_C</td> <td>0.076179</td> <td>0.286241</td> <td>-0.008635</td> <td>0.048101</td> <td>-0.066564</td> <td>-0.048396</td> <td>0.168240</td> <td>1.000000</td> <td>-0.164166</td> <td>-0.778262</td> <td>...</td> <td>0.107782</td> <td>0.027566</td> <td>-0.020010</td> <td>-0.031566</td> <td>-0.014095</td> <td>-0.258257</td> <td>-0.036553</td> <td>-0.107874</td> <td>0.159594</td> <td>-0.092825</td> </tr> <tr> <td>Embarked_Q</td> 
<td>-0.012718</td> <td>-0.130054</td> <td>-0.100943</td> <td>0.011585</td> <td>-0.088651</td> <td>-0.048678</td> <td>0.003650</td> <td>-0.164166</td> <td>1.000000</td> <td>-0.491656</td> <td>...</td> <td>-0.061459</td> <td>-0.042877</td> <td>-0.020282</td> <td>-0.019941</td> <td>-0.008904</td> <td>0.142369</td> <td>-0.087190</td> <td>0.127214</td> <td>-0.122491</td> <td>-0.018423</td> </tr> <tr> <td>Embarked_S</td> <td>-0.059153</td> <td>-0.169894</td> <td>0.071881</td> <td>-0.049836</td> <td>0.115193</td> <td>0.073709</td> <td>-0.149683</td> <td>-0.778262</td> <td>-0.491656</td> <td>1.000000</td> <td>...</td> <td>-0.056023</td> <td>0.002960</td> <td>0.030575</td> <td>0.040560</td> <td>0.018111</td> <td>0.137351</td> <td>0.087771</td> <td>0.014246</td> <td>-0.062909</td> <td>0.093671</td> </tr> <tr> <td>Pclass_1</td> <td>0.362587</td> <td>0.599956</td> <td>-0.013033</td> <td>0.026495</td> <td>-0.107371</td> <td>-0.034256</td> <td>0.285904</td> <td>0.325722</td> <td>-0.166101</td> <td>-0.181800</td> <td>...</td> <td>0.275698</td> <td>0.242963</td> <td>-0.073083</td> <td>-0.035441</td> <td>0.048310</td> <td>-0.776987</td> <td>-0.029656</td> <td>-0.126551</td> <td>0.165965</td> <td>-0.067523</td> </tr> <tr> <td>Pclass_2</td> <td>-0.014193</td> <td>-0.121372</td> <td>-0.010057</td> <td>0.022714</td> <td>-0.028862</td> <td>-0.052419</td> <td>0.093349</td> <td>-0.134675</td> <td>-0.121973</td> <td>0.196532</td> <td>...</td> <td>-0.037929</td> <td>-0.050210</td> <td>0.127371</td> <td>-0.032081</td> <td>-0.014325</td> <td>0.176485</td> <td>-0.039976</td> <td>-0.035075</td> <td>0.097270</td> <td>-0.118495</td> </tr> <tr> <td>Pclass_3</td> <td>-0.302093</td> <td>-0.419616</td> <td>0.019521</td> <td>-0.041544</td> <td>0.116562</td> <td>0.072610</td> <td>-0.322308</td> <td>-0.171430</td> <td>0.243706</td> <td>-0.003805</td> <td>...</td> <td>-0.207455</td> <td>-0.169063</td> <td>-0.041178</td> <td>0.056964</td> <td>-0.030057</td> <td>0.527614</td> <td>0.058430</td> 
<td>0.138250</td> <td>-0.223338</td> <td>0.155560</td> </tr> <tr> <td>Master</td> <td>-0.363923</td> <td>0.011596</td> <td>0.253482</td> <td>0.002254</td> <td>0.164375</td> <td>0.329171</td> <td>0.085221</td> <td>-0.014172</td> <td>-0.009091</td> <td>0.018297</td> <td>...</td> <td>-0.042192</td> <td>0.001860</td> <td>0.058311</td> <td>-0.013690</td> <td>-0.006113</td> <td>0.041178</td> <td>0.355061</td> <td>-0.265355</td> <td>0.120166</td> <td>0.301809</td> </tr> <tr> <td>Miss</td> <td>-0.254146</td> <td>0.092051</td> <td>0.066473</td> <td>-0.050027</td> <td>-0.672819</td> <td>0.077564</td> <td>0.332795</td> <td>-0.014351</td> <td>0.198804</td> <td>-0.113886</td> <td>...</td> <td>-0.012516</td> <td>0.008700</td> <td>-0.003088</td> <td>0.061881</td> <td>-0.013832</td> <td>-0.004364</td> <td>0.087350</td> <td>-0.023890</td> <td>-0.018085</td> <td>0.083422</td> </tr> <tr> <td>Mr</td> <td>0.165476</td> <td>-0.192192</td> <td>-0.304780</td> <td>0.014116</td> <td>0.870678</td> <td>-0.243104</td> <td>-0.549199</td> <td>-0.065538</td> <td>-0.080224</td> <td>0.108924</td> <td>...</td> <td>-0.030261</td> <td>-0.032953</td> <td>-0.026403</td> <td>-0.072514</td> <td>0.023611</td> <td>0.131807</td> <td>-0.326487</td> <td>0.386262</td> <td>-0.300872</td> <td>-0.194207</td> </tr> <tr> <td>Mrs</td> <td>0.198091</td> <td>0.139235</td> <td>0.213491</td> <td>0.033299</td> <td>-0.571176</td> <td>0.061643</td> <td>0.344935</td> <td>0.098379</td> <td>-0.100374</td> <td>-0.022950</td> <td>...</td> <td>0.080393</td> <td>0.045538</td> <td>0.013376</td> <td>0.042547</td> <td>-0.011742</td> <td>-0.162253</td> <td>0.157233</td> <td>-0.354649</td> <td>0.361247</td> <td>0.012893</td> </tr> <tr> <td>Officer</td> <td>0.162818</td> <td>0.028696</td> <td>-0.032631</td> <td>0.002231</td> <td>0.087288</td> <td>-0.013813</td> <td>-0.031316</td> <td>0.003678</td> <td>-0.003212</td> <td>-0.001202</td> <td>...</td> <td>0.006055</td> <td>-0.024048</td> <td>-0.017076</td> <td>-0.008281</td> 
<td>-0.003698</td> <td>-0.067030</td> <td>-0.026921</td> <td>0.013303</td> <td>0.003966</td> <td>-0.034572</td> </tr> <tr> <td>Royalty</td> <td>0.059466</td> <td>0.026214</td> <td>-0.030197</td> <td>0.004400</td> <td>-0.020408</td> <td>-0.010787</td> <td>0.033391</td> <td>0.077213</td> <td>-0.021853</td> <td>-0.054250</td> <td>...</td> <td>-0.012950</td> <td>-0.012202</td> <td>-0.008665</td> <td>-0.004202</td> <td>-0.001876</td> <td>-0.071672</td> <td>-0.023600</td> <td>0.008761</td> <td>-0.000073</td> <td>-0.017542</td> </tr> <tr> <td>Cabin_A</td> <td>0.125177</td> <td>0.020094</td> <td>-0.030707</td> <td>-0.002831</td> <td>0.047561</td> <td>-0.039808</td> <td>0.022287</td> <td>0.094914</td> <td>-0.042105</td> <td>-0.056984</td> <td>...</td> <td>-0.024952</td> <td>-0.023510</td> <td>-0.016695</td> <td>-0.008096</td> <td>-0.003615</td> <td>-0.242399</td> <td>-0.042967</td> <td>0.045227</td> <td>-0.029546</td> <td>-0.033799</td> </tr> <tr> <td>Cabin_B</td> <td>0.113458</td> <td>0.393743</td> <td>0.073051</td> <td>0.015895</td> <td>-0.094453</td> <td>-0.011569</td> <td>0.175095</td> <td>0.161595</td> <td>-0.073613</td> <td>-0.095790</td> <td>...</td> <td>-0.043624</td> <td>-0.041103</td> <td>-0.029188</td> <td>-0.014154</td> <td>-0.006320</td> <td>-0.423794</td> <td>0.032318</td> <td>-0.087912</td> <td>0.084268</td> <td>0.013470</td> </tr> <tr> <td>Cabin_C</td> <td>0.167993</td> <td>0.401370</td> <td>0.009601</td> <td>0.006092</td> <td>-0.077473</td> <td>0.048616</td> <td>0.114652</td> <td>0.158043</td> <td>-0.059151</td> <td>-0.101861</td> <td>...</td> <td>-0.053083</td> <td>-0.050016</td> <td>-0.035516</td> <td>-0.017224</td> <td>-0.007691</td> <td>-0.515684</td> <td>0.037226</td> <td>-0.137498</td> <td>0.141925</td> <td>0.001362</td> </tr> <tr> <td>Cabin_D</td> <td>0.132886</td> <td>0.072737</td> <td>-0.027385</td> <td>0.000549</td> <td>-0.057396</td> <td>-0.015727</td> <td>0.150716</td> <td>0.107782</td> <td>-0.061459</td> <td>-0.056023</td> <td>...</td> 
<td>1.000000</td> <td>-0.034317</td> <td>-0.024369</td> <td>-0.011817</td> <td>-0.005277</td> <td>-0.353822</td> <td>-0.025313</td> <td>-0.074310</td> <td>0.102432</td> <td>-0.049336</td> </tr> <tr> <td>Cabin_E</td> <td>0.106600</td> <td>0.073949</td> <td>0.001084</td> <td>-0.008136</td> <td>-0.040340</td> <td>-0.027180</td> <td>0.145321</td> <td>0.027566</td> <td>-0.042877</td> <td>0.002960</td> <td>...</td> <td>-0.034317</td> <td>1.000000</td> <td>-0.022961</td> <td>-0.011135</td> <td>-0.004972</td> <td>-0.333381</td> <td>-0.017285</td> <td>-0.042535</td> <td>0.068007</td> <td>-0.046485</td> </tr> <tr> <td>Cabin_F</td> <td>-0.072644</td> <td>-0.037567</td> <td>0.020481</td> <td>0.000306</td> <td>-0.006655</td> <td>-0.008619</td> <td>0.057935</td> <td>-0.020010</td> <td>-0.020282</td> <td>0.030575</td> <td>...</td> <td>-0.024369</td> <td>-0.022961</td> <td>1.000000</td> <td>-0.007907</td> <td>-0.003531</td> <td>-0.236733</td> <td>0.005525</td> <td>0.004055</td> <td>0.012756</td> <td>-0.033009</td> </tr> <tr> <td>Cabin_G</td> <td>-0.085977</td> <td>-0.022857</td> <td>0.058325</td> <td>-0.045949</td> <td>-0.083285</td> <td>0.006015</td> <td>0.016040</td> <td>-0.031566</td> <td>-0.019941</td> <td>0.040560</td> <td>...</td> <td>-0.011817</td> <td>-0.011135</td> <td>-0.007907</td> <td>1.000000</td> <td>-0.001712</td> <td>-0.114803</td> <td>0.035835</td> <td>-0.076397</td> <td>0.087471</td> <td>-0.016008</td> </tr> <tr> <td>Cabin_T</td> <td>0.032461</td> <td>0.001179</td> <td>-0.012304</td> <td>-0.023049</td> <td>0.020558</td> <td>-0.013247</td> <td>-0.026456</td> <td>-0.014095</td> <td>-0.008904</td> <td>0.018111</td> <td>...</td> <td>-0.005277</td> <td>-0.004972</td> <td>-0.003531</td> <td>-0.001712</td> <td>1.000000</td> <td>-0.051263</td> <td>-0.015438</td> <td>0.022411</td> <td>-0.019574</td> <td>-0.007148</td> </tr> <tr> <td>Cabin_U</td> <td>-0.271918</td> <td>-0.507197</td> <td>-0.036806</td> <td>0.000208</td> <td>0.137396</td> <td>0.009064</td> 
<td>-0.316912</td> <td>-0.258257</td> <td>0.142369</td> <td>0.137351</td> <td>...</td> <td>-0.353822</td> <td>-0.333381</td> <td>-0.236733</td> <td>-0.114803</td> <td>-0.051263</td> <td>1.000000</td> <td>-0.014155</td> <td>0.175812</td> <td>-0.211367</td> <td>0.056438</td> </tr> <tr> <td>FamilySize</td> <td>-0.196996</td> <td>0.226465</td> <td>0.792296</td> <td>-0.031437</td> <td>-0.188583</td> <td>0.861952</td> <td>0.016639</td> <td>-0.036553</td> <td>-0.087190</td> <td>0.087771</td> <td>...</td> <td>-0.025313</td> <td>-0.017285</td> <td>0.005525</td> <td>0.035835</td> <td>-0.015438</td> <td>-0.014155</td> <td>1.000000</td> <td>-0.688864</td> <td>0.302640</td> <td>0.801623</td> </tr> <tr> <td>Family_Single</td> <td>0.116675</td> <td>-0.274826</td> <td>-0.549022</td> <td>0.028546</td> <td>0.284537</td> <td>-0.591077</td> <td>-0.203367</td> <td>-0.107874</td> <td>0.127214</td> <td>0.014246</td> <td>...</td> <td>-0.074310</td> <td>-0.042535</td> <td>0.004055</td> <td>-0.076397</td> <td>0.022411</td> <td>0.175812</td> <td>-0.688864</td> <td>1.000000</td> <td>-0.873398</td> <td>-0.318944</td> </tr> <tr> <td>Family_Small</td> <td>-0.038189</td> <td>0.197281</td> <td>0.248532</td> <td>0.002975</td> <td>-0.255196</td> <td>0.253590</td> <td>0.279855</td> <td>0.159594</td> <td>-0.122491</td> <td>-0.062909</td> <td>...</td> <td>0.102432</td> <td>0.068007</td> <td>0.012756</td> <td>0.087471</td> <td>-0.019574</td> <td>-0.211367</td> <td>0.302640</td> <td>-0.873398</td> <td>1.000000</td> <td>-0.183007</td> </tr> <tr> <td>Family_Large</td> <td>-0.161210</td> <td>0.170853</td> <td>0.624627</td> <td>-0.063415</td> <td>-0.077748</td> <td>0.699681</td> <td>-0.125147</td> <td>-0.092825</td> <td>-0.018423</td> <td>0.093671</td> <td>...</td> <td>-0.049336</td> <td>-0.046485</td> <td>-0.033009</td> <td>-0.016008</td> <td>-0.007148</td> <td>0.056438</td> <td>0.801623</td> <td>-0.318944</td> <td>-0.183007</td> <td>1.000000</td> </tr> </tbody> </table> <p>32 rows × 32 columns</p> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 813
}
],
"source": [
"# Correlation matrix ",
"corrDf = full.corr() ",
"corrDf"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 814,
"outputs": [
{
"data": {
"text/plain": "Survived 1.000000 Mrs 0.344935 Miss 0.332795 Pclass_1 0.285904 Family_Small 0.279855 Fare 0.257307 Cabin_B 0.175095 Embarked_C 0.168240 Cabin_D 0.150716 Cabin_E 0.145321 Cabin_C 0.114652 Pclass_2 0.093349 Master 0.085221 Parch 0.081629 Cabin_F 0.057935 Royalty 0.033391 Cabin_A 0.022287 FamilySize 0.016639 Cabin_G 0.016040 Embarked_Q 0.003650 PassengerId -0.005007 Cabin_T -0.026456 Officer -0.031316 SibSp -0.035322 Age -0.070323 Family_Large -0.125147 Embarked_S -0.149683 Family_Single -0.203367 Cabin_U -0.316912 Pclass_3 -0.322308 Sex -0.543351 Mr -0.549199 Name: Survived, dtype: float64"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 814
}
],
"source": [
"''' ",
"Show each feature's correlation coefficient with survival (Survived); ",
"ascending=False sorts the values in descending order ",
"''' ",
"corrDf['Survived'].sort_values(ascending=False)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
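{
"cell_type": "markdown",
"source": [
"Based on the magnitude of each feature's correlation coefficient with survival (Survived), we select the following features as model inputs: ",
" ",
"title (titleDf, the dataset built earlier), passenger class (pclassDf), family size (familyDf), fare (Fare), cabin (cabinDf), port of embarkation (embarkedDf), and sex (Sex)"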
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 815,
"outputs": [
{
"data": {
"text/plain": " Master Miss Mr Mrs Officer Royalty Pclass_1 Pclass_2 Pclass_3 \\ 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 2 0 1 0 0 0 0 0 0 1 3 0 0 0 1 0 0 1 0 0 4 0 0 1 0 0 0 0 0 1 FamilySize ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U \\ 0 2 ... 0 0 0 0 0 1 1 2 ... 0 0 0 0 0 0 2 1 ... 0 0 0 0 0 1 3 2 ... 0 0 0 0 0 0 4 1 ... 0 0 0 0 0 1 Embarked_C Embarked_Q Embarked_S Sex 0 0 0 1 1 1 1 0 0 0 2 0 0 1 0 3 0 0 1 0 4 0 0 1 1 [5 rows x 27 columns]",
"text/html": "<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Master</th> <th>Miss</th> <th>Mr</th> <th>Mrs</th> <th>Officer</th> <th>Royalty</th> <th>Pclass_1</th> <th>Pclass_2</th> <th>Pclass_3</th> <th>FamilySize</th> <th>...</th> <th>Cabin_D</th> <th>Cabin_E</th> <th>Cabin_F</th> <th>Cabin_G</th> <th>Cabin_T</th> <th>Cabin_U</th> <th>Embarked_C</th> <th>Embarked_Q</th> <th>Embarked_S</th> <th>Sex</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>2</td> <td>...</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> <td>1</td> </tr> <tr> <td>1</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>2</td> <td>...</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr> <td>2</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>1</td> <td>...</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> </tr> <tr> <td>3</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>2</td> <td>...</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> </tr> <tr> <td>4</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>1</td> <td>...</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td> <td>1</td> <td>1</td> </tr> </tbody> </table> <p>5 rows × 27 columns</p> </div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 815
}
],
"source": [
"# Feature selection ",
"full_X = pd.concat( [titleDf,# title ",
" pclassDf,# passenger class ",
" familyDf,# family size ",
" full['Fare'],# fare ",
" cabinDf,# cabin ",
" embarkedDf,# port of embarkation ",
" full['Sex']# sex ",
" ] , axis=1 ) ",
"full_X.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"# 4. Modeling"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Fit a machine learning model on the training data with a chosen algorithm, then evaluate the model on the test data."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 4.1 Create the training and test datasets"
],
"metadata": {
"collapsed": false
}
},
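{
"cell_type": "code",
"execution_count": 816,
"outputs": [],
"source": [
"''' ",
"1) The Titanic test dataset is what we finally submit to Kaggle. It contains no survival values, so it cannot be used to evaluate the model. ",
"We therefore call the test data supplied by the Kaggle Titanic project the prediction dataset (pred, short for predict): ",
"it is the data whose survival outcomes we predict with the machine learning model. ",
"2) We use the training dataset supplied by the Kaggle Titanic project as our original dataset (source), ",
"and split it into a training dataset (train, used to fit the model) and a test dataset (test, used to evaluate the model). ",
" ",
"''' ",
"# The original dataset has 891 rows ",
"sourceRow=891 ",
" ",
"''' ",
"sourceRow is known from before the two datasets were merged: the original dataset has 891 rows in total. ",
"To take the first 891 rows of the feature set full_X we subtract 1, because row labels start at 0 and .loc slicing includes the end label. ",
"''' ",
"# Original dataset: features ",
"source_X = full_X.loc[0:sourceRow-1,:] ",
"# Original dataset: labels ",
"source_y = full.loc[0:sourceRow-1,'Survived'] ",
" ",
"# Prediction dataset: features ",
"pred_X = full_X.loc[sourceRow:,:]"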
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 817,
"outputs": [
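{
"name": "stdout",
"text": [
"Rows in original dataset: 891 ",
"Rows in prediction dataset: 418 "
],
"output_type": "stream"
}
],
"source": [
"''' ",
"Make sure the original dataset is exactly the first 891 rows; otherwise the model will fail later ",
"''' ",
"# Number of rows in the original dataset ",
"print('Rows in original dataset:',source_X.shape[0]) ",
"# Number of rows in the prediction dataset ",
"print('Rows in prediction dataset:',pred_X.shape[0])"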
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 818,
"outputs": [
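{
"name": "stdout",
"text": [
"Original dataset features: (891, 27) Training dataset features: (712, 27) Test dataset features: (179, 27) ",
"Original dataset labels: (891,) Training dataset labels: (712,) Test dataset labels: (179,) "
],
"output_type": "stream"
}
],
"source": [
"''' ",
"Split the original dataset (source) into a training dataset (train, for fitting the model) and a test dataset (test, for evaluating it). ",
"train_test_split is a function commonly used in cross-validation work; it randomly selects train data and test data from the samples in a given proportion. ",
"train_data: the sample features to split ",
"train_target: the sample labels to split ",
"test_size: fraction of samples for the test set, or the number of samples if an integer (the call below sets train_size=.8 instead, leaving the remaining 20% for testing) ",
"''' ",
" ",
"# Build the training and test datasets used for modeling ",
"train_X, test_X, train_y, test_y = train_test_split(source_X , ",
" source_y, ",
" train_size=.8) ",
" ",
"# Print the dataset sizes ",
"print ('Original dataset features:',source_X.shape, ",
" 'Training dataset features:',train_X.shape , ",
" 'Test dataset features:',test_X.shape) ",
" ",
"print ('Original dataset labels:',source_y.shape, ",
" 'Training dataset labels:',train_y.shape , ",
" 'Test dataset labels:',test_y.shape)"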
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 819,
"outputs": [
{
"data": {
"text/plain": "0 0.0 1 1.0 2 1.0 3 1.0 4 0.0 Name: Survived, dtype: float64"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 819
}
],
"source": [
"# Inspect the original labels ",
"source_y.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
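{
"cell_type": "markdown",
"source": [
"## 4.2 Choose a machine learning algorithm"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Choose a machine learning algorithm to train the model. If you are new to this, logistic regression is a good place to start."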
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 820,
"outputs": [],
"source": [
"# Step 1: import the algorithm ",
"from sklearn.linear_model import LogisticRegression ",
"# Step 2: create the model: logistic regression ",
"model = LogisticRegression()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 821,
"outputs": [],
"source": [
"# Random Forests Model ",
"#from sklearn.ensemble import RandomForestClassifier ",
"#model = RandomForestClassifier(n_estimators=100)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 822,
"outputs": [],
"source": [
"# Support Vector Machines ",
"#from sklearn.svm import SVC, LinearSVC ",
"#model = SVC()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 823,
"outputs": [],
"source": [
"#Gradient Boosting Classifier ",
"#from sklearn.ensemble import GradientBoostingClassifier ",
"#model = GradientBoostingClassifier()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 824,
"outputs": [],
"source": [
"#K-nearest neighbors ",
"#from sklearn.neighbors import KNeighborsClassifier ",
"#model = KNeighborsClassifier(n_neighbors = 3)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 825,
"outputs": [],
"source": [
"# Gaussian Naive Bayes ",
"#from sklearn.naive_bayes import GaussianNB ",
"#model = GaussianNB()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 4.3 Train the model"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 826,
"outputs": [
{
"data": {
"text/plain": "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=1, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 826
}
],
"source": [
"# Step 3: train the model ",
"model.fit( train_X , train_y)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"# 5. Evaluate the model"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 827,
"outputs": [
{
"data": {
"text/plain": "0.6927374301675978"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 827
}
],
"source": [
"# For a classification problem, score returns the model's accuracy ",
"model.score(test_X , test_y )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"# 6. Deployment"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 6.1 Generate predictions and upload them to Kaggle"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Use the prediction dataset to generate predictions, save them to a CSV file, and upload the file to Kaggle to see your ranking."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 828,
"outputs": [],
"source": [
"# Use the trained model to predict survival for the prediction dataset ",
"pred_Y = model.predict(pred_X) ",
" ",
"''' ",
"The predicted values are floats (0.0, 1.0), ",
"but Kaggle requires integer values (0, 1) in the submission, ",
"so the data type has to be converted ",
"''' ",
"pred_Y=pred_Y.astype(int) ",
"# Passenger IDs ",
"passenger_id = full.loc[sourceRow:,'PassengerId'] ",
"# DataFrame: passenger ID and predicted survival ",
"predDf = pd.DataFrame( ",
" { 'PassengerId': passenger_id , ",
" 'Survived': pred_Y } ) ",
"predDf.shape ",
"predDf.head() ",
"# Save the result ",
"predDf.to_csv( 'titanic_pred.csv' , index = False )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 6.2 Writing the report"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"In the next lesson, \"Data Visualization\", we will go into detail on how to put together a data analysis report."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 828,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% ",
"is_executing": false
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}
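
The hold-out split in section 4.1 and the accuracy score in section 5 can be sketched without any third-party packages. This is a minimal illustration, not scikit-learn's actual implementation: `split_indices` and `accuracy` are hypothetical helper names, the seed is arbitrary, and the label lists are made up. It assumes the training count is the floor of `train_size * n_samples`, which agrees with the 712/179 split printed by the notebook.

```python
import math
import random


def split_indices(n_samples, train_size=0.8, seed=0):
    """Shuffle row indices and cut them into a train part and a test part."""
    # 891 * 0.8 = 712.8, rounded down to 712; the remaining 179 rows are the test set.
    n_train = math.floor(train_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    return indices[:n_train], indices[n_train:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


train_idx, test_idx = split_indices(891)
print(len(train_idx), len(test_idx))

# Made-up labels: 179 test rows, of which 124 are predicted correctly.
y_true = [1] * 179
y_pred = [1] * 124 + [0] * 55
print(accuracy(y_true, y_pred))
```

The score of roughly 0.693 reported in section 5 is consistent with 124 correct predictions out of 179 test rows. Because the notebook's split is not seeded, that number will vary from run to run. The recorded model output also shows `max_iter=1`, so the solver may have stopped well before converging; raising the iteration limit is probably worth trying before comparing algorithms.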

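The feature ranking used for feature selection rests on the Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below shows that calculation and the descending sort in plain Python. The `pearson` helper and the three toy columns are assumptions made up for illustration, not values from the Titanic data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (hypothetical values, chosen only to show the mechanics).
columns = {
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": [1, 0, 0, 0, 1, 1],  # 1 = male in this toy encoding
    "Fare": [7.25, 71.28, 7.93, 53.1, 8.05, 8.46],
}
target = columns["Survived"]

# Same idea as corrDf['Survived'].sort_values(ascending=False)
ranking = sorted(
    ((name, pearson(values, target)) for name, values in columns.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranking:
    print(f"{name:10s} {r:+.3f}")
```

A descending sort places strongly negative features such as Sex and Mr at the bottom of the list even though they carry as much signal as the top entries, so sorting by absolute value may be a more convenient view when choosing features.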