Kaggle Classic: Titanic Survival Prediction, Machine Learning Experiment 02
Posted by hhh_Moon_hhh
I. Introduction
The RMS Titanic was an Olympic-class ocean liner operated by the British White Star Line, with a displacement of 46,000 tons. At the time she was the largest and most luxuriously appointed passenger ship in the world, with a reputation for being "unsinkable".
Unfortunately, disaster struck on her maiden voyage, from Southampton, England to New York. At around 23:40 on 14 April 1912, the Titanic collided with an iceberg, rupturing the starboard hull from the bow to amidships and flooding five watertight compartments. At around 02:20 on 15 April, the ship broke in two and sank 3,700 meters to the bottom of the Atlantic. Of the 2,224 passengers and crew aboard, 1,517 died, and only 333 bodies were recovered. The sinking of the Titanic remains the deadliest peacetime maritime disaster; the wreck was not found again until 1985 and is now under the protection of UNESCO.
II. The Problem
So, the question is: what did it take to survive the Titanic disaster?
In other words, if we know everything about a person in advance, how can we judge whether that person would have perished?
This is where machine learning comes in.
III. Problem Analysis
Clearly, in the Titanic disaster a person either perished or survived, so this is really a binary classification problem, for which logistic regression is the natural tool. But since this is a first, beginner-level experiment, we will not use logistic regression yet; we will process and analyze the data with plain linear regression instead. Later we will redo this experiment with logistic regression. For now, linear regression it is.
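To make the difference concrete, here is a minimal sketch (names and numbers are illustrative, not part of the experiment below) contrasting the two hypothesis functions: linear regression outputs an unbounded score, while logistic regression squashes the same score into (0, 1) so it can be read as a probability.

```python
import numpy as np

def linear_hypothesis(theta, x):
    # linear regression: h(x) = theta . x, output is unbounded
    return np.dot(theta, x)

def logistic_hypothesis(theta, x):
    # logistic regression: the same score passed through a sigmoid
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([0.5, -0.25])
x = np.array([1.0, 2.0])  # bias term plus one feature
print(linear_hypothesis(theta, x))    # 0.0
print(logistic_hypothesis(theta, x))  # 0.5
```

Both models use the same linear score; only the output transformation differs, which is why the linear-regression version below can still be thresholded at 0.5 to produce a yes/no prediction.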
IV. Implementation
1. Read and preprocess the data
First read in the CSV file, then drop the columns we do not need, and make a few adjustments. This gives the result below:
def read_data_of_csv(file_name):
    """
    read the csv file to get the data of the titanic accident
    :param file_name: the name of the file
    :return: df -> the data in the file that is opened above
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    """
    main
    """
    # here we do not need to split train and test ourselves!
    """
    1. get the data and preprocess it before the machine learning
    """
    df_train = read_data_of_csv("titanic/train.csv")
    # print(df_train)
    # clean the data before machine learning
    df_train.drop("Embarked", axis=1, inplace=True)
    # I think Embarked is not useful, so I delete this column
    df_train.drop("Cabin", axis=1, inplace=True)
    # delete the cabin
    df_train.drop("Ticket", axis=1, inplace=True)
    # delete the ticket
    df_train.drop("Name", axis=1, inplace=True)
    # delete the name
    df_train.drop("PassengerId", axis=1, inplace=True)
    # delete the passenger id
    for int_number_of_len in range(len(df_train)):
        if df_train.loc[int_number_of_len, "Sex"] == "male":
            df_train.loc[int_number_of_len, "Sex"] = 1
            # if male then set the sex to 1
        else:
            df_train.loc[int_number_of_len, "Sex"] = 0
            # if female then set the sex to 0
    # change the Sex column from string to int 1 or 0
    df_train.dropna(axis=0, how="any", inplace=True)
    # delete the rows with NaN data
    print(df_train)
    # show the result
The final df_train:
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 1 22.0 1 0 7.2500
1 1 1 0 38.0 1 0 71.2833
2 1 3 0 26.0 0 0 7.9250
3 1 1 0 35.0 1 0 53.1000
4 0 3 1 35.0 0 0 8.0500
.. ... ... .. ... ... ... ...
885 0 3 0 39.0 0 5 29.1250
886 0 2 1 27.0 0 0 13.0000
887 1 1 0 19.0 0 0 30.0000
889 1 1 1 26.0 0 0 30.0000
890 0 3 1 32.0 0 0 7.7500
[714 rows x 7 columns]
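As an aside, the row-by-row Sex loop above can be replaced by a single vectorized step; here is a sketch on a tiny stand-in frame (the data is made up, but `Series.map` and `dropna` are the standard pandas methods):

```python
import pandas as pd

# a tiny frame standing in for df_train (hypothetical data)
df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Age": [22.0, None, 35.0]})

df["Sex"] = df["Sex"].map({"male": 1, "female": 0})  # encode the whole column in one pass
df.dropna(axis=0, how="any", inplace=True)           # drop rows with NaN, as above
print(df["Sex"].tolist())  # [1, 1]
```

The vectorized version is both shorter and much faster than looping over `len(df)` with `.loc`.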
2. Split labels from features and initialize the parameters
The features and the label in the data need to be separated and handled individually.
After that, the parameters are initialized.
"""
2.split the label and the features
"""
y_train = df_train.loc[:, "Survived"]
print(y_train)
# y of the train
X_train = df_train.loc[:, "Pclass": "Fare"]
print(X_train)
# X of the train
"""
3.set the initial params of the liner regression
"""
alpha = float(input("input the alpha:\\n"))
list_theta = []
for number_of_the_total_thetas_list in range(6 + 1):
list_theta[number_of_the_total_thetas_list] = float(
input(f"input the theta {number_of_the_total_thetas_list}:\\n"))
# input the theta
3. Run the linear regression
This is the code that trains the model.
One thing must be stressed:
the choice of parameters matters a great deal!
"""
4.make the machine learning of the liner regression operations
"""
iter_of_regression = int(input("input the number of iter times:\\n"))
for num_of_iter_of_regression in range(iter_of_regression):
# make iter_of_regression times of the regression
h_x = list_theta[0] + \\
list_theta[1] * X_train.loc[:, "Pclass"] + \\
list_theta[2] * X_train.loc[:, "Sex"] + \\
list_theta[3] * X_train.loc[:, "Age"] + \\
list_theta[4] * X_train.loc[:, "SibSp"] + \\
list_theta[5] * X_train.loc[:, "Parch"] + \\
list_theta[6] * X_train.loc[:, "Fare"]
# 7 theta
# 6 x of the feature
loss = \\
y_train - h_x
# calculate the loss
# loss ^ 2
print(loss.T.dot(loss))
# the sum of loss:
sum_loss = 0
# plus the loss
for o in range(len(loss)):
# print(loss.iloc[o])
sum_loss += loss.iloc[o] # float
# list_theta[0] = list_theta[0] + alpha * sum_loss / len(loss)
# print(loss)
list_theta[0] += \\
alpha * sum_loss / len(loss)
# update the list_theta[0]
# print(list(X_train.index))
# 0's index
for c in [1, 2, 3, 4, 5, 6]:
list_theta[c] += \\
alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
# update the list theta of the params
# X_train.loc[:, c - 1].T.dot(loss)
# T transfer, dot dot_multiply
# do all the thetas !!
# continue
print(list_theta)
# theta
# print(loss.T.dot(loss))
# print(loss.T.dot(loss))
# loss ^ 2
continue
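The per-column loop above is batch gradient descent written out by hand. As a sketch of the same update in matrix form (synthetic data and illustrative names, not the Titanic frame), the whole theta vector can be updated at once with NumPy; this also makes the roles of alpha and the iteration count easier to experiment with:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # bias column + 2 features
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta  # noiseless synthetic labels

theta = np.zeros(3)
alpha = 0.1
for _ in range(500):
    residual = y - X @ theta                   # same "loss" as in the loop above
    theta += alpha * X.T @ residual / len(y)   # one vectorized update for all thetas

print(np.round(theta, 2))  # converges close to [1. 2. -3.]
```

Note that this toy data has features on similar scales, which is why a single alpha works well; with raw Titanic columns like Fare (0 to 512) next to Sex (0 or 1), alpha must be very small for the update not to diverge, which is exactly why the parameter choice above is so delicate.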
4. Test the model and score it
"""
5.do the test of this project
"""
# the same operation
df_test = read_data_of_csv("titanic/test.csv")
df_test.drop("Embarked", axis=1, inplace=True)
df_test.drop("Cabin", axis=1, inplace=True)
df_test.drop("Ticket", axis=1, inplace=True)
df_test.drop("Name", axis=1, inplace=True)
df_test.drop("PassengerId", axis=1, inplace=True)
# delete
for int_number_of_len in range(len(df_test)):
if df_test.loc[int_number_of_len, "Sex"] == "male":
df_test.loc[int_number_of_len, "Sex"] = 1
# if male then set the sex 1
else:
df_test.loc[int_number_of_len, "Sex"] = 0
# if female then set the sex 0
# change the introduction of sex from string to int 1 or 0
# delete the NaNs
df_test.dropna(axis=0, how="any", inplace=True)
# delete the NaN data
print(df_test)
# show the result
y_test = df_test.loc[:, "Survived"]
X_test = df_test.loc[:, "Pclass":"Fare"]
print(y_test)
print(X_test)
test_h_x = list_theta[0] + \\
list_theta[1] * X_train.loc[:, "Pclass"] + \\
list_theta[2] * X_train.loc[:, "Sex"] + \\
list_theta[3] * X_train.loc[:, "Age"] + \\
list_theta[4] * X_train.loc[:, "SibSp"] + \\
list_theta[5] * X_train.loc[:, "Parch"] + \\
list_theta[6] * X_train.loc[:, "Fare"]
# the final function
"""
6.score the model
"""
N = len(y_test)
# the total of the test number
num_of_win_the_prediction_of_the_model = 0
# predict the result
# as long as we are right, whether the person is alive or dead, it does not matter
for win_number_of_the_test_of_each in range(N):
if test_h_x.iloc[win_number_of_the_test_of_each] < 0.5:
test_h_x.iloc[win_number_of_the_test_of_each] = 0
# prediction < 0.5 => 0
if y_test.iloc[win_number_of_the_test_of_each] == 0:
num_of_win_the_prediction_of_the_model += 1
# right!
# right, so we num of win ++
else:
pass
# wrong
else:
test_h_x.iloc[win_number_of_the_test_of_each] = 1
# prediction >= 0.5 => 1
if y_test.iloc[win_number_of_the_test_of_each] == 1:
num_of_win_the_prediction_of_the_model += 1
# right
else:
pass
# wrong
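The scoring loop thresholds each prediction at 0.5 and counts matches. The same accuracy can be computed in a couple of vectorized lines; a sketch on made-up arrays standing in for `test_h_x` and `y_test`:

```python
import numpy as np

# hypothetical predictions and labels standing in for test_h_x and y_test
test_h_x = np.array([0.2, 0.7, 0.55, 0.1])
y_test = np.array([0, 1, 0, 0])

pred = (test_h_x >= 0.5).astype(int)  # threshold at 0.5, the same rule as above
accuracy = (pred == y_test).mean()    # fraction of correct predictions
print(accuracy)  # 0.75
```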
5. Save the results
We open a txt file and write the results into it.
"""
7.save the model
"""
with open("result.txt", "w") as f:
f.write("result record:\\n")
# result
f.write("the alpha:\\n")
f.write(f"{alpha}")
f.write("\\n")
# 1.alpha
f.write("the thetas of the model:\\n")
recording_number_position = 1
for theta_of_the_last in list_theta:
f.write(f"{recording_number_position}. ")
f.write(f"{theta_of_the_last}")
f.write("\\n")
recording_number_position += 1
continue
# 2.write the thetas
f.write("\\n")
f.write("features:\\n")
r_n_p = 1
for data_of_feature in list(X_train):
f.write(f"{r_n_p}. ")
f.write(f"{data_of_feature}")
f.write("\\n")
r_n_p += 1
continue
f.write("\\n")
# 3.features
f.write("score:\\n")
f.write(f"{num_of_win_the_prediction_of_the_model / N}")
f.write("\\n")
f.write(f" or the 100 :{100 * num_of_win_the_prediction_of_the_model / N}")
f.write("\\n")
# 4.score
f.close()
# close the file
sys.exit("bye bye!")
"""
END
"""
The final results:
result record:
the alpha:
0.0001
the thetas of the model:
1. 1.2486050331704237
2. -0.16522774554911307
3. -0.48151480883845743
4. -0.00541835063447703
5. -0.04947303449597998
6. -0.011060136497625706
7. 0.0005749482239116638
features:
1. Pclass
2. Sex
3. Age
4. SibSp
5. Parch
6. Fare
score:
0.4954682779456193
or the 100 :49.546827794561935
As the numbers above show, this model is not very good; it does not even reach a passing grade.
But no matter: later we will redo this case with logistic regression, which should clearly do better.
V. Full Code
"""
the titanic survival prediction of machine learning
by
author: Hu Yu Xuan
at
time: 2021/8/9
using
method: liner regression
"""
import numpy
import pandas
import sys
def read_data_of_csv(file_name):
"""
read the csv files to get the data of the titanic accident
:param file_name: the name of the file
:return: df -> the data in the file that is opened above
"""
df = pandas.read_csv(file_name)
return df
if __name__ == '__main__':
"""
main
"""
# here we need not to split the train and the test !
"""
1.get data and do the prior things before the machine learning
"""
df_train = read_data_of_csv("titanic/train.csv")
# print(df_train)
# deal with the data first before machine learning
df_train.drop("Embarked", axis=1, inplace=True)
# i think embarked is not useful, so i delete this embarked line
df_train.drop("Cabin", axis=1, inplace=True)
# delete the cabin
df_train.drop("Ticket", axis=1, inplace=True)
# delete the ticket
df_train.drop("Name", axis=1, inplace=True)
# delete the name
df_train.drop("PassengerId", axis=1, inplace=True)
# delete the passenger id
for int_number_of_len in range(len(df_train)):
if df_train.loc[int_number_of_len, "Sex"] == "male":
df_train.loc[int_number_of_len, "Sex"] = 1
# if male then set the sex 1
else:
df_train.loc[int_number_of_len, "Sex"] = 0
# if female then set the sex 0
# change the introduction of sex from string to int 1 or 0
df_train.dropna(axis=0, how="any", inplace=True)
# delete the NaN data
print(df_train)
# show the result
"""
2.split the label and the features
"""
y_train = df_train.loc[:, "Survived"]
print(y_train)
# y of the train
X_train = df_train.loc[:, "Pclass": "Fare"]
print(X_train)
# X of the train
"""
3.set the initial params of the liner regression
"""
alpha = float(input("input the alpha:\\n"))
# alpha
list_theta = [0, 0, 0, 0, 0, 0, 0]
# 7 params
for number_of_the_total_thetas_list in range(6 + 1):
list_theta[number_of_the_total_thetas_list] = float(
input(f"input the theta {number_of_the_total_thetas_list}:\\n"))
# input the theta
"""
4.make the machine learning of the liner regression operations
"""
iter_of_regression = int(input("input the number of iter times:\\n"))
for num_of_iter_of_regression in range(iter_of_regression):
# make iter_of_regression times of the regression
h_x = list_theta[0] + \\
list_theta[1] * X_train.loc[:, "Pclass"] + \\
list_theta[2] * X_train.loc[:, "Sex"] + \\
list_theta[3] * X_train.loc[:, "Age"] + \\
list_theta[4] * X_train.loc[:, "SibSp"] + \\
list_theta[5] * X_train.loc[:, "Parch"] + \\
list_theta[6] * X_train.loc[:, "Fare"]