Kaggle Classic: Titanic Survival Prediction, a Machine Learning Experiment ---- 02

Posted hhh_Moon_hhh



I. Introduction

The RMS Titanic was an Olympic-class ocean liner operated by the British White Star Line, with a displacement of 46,000 tons. She was the largest and most luxuriously appointed passenger ship in the world at the time, and was reputed to be "unsinkable".

Unfortunately, disaster struck on her maiden voyage from Southampton, England to New York. At around 23:40 on April 14, 1912, the Titanic collided with an iceberg, tearing open the starboard side from the bow to amidships and flooding five watertight compartments. At around 2:20 a.m. on April 15, the hull broke in two and sank to the floor of the Atlantic, 3,700 meters down. Of the 2,224 passengers and crew aboard, 1,517 perished, and only 333 of the victims' bodies were recovered. The sinking of the Titanic remains the deadliest peacetime maritime disaster; the wreck was not located again until 1985 and is now under UNESCO protection.

II. The Problem

So, here is the question: what did it take to survive the Titanic disaster?

In other words, if we knew everything about a given passenger in advance, how could we judge whether that person would survive or perish?

This is where machine learning comes in.

III. Problem Analysis

Clearly, in the Titanic disaster each person either perished or survived, so this is really a binary classification problem suited to logistic regression. But since this is an introductory experiment, we will hold off on logistic regression for now and use plain linear regression to process and analyze the data. We will redo this experiment later with logistic regression; this time, linear regression it is.
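To make the distinction concrete, here is a minimal sketch of the two hypothesis functions (the theta and feature values are made up for illustration): linear regression produces an unbounded score, while logistic regression squashes that same score into (0, 1) with the sigmoid, so it can be read directly as a survival probability.

```python
import numpy as np

def linear_h(theta, x):
    # linear hypothesis: an unbounded score theta0 + theta1*x1 + ...
    return theta[0] + np.dot(theta[1:], x)

def logistic_h(theta, x):
    # logistic hypothesis: the same score passed through the sigmoid
    z = theta[0] + np.dot(theta[1:], x)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -0.2, 0.3])  # illustrative parameters
x = np.array([2.0, 1.0])            # illustrative features

print(linear_h(theta, x))    # any real number
print(logistic_h(theta, x))  # always strictly between 0 and 1
```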

IV. Implementation

1. Reading and preprocessing the data

First read in the file (csv), then drop some unnecessary columns and make a few adjustments, which gives the result below:



def read_data_of_csv(file_name):
    """
    read the csv files to get the data of the titanic accident
    :param file_name: the name of the file
    :return: df -> the data in the file that is opened above
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    """
    main
    """

    # here we need not to split the train and the test !

    """
    1.get data and do the prior things before the machine learning
    """
    df_train = read_data_of_csv("titanic/train.csv")
    # print(df_train)
    # deal with the data first before machine learning
    df_train.drop("Embarked", axis=1, inplace=True)
    # i think embarked is not useful, so i delete this embarked line
    df_train.drop("Cabin", axis=1, inplace=True)
    # delete the cabin
    df_train.drop("Ticket", axis=1, inplace=True)
    # delete the ticket
    df_train.drop("Name", axis=1, inplace=True)
    # delete the name
    df_train.drop("PassengerId", axis=1, inplace=True)
    # delete the passenger id

    for int_number_of_len in range(len(df_train)):
        if df_train.loc[int_number_of_len, "Sex"] == "male":
            df_train.loc[int_number_of_len, "Sex"] = 1
            # if male then set the sex 1
        else:
            df_train.loc[int_number_of_len, "Sex"] = 0
            # if female then set the sex 0
    # change the introduction of sex from string to int 1 or 0
    df_train.dropna(axis=0, how="any", inplace=True)
    # delete the NaN data
    print(df_train)
    # show the result

  

The resulting df_train:

     Survived  Pclass Sex   Age  SibSp  Parch     Fare
0           0       3   1  22.0      1      0   7.2500
1           1       1   0  38.0      1      0  71.2833
2           1       3   0  26.0      0      0   7.9250
3           1       1   0  35.0      1      0  53.1000
4           0       3   1  35.0      0      0   8.0500
..        ...     ...  ..   ...    ...    ...      ...
885         0       3   0  39.0      0      5  29.1250
886         0       2   1  27.0      0      0  13.0000
887         1       1   0  19.0      0      0  30.0000
889         1       1   1  26.0      0      0  30.0000
890         0       3   1  32.0      0      0   7.7500

[714 rows x 7 columns]

Process finished with exit code 0
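As an aside, the row-by-row Sex loop above can be replaced by a single vectorized `map` call; a sketch on a toy frame (the values are illustrative, not the real train.csv):

```python
import pandas as pd

# a tiny stand-in for train.csv (values illustrative)
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
})

# vectorized equivalent of the row loop: map the strings to 1/0
df["Sex"] = df["Sex"].map({"male": 1, "female": 0})

# drop rows containing NaN, as in the code above
df = df.dropna(axis=0, how="any")

print(df)
```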

2. Splitting labels from features and initializing the parameters

The features and the label in the data need to be separated and handled individually.

After that, the parameters must be initialized.


    """
    2.split the label and the features
    """
    
    y_train = df_train.loc[:, "Survived"]
    print(y_train)
    # y of the train

    X_train = df_train.loc[:, "Pclass": "Fare"]
    print(X_train)
    # X of the train

    """
    3.set the initial params of the liner regression
    """
    alpha = float(input("input the alpha:\\n"))


    list_theta = []
    for number_of_the_total_thetas_list in range(6 + 1):
        list_theta[number_of_the_total_thetas_list] = float(
            input(f"input the theta {number_of_the_total_thetas_list}:\\n"))
        # input the theta

3. Running the linear regression

Here is the code that trains the model.

One thing must be stressed!!

The choice of parameters matters enormously!!!!
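One reason the parameters are so sensitive here is that the features live on wildly different scales: Fare runs into the tens or hundreds while Sex is just 0/1, so a large alpha makes the Fare term blow up. A common remedy (not used in this experiment) is to standardize each feature column first; a minimal numpy sketch with made-up values:

```python
import numpy as np

# two illustrative feature columns on very different scales,
# e.g. something like Fare next to Sex
X = np.array([[7.25, 1.0],
              [71.28, 0.0],
              [7.93, 0.0]])

# standardize each column to mean 0 and standard deviation 1
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # ~0 for each column
print(X_scaled.std(axis=0))   # 1 for each column
```

With features on a common scale, a single learning rate works for all thetas and gradient descent converges with a far less delicate alpha.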



    """
    4.make the machine learning of the liner regression operations
    """
    iter_of_regression = int(input("input the number of iter times:\\n"))

    for num_of_iter_of_regression in range(iter_of_regression):
        # make iter_of_regression times of the regression
        h_x = list_theta[0] + \\
              list_theta[1] * X_train.loc[:, "Pclass"] + \\
              list_theta[2] * X_train.loc[:, "Sex"] + \\
              list_theta[3] * X_train.loc[:, "Age"] + \\
              list_theta[4] * X_train.loc[:, "SibSp"] + \\
              list_theta[5] * X_train.loc[:, "Parch"] + \\
              list_theta[6] * X_train.loc[:, "Fare"]
        # 7 theta
        # 6 x of the feature

        loss = y_train - h_x
        # calculate the loss

        # loss ^ 2
        print(loss.T.dot(loss))


        # the sum of loss:
        sum_loss = 0
        # plus the loss
        for o in range(len(loss)):
            # print(loss.iloc[o])
            sum_loss += loss.iloc[o]  # float
        # list_theta[0] = list_theta[0] + alpha * sum_loss / len(loss)
        # print(loss)
        list_theta[0] += alpha * sum_loss / len(loss)
        # update the list_theta[0]

        # print(list(X_train.index))
        # 0's index
        for c in [1, 2, 3, 4, 5, 6]:
            list_theta[c] += \
                alpha * (X_train.loc[:, list(X_train)[c - 1]].T.dot(loss)) / len(loss)
        # update the list theta of the params
        # X_train.loc[:, c - 1].T.dot(loss)
        # T transfer, dot dot_multiply

        # do all the thetas !!

        # continue

        print(list_theta)
        # theta
        # print(loss.T.dot(loss))
        # print(loss.T.dot(loss))
        # loss ^ 2

        continue
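For reference, the per-theta update loop above can also be written as a single matrix update with numpy. This sketch is my own restatement on toy data, using the same sign convention (loss = y - h); the variable names are mine:

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    # batch gradient descent for linear regression in matrix form:
    # theta += alpha * Xb^T (y - Xb theta) / m, with a prepended bias column
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])  # column of ones for theta0
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        loss = y - Xb.dot(theta)          # same sign convention as above
        theta += alpha * Xb.T.dot(loss) / m
    return theta

# toy data where y = 2 * x exactly
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(X, y, alpha=0.1, iters=2000)
print(theta)  # approaches [0, 2]
```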


4. Testing and scoring the model


    """
    5.do the test of this project
    """

    # the same operation
    df_test = read_data_of_csv("titanic/test.csv")
    df_test.drop("Embarked", axis=1, inplace=True)
    df_test.drop("Cabin", axis=1, inplace=True)
    df_test.drop("Ticket", axis=1, inplace=True)
    df_test.drop("Name", axis=1, inplace=True)
    df_test.drop("PassengerId", axis=1, inplace=True)
    # delete

    for int_number_of_len in range(len(df_test)):
        if df_test.loc[int_number_of_len, "Sex"] == "male":
            df_test.loc[int_number_of_len, "Sex"] = 1
            # if male then set the sex 1
        else:
            df_test.loc[int_number_of_len, "Sex"] = 0
            # if female then set the sex 0
    # change the introduction of sex from string to int 1 or 0

    # delete the NaNs
    df_test.dropna(axis=0, how="any", inplace=True)
    # delete the NaN data

    print(df_test)
    # show the result

    y_test = df_test.loc[:, "Survived"]
    # note: Kaggle's official test.csv has no Survived column,
    # so this step assumes a labeled test file

    X_test = df_test.loc[:, "Pclass":"Fare"]

    print(y_test)
    print(X_test)

    test_h_x = list_theta[0] + \
        list_theta[1] * X_test.loc[:, "Pclass"] + \
        list_theta[2] * X_test.loc[:, "Sex"] + \
        list_theta[3] * X_test.loc[:, "Age"] + \
        list_theta[4] * X_test.loc[:, "SibSp"] + \
        list_theta[5] * X_test.loc[:, "Parch"] + \
        list_theta[6] * X_test.loc[:, "Fare"]
    # the final hypothesis, evaluated on the test features

    """
    6.score the model
    """

    N = len(y_test)
    # the total of the test number
    num_of_win_the_prediction_of_the_model = 0
    # predict the result

    # as long as we are right, whether the person is alive or dead, it does not matter
    for win_number_of_the_test_of_each in range(N):
        if test_h_x.iloc[win_number_of_the_test_of_each] < 0.5:
            test_h_x.iloc[win_number_of_the_test_of_each] = 0
            # prediction < 0.5 => 0
            if y_test.iloc[win_number_of_the_test_of_each] == 0:
                num_of_win_the_prediction_of_the_model += 1
                # right!
                # right, so we num of win ++
            else:
                pass
                # wrong

        else:
            test_h_x.iloc[win_number_of_the_test_of_each] = 1
            # prediction >= 0.5 => 1
            if y_test.iloc[win_number_of_the_test_of_each] == 1:
                num_of_win_the_prediction_of_the_model += 1
                # right
            else:
                pass
                # wrong
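The counting loop above amounts to thresholding the predictions at 0.5 and measuring accuracy; the same computation in vectorized form, on made-up predictions and labels:

```python
import numpy as np

# illustrative predictions and true labels (not the real model output)
test_h_x = np.array([0.3, 0.8, 0.55, 0.1])
y_test = np.array([0, 1, 0, 0])

# threshold at 0.5, then count matches, exactly like the loop above
pred = (test_h_x >= 0.5).astype(int)
accuracy = (pred == y_test).mean()
print(accuracy)
```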

    

5. Saving the results

We open a txt file and save the results into it.


    """
    7.save the model
    """

    with open("result.txt", "w") as f:
        f.write("result record:\n")
        # result

        f.write("the alpha:\n")
        f.write(f"{alpha}")
        f.write("\n")
        # 1.alpha

        f.write("the thetas of the model:\n")
        recording_number_position = 1
        for theta_of_the_last in list_theta:
            f.write(f"{recording_number_position}. ")
            f.write(f"{theta_of_the_last}")
            f.write("\n")
            recording_number_position += 1
        # 2.write the thetas
        f.write("\n")

        f.write("features:\n")
        r_n_p = 1
        for data_of_feature in list(X_train):
            f.write(f"{r_n_p}. ")
            f.write(f"{data_of_feature}")
            f.write("\n")
            r_n_p += 1
        f.write("\n")
        # 3.features

        f.write("score:\n")
        f.write(f"{num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        f.write(f"  or the 100 :{100 * num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        # 4.score

        # no explicit f.close() needed: the with-statement closes the file

    sys.exit("bye bye!")
"""
END
"""

The final results:

result record:
the alpha:
0.0001
the thetas of the model:
1. 1.2486050331704237
2. -0.16522774554911307
3. -0.48151480883845743
4. -0.00541835063447703
5. -0.04947303449597998
6. -0.011060136497625706
7. 0.0005749482239116638

features:
1. Pclass
2. Sex
3. Age
4. SibSp
5. Parch
6. Fare

score:
0.4954682779456193
  or the 100 :49.546827794561935

As the numbers above show, this model is not very good; it doesn't even reach a passing grade, www~~

But no matter: later we will redo this case with logistic regression, and that one should clearly do better.

V. Full Code

"""

the titanic survival prediction of machine learning

by
author: Hu Yu Xuan

at
time: 2021/8/9

using
method: linear regression

"""


import numpy
import pandas
import sys


def read_data_of_csv(file_name):
    """
    read the csv files to get the data of the titanic accident
    :param file_name: the name of the file
    :return: df -> the data in the file that is opened above
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    """
    main
    """

    # here we need not to split the train and the test !

    """
    1.get data and do the prior things before the machine learning
    """

    df_train = read_data_of_csv("titanic/train.csv")
    # print(df_train)
    # deal with the data first before machine learning
    df_train.drop("Embarked", axis=1, inplace=True)
    # i think embarked is not useful, so i delete this embarked line
    df_train.drop("Cabin", axis=1, inplace=True)
    # delete the cabin
    df_train.drop("Ticket", axis=1, inplace=True)
    # delete the ticket
    df_train.drop("Name", axis=1, inplace=True)
    # delete the name
    df_train.drop("PassengerId", axis=1, inplace=True)
    # delete the passenger id

    for int_number_of_len in range(len(df_train)):
        if df_train.loc[int_number_of_len, "Sex"] == "male":
            df_train.loc[int_number_of_len, "Sex"] = 1
            # if male then set the sex 1
        else:
            df_train.loc[int_number_of_len, "Sex"] = 0
            # if female then set the sex 0
    # change the introduction of sex from string to int 1 or 0
    df_train.dropna(axis=0, how="any", inplace=True)
    # delete the NaN data
    print(df_train)
    # show the result

    """
    2.split the label and the features
    """

    y_train = df_train.loc[:, "Survived"]
    print(y_train)
    # y of the train

    X_train = df_train.loc[:, "Pclass": "Fare"]
    print(X_train)
    # X of the train

    """
    3.set the initial params of the liner regression
    """

    alpha = float(input("input the alpha:\\n"))
    # alpha
    list_theta = [0, 0, 0, 0, 0, 0, 0]
    # 7 params
    for number_of_the_total_thetas_list in range(6 + 1):
        list_theta[number_of_the_total_thetas_list] = float(
            input(f"input the theta {number_of_the_total_thetas_list}:\n"))
        # input the theta

    """
    4.make the machine learning of the liner regression operations
    """

    iter_of_regression = int(input("input the number of iter times:\\n"))

    for num_of_iter_of_regression in range(iter_of_regression):
        # make iter_of_regression times of the regression
        h_x = list_theta[0] + \
              list_theta[1] * X_train.loc[:, "Pclass"] + \
              list_theta[2] * X_train.loc[:, "Sex"] + \
              list_theta[3] * X_train.loc[:, "Age"] + \
              list_theta[4] * X_train.loc[:, "SibSp"] + \
              list_theta[5] * X_train.loc[:, "Parch"] + \
              list_theta[6] * X_train.loc[:, "Fare"]
        
