2/15 Simple Linear Regression


 

  • Regression: Predict a continuous response

    Linear regression

    Pros: fast, no tuning required, highly interpretable, well-understood

    Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

    Form of linear regression

    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

    • $y$ is the response
    • $\beta_0$ is the intercept
    • $\beta_1$ is the coefficient for $x_1$ (the first feature)
    • $\beta_n$ is the coefficient for $x_n$ (the nth feature)

    In this case:

    $y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

    The $\beta$ values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions.

    import pandas as pd
    import seaborn as sns

    # allow plots to appear within the notebook
    %matplotlib inline
    
    # read CSV file directly from a URL and save the results
    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)  # treat the first column as the index
    data.head()
    
    # visualize the relationship between the features and the response using scatterplots
    sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')  # `height` was called `size` in older seaborn
    
    # define the features and response for the full three-feature model
    feature_cols = ['TV', 'Radio', 'Newspaper']
    X = data[feature_cols]
    y = data.Sales
    
    from sklearn.model_selection import train_test_split  # `sklearn.cross_validation` was removed in modern scikit-learn
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # default split is 75% for training and 25% for testing
    print(X_train.shape)
    print(y_train.shape)
    print(X_test.shape)
    print(y_test.shape)
    
    #Linear Regression in scikit-learn
    # import model
    from sklearn.linear_model import LinearRegression
    
    # instantiate
    linreg = LinearRegression()
    
    # fit the model to the training data (learn the coefficients)
    linreg.fit(X_train, y_train)
    
    #Interpreting model coefficients
    # print the intercept and coefficients
    print(linreg.intercept_)
    print(linreg.coef_)
    
    # pair the feature names with the coefficients
    list(zip(feature_cols, linreg.coef_))
    
    #Making predictions
    # make predictions on the testing set
    y_pred = linreg.predict(X_test)
    
    #Feature selection
    # create a Python list of feature names
    feature_cols = ['TV', 'Radio']
    
    # use the list to select a subset of the original DataFrame
    X = data[feature_cols]
    
    # select a Series from the DataFrame
    y = data.Sales
    
    # split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # fit the model to the training data (learn the coefficients)
    linreg.fit(X_train, y_train)
    
    # make predictions on the testing set
    y_pred = linreg.predict(X_test)
    
    # compute the RMSE of our predictions
    import numpy as np
    from sklearn import metrics
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    Output:

    [('TV', 0.046564567874150288),
     ('Radio', 0.17915812245088836),
     ('Newspaper', 0.0034504647111804065)]

    RMSE = 1.38790346994

    How do we interpret the TV coefficient (0.0466)?

    • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
    • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

    Important notes:

    • This is a statement of association, not causation.
    • If an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
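    The interpretation above can be checked numerically: in a fitted linear model, raising one feature by one unit while holding the others fixed changes the prediction by exactly that feature's coefficient. A minimal sketch on synthetic data (the numbers and feature layout here are made up for illustration, not the Advertising dataset):

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(50, 3))   # stand-ins for TV, Radio, Newspaper spend
    y = 3.0 + 0.05 * X[:, 0] + 0.18 * X[:, 1] + rng.normal(0, 0.5, size=50)

    model = LinearRegression().fit(X, y)

    base = np.array([[50.0, 20.0, 30.0]])
    bumped = base + np.array([[1.0, 0.0, 0.0]])  # +1 unit of the first feature only

    delta = model.predict(bumped)[0] - model.predict(base)[0]
    print(delta, model.coef_[0])  # equal up to floating-point error
    ```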

    Model evaluation metrics for regression

    Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

    Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

    Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

    $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$

    Mean Squared Error (MSE) is the mean of the squared errors:

    $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

    $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
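    A minimal sketch computing all three metrics with scikit-learn and NumPy, on hand-picked example values (the numbers are illustrative):

    ```python
    from sklearn import metrics
    import numpy as np

    # hand-picked example true and predicted values
    true = [100, 50, 30, 20]
    pred = [90, 50, 50, 30]

    mae = metrics.mean_absolute_error(true, pred)   # mean of |error|
    mse = metrics.mean_squared_error(true, pred)    # mean of squared error
    rmse = np.sqrt(mse)                             # back in the units of y

    print(mae)   # 10.0
    print(mse)   # 150.0
    print(rmse)  # ~12.247
    ```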

    Comparing these metrics:

    • MAE is the easiest to understand, because it's the average error.
    • MSE is more popular than MAE, because MSE "punishes" larger errors.
    • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

    The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, it is unlikely that Newspaper is useful for predicting Sales, and it should be removed from the model.
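    As a closing aside, the "least squares" coefficients that `LinearRegression.fit` learns also have a closed-form solution, the normal equation $\hat\beta = (X^TX)^{-1}X^Ty$. A minimal NumPy sketch on synthetic data (made up for illustration), checking it against scikit-learn:

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=100)

    # prepend a column of ones so the intercept is just another coefficient
    Xb = np.column_stack([np.ones(len(X)), X])

    # solve the least-squares problem (lstsq is the numerically stable route)
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    sk = LinearRegression().fit(X, y)
    print(beta)                     # [intercept, coef_1, coef_2]
    print(sk.intercept_, sk.coef_)  # should match to numerical precision
    ```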



