- Regression: Predict a continuous response
Linear regression
Pros: fast, no tuning required, highly interpretable, well-understood
Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)
Form of linear regression
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)
In this case:
$$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$$
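For example, once numeric values for the coefficients are known, a prediction is just this weighted sum. Below is a minimal sketch with made-up coefficient values (not the ones fitted later in this post), purely to show the arithmetic:

```python
# hypothetical (made-up) coefficients, purely for illustration
beta_0 = 3.0       # intercept
beta_tv = 0.05     # coefficient for TV
beta_radio = 0.2   # coefficient for Radio
beta_news = 0.01   # coefficient for Newspaper

# predicted Sales for one hypothetical ad budget (TV, Radio, Newspaper)
tv, radio, newspaper = 100.0, 25.0, 10.0
predicted_sales = beta_0 + beta_tv * tv + beta_radio * radio + beta_news * newspaper
print(predicted_sales)  # 3.0 + 5.0 + 5.0 + 0.1 = 13.1
```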
The $\beta$ values are called the model coefficients. They are "learned" during the model fitting step using the "least squares" criterion, and the fitted model can then be used to make predictions.
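As an aside (not part of the original walkthrough), "least squares" means choosing the coefficients that minimize the sum of squared differences between the observed and predicted responses. Here is a minimal sketch on synthetic data, using NumPy's `lstsq`, to make that concrete:

```python
import numpy as np

# synthetic data: 100 samples, 2 features, known true coefficients plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# add a column of ones so the intercept is estimated along with the coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# solve the least squares problem: minimize ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # approximately [4.0, 3.0, -2.0]
```

The full scikit-learn walkthrough on the Advertising dataset follows.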
```python
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import metrics

# allow plots to appear within the notebook
%matplotlib inline

# read CSV file directly from a URL and save the results
# (index_col=0 uses the first column of the file as the DataFrame index)
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()

# visualize the relationship between the features and the response using scatterplots
# (height replaces the older 'size' argument in recent seaborn versions)
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')

# prepare X (feature matrix) and y (response vector)
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales

# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# Linear regression in scikit-learn
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# Interpreting model coefficients
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

# Making predictions
# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions (all three features)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# Feature selection
# create a Python list of feature names (drop Newspaper)
feature_cols = ['TV', 'Radio']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# select a Series from the DataFrame
y = data.Sales

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions (TV and Radio only)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```
How do we interpret the TV coefficient (0.0466)?
- For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
- Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.
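To see this numerically, here is a minimal sketch (assuming `linreg`, `X_test`, and `feature_cols` as defined in the walkthrough above): increasing TV by one unit for a single row, holding the other features fixed, changes the prediction by exactly the TV coefficient.

```python
# take one row of features and a copy with TV increased by 1 unit
row = X_test.iloc[[0]].copy()
row_plus_one = row.copy()
row_plus_one['TV'] = row_plus_one['TV'] + 1

# the difference between the two predictions equals the TV coefficient
diff = linreg.predict(row_plus_one)[0] - linreg.predict(row)[0]
print(diff)  # matches linreg.coef_[feature_cols.index('TV')]
```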
Important notes:
- This is a statement of association, not causation.
- If an increase in TV ad spending was associated with a decrease in sales, the TV coefficient ($\beta_1$) would be negative.
Model evaluation metrics for regression
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
Let's create some example numeric predictions and calculate three common evaluation metrics for regression problems (see the sketch after the definitions below):
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
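Here is a minimal sketch of the example promised above: a small set of hand-made true values and predictions (the numbers are made up purely for illustration), with the three metrics computed via scikit-learn and NumPy:

```python
import numpy as np
from sklearn import metrics

# example true and predicted response values (made up for illustration)
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

# MAE: average magnitude of the errors
print(metrics.mean_absolute_error(true, pred))           # 10.0

# MSE: average of the squared errors (punishes large errors)
print(metrics.mean_squared_error(true, pred))            # 150.0

# RMSE: square root of MSE, back in the units of the response
print(np.sqrt(metrics.mean_squared_error(true, pred)))   # ~12.25
```

Note how the single error of 20 contributes 400 of the 600 total squared error, which is why MSE and RMSE penalize large misses more heavily than MAE does.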
Comparing these metrics:
- MAE is the easiest to understand, because it's the average error.
- MSE is more popular than MAE, because MSE "punishes" larger errors.
- RMSE is even more popular than MSE, because RMSE is interpretable in the same units as the response ("y").
The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, Newspaper is unlikely to be useful for predicting Sales and should be dropped from the model.
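To make this kind of comparison systematic, here is a small sketch of a helper (my own addition, with a hypothetical name, assuming `data`, `train_test_split`, `LinearRegression`, `np`, and `metrics` from the walkthrough above) that returns the test-set RMSE for any list of feature columns:

```python
def train_test_rmse(feature_cols):
    """Fit a linear regression on the given features and return the test-set RMSE."""
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# compare RMSE with and without Newspaper
print(train_test_rmse(['TV', 'Radio', 'Newspaper']))
print(train_test_rmse(['TV', 'Radio']))
```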