Python Data Analysis and Data Mining for Beginners, Chapter 3: Common Machine Learning Algorithms, Section 2: Linear Regression (Part 2), Hands-On

Posted 跨界混子

Linear regression algorithm

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
In [ ]:
boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
X = boston.data[:, 5]  # RM: average number of rooms per dwelling
y = boston.target
print(X.shape)
print(y.shape)
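
Selecting a single column as above gives a 1-D array, while scikit-learn estimators expect a 2-D feature matrix of shape (n_samples, n_features). A minimal sketch of the reshape that would be needed before fitting on one feature, using a made-up array `rm` as a stand-in for the RM column:

```python
import numpy as np

# Hypothetical stand-in for a single feature column such as boston.data[:, 5]
rm = np.array([6.5, 6.4, 7.1, 6.9])
print(rm.shape)       # 1-D array

# Turn the 1-D array into a single-column 2-D matrix: (n_samples, 1)
rm_2d = rm.reshape(-1, 1)
print(rm_2d.shape)    # 2-D array with one feature column
```

The scatter plot in this section works with the 1-D form directly, so the reshape only matters once a model is fitted on this single feature.
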
In [ ]:
print(boston.DESCR)  # dataset description
In [ ]:
plt.scatter(X, y)  # scatter plot of a single variable (RM) against price
 

Signature: plt.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, hold=None, data=None, **kwargs) Docstring: Make a scatter plot of x vs y.

Marker size is scaled by s and marker color is mapped to c.

Parameters

x, y : array_like, shape (n, ) Input data

s : scalar or array_like, shape (n, ), optional size in points^2. Default is rcParams['lines.markersize'] ** 2.

c : color, sequence, or sequence of color, optional, default: 'b' c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.

marker : ~matplotlib.markers.MarkerStyle, optional, default: 'o' See ~matplotlib.markers for more information on the different styles of markers scatter supports. marker can be either an instance of the class or the text shorthand for a particular marker.

cmap : ~matplotlib.colors.Colormap, optional, default: None A ~matplotlib.colors.Colormap instance or registered name. cmap is only used if c is an array of floats. If None, defaults to rc image.cmap.

norm : ~matplotlib.colors.Normalize, optional, default: None A ~matplotlib.colors.Normalize instance is used to scale luminance data to 0, 1. norm is only used if c is an array of floats. If None, use the default :func:normalize.

vmin, vmax : scalar, optional, default: None vmin and vmax are used in conjunction with norm to normalize luminance data. If either are None, the min and max of the color array is used. Note if you pass a norm instance, your settings for vmin and vmax will be ignored.

alpha : scalar, optional, default: None The alpha blending value, between 0 (transparent) and 1 (opaque)

linewidths : scalar or array_like, optional, default: None If None, defaults to (lines.linewidth,).

verts : sequence of (x, y), optional If marker is None, these vertices will be used to construct the marker. The center of the marker is located at (0,0) in normalized units. The overall marker is rescaled by s.

edgecolors : color or sequence of color, optional, default: None If None, defaults to 'face'

If 'face', the edge color will always be the same as
the face color.

If it is 'none', the patch boundary will not
be drawn.

For non-filled markers, the `edgecolors` kwarg
is ignored and forced to 'face' internally.

Returns

paths : ~matplotlib.collections.PathCollection

Other Parameters

**kwargs : ~matplotlib.collections.Collection properties

See Also

plot : to plot scatter plots when markers are identical in size and color

Notes

  • The plot function will be faster for scatterplots where markers don't vary in size or color.

  • Any or all of x, y, s, and c may be masked arrays, in which case all masks will be combined and only unmasked points will be plotted.

    Fundamentally, scatter works with 1-D arrays; x, y, s, and c may be input as 2-D arrays, but within scatter they will be flattened. The exception is c, which will be flattened only if its size matches the size of x and y.

.. note:: In addition to the above described arguments, this function can take a data keyword argument. If such a data argument is given, the following arguments are replaced by data[<arg>]:

* All arguments with the following names: 'c', 'color', 'edgecolors', 'facecolor', 'facecolors', 'linewidths', 's', 'x', 'y'.
In [ ]:
X
In [ ]:
y.max()
 

Docstring: a.max(axis=None, out=None, keepdims=False)

In [ ]:
X = X[y < 50]  # drop samples with y >= 50
y = y[y < 50]
print(X.shape)
print(y.shape)
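
The filtering above relies on statement order: X is filtered while y still has its original length, and only then is y itself filtered. Swapping the two lines would build the mask from the already-shortened y, which no longer lines up with X. A small sketch of an order-independent variant, with made-up toy values and hypothetical names (`X_demo`, `y_demo`, `mask`):

```python
import numpy as np

# Toy data standing in for a feature column and the target (values are made up)
X_demo = np.array([5.0, 6.0, 7.0, 8.0])
y_demo = np.array([20.0, 50.0, 30.0, 50.0])

# Compute the boolean mask once, then apply it to both arrays
mask = y_demo < 50
X_kept = X_demo[mask]   # features of the rows whose target is below 50
y_kept = y_demo[mask]   # the matching targets

print(X_kept)
print(y_kept)
```
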
In [ ]:
plt.scatter(X,y)
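
For a single feature like the one plotted here, the least-squares line can be written in closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below illustrates this on synthetic data (the slope 9.0, intercept -35.0 and noise level are arbitrary assumptions, not values from the Boston data) and cross-checks the result against `np.polyfit`:

```python
import numpy as np

# Synthetic single-feature data; the true slope and intercept are made up
rng = np.random.default_rng(0)
x = rng.uniform(4, 9, size=200)
y_line = 9.0 * x - 35.0 + rng.normal(0, 1.0, size=200)

# Closed-form simple linear regression
x_mean, y_mean = x.mean(), y_line.mean()
slope = np.sum((x - x_mean) * (y_line - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)
print(np.polyfit(x, y_line, 1))  # degree-1 polynomial fit: [slope, intercept]
```
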
 

Multiple linear regression

In [ ]:
X = boston.data 
y = boston.target
X = X[y < 50]
y = y[y < 50]
In [ ]:
from sklearn.model_selection import train_test_split  # data-splitting utility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # split into training and test sets
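
Because the split above is random, every run produces different train and test sets and therefore slightly different scores later on. If reproducible numbers are wanted, `train_test_split` accepts a `random_state` seed. A sketch on toy arrays (the seed 42 and the array contents are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X_demo is [2*i, 2*i + 1], and y_demo[i] is i
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 80% of the rows for training, 20% for testing

# Repeating the call with the same seed gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```
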
In [ ]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
 

Init signature: LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1) Docstring:
Ordinary least squares Linear Regression.

Parameters

fit_intercept : boolean, optional, default True whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).

normalize : boolean, optional, default False This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use :class:sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.

copy_X : boolean, optional, default True If True, X will be copied; else, it may be overwritten.

n_jobs : int, optional, default 1 The number of jobs to use for the computation. If -1 all CPUs are used. This will only provide speedup for n_targets > 1 and sufficient large problems.

Attributes

coef_ : array, shape (n_features, ) or (n_targets, n_features) Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

intercept_ : array Independent term in the linear model.

Notes

From the implementation point of view, this is just plain Ordinary Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.

File: c:\users\qq123\anaconda3\lib\site-packages\sklearn\linear_model\base.py Type: ABCMeta

In [ ]:
lin_reg.fit(X_train,y_train)
 

Signature: lin_reg.fit(X, y, sample_weight=None) Docstring: Fit linear model.

Parameters

X : numpy array or sparse matrix of shape [n_samples,n_features] Training data

y : numpy array of shape [n_samples, n_targets] Target values. Will be cast to X's dtype if necessary

sample_weight : numpy array of shape [n_samples] Individual weights for each sample

.. versionadded:: 0.17
   parameter *sample_weight* support to LinearRegression.

Returns

self : returns an instance of self.

In [ ]:
lin_reg.coef_  # coefficients
In [ ]:
lin_reg.intercept_  # intercept
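
Since the docstring describes LinearRegression as ordinary least squares wrapped as a predictor, `coef_` and `intercept_` can be cross-checked against a direct least-squares solve. The sketch below does this on synthetic data with `np.linalg.lstsq`; the true weights, the intercept 4.0 and the names `X_demo`, `y_demo`, `A` and `w` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with made-up true weights and intercept
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# Append a column of ones so the last solved weight acts as the intercept
A = np.hstack([X_demo, np.ones((100, 1))])
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(model.coef_, model.intercept_)
print(w[:3], w[3])
```
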
In [ ]:
lin_reg.score(X_test,y_test)
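
For regressors, `score` returns the coefficient of determination R^2, that is 1 minus the ratio of the residual sum of squares to the total sum of squares. A sketch that recomputes it by hand on synthetic data (the data and variable names are made up for the illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with two features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=80)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(model.score(X_demo, y_demo))  # should agree with the manual value
```

A value of 1 means perfect predictions; a model that always predicts the mean of y scores 0, and worse models can score below 0.
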

K-nearest neighbors regression algorithm

In [ ]:
from sklearn.neighbors import KNeighborsRegressor  # import the KNN regressor
In [ ]:
knn_reg = KNeighborsRegressor()  # create the regressor with default settings
knn_reg.fit(X_train,y_train)
knn_reg.score(X_test,y_test)
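
With default settings (5 neighbors, uniform weights, Euclidean distance), a KNN regressor predicts the plain average of the targets of the 5 nearest training points. The sketch below reproduces that by hand on random synthetic data and compares it with `KNeighborsRegressor`; all array names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Random synthetic training set and a few query points
rng = np.random.default_rng(2)
X_tr = rng.normal(size=(50, 2))
y_tr = rng.normal(size=50)
X_new = rng.normal(size=(5, 2))

k = 5
# Pairwise Euclidean distances between query and training points: shape (5, 50)
dists = np.linalg.norm(X_new[:, None, :] - X_tr[None, :, :], axis=2)
# Indices of the k closest training points for each query point
nearest = np.argsort(dists, axis=1)[:, :k]
# Uniform weights: the prediction is the mean target of those neighbors
manual_pred = y_tr[nearest].mean(axis=1)

sk_pred = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_new)
print(manual_pred)
print(sk_pred)
```
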
In [ ]:
from sklearn.model_selection import GridSearchCV
para_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': list(range(1, 11))
    },
    {
        'weights': ['distance'],
        'n_neighbors': list(range(1, 11)),
        'p': list(range(1, 6))
    }
]
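
A list of dictionaries is searched as separate sub-grids: the first contributes 10 candidates and the second 10 × 5 = 50, for 60 parameter combinations in total. One way to check this before starting a long search is `ParameterGrid`; the sketch rebuilds the same grid under the hypothetical name `param_grid` so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, rebuilt here so the snippet is self-contained
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 10 + 10 * 5 = 60
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is this count multiplied by the number of folds.
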
In [ ]:
knn_reg_grid = KNeighborsRegressor(n_jobs = -1)
grid_search = GridSearchCV(knn_reg_grid,para_grid,verbose =1)
grid_search.fit(X_train,y_train)
In [ ]:
grid_search.best_estimator_
In [ ]:
grid_search.best_score_
In [ ]:
grid_search.best_estimator_.score(X_test,y_test)
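
The two numbers above measure different things: `best_score_` is the mean cross-validated score of the winning parameter set, computed only on the training data, while the last cell scores the refitted model on the held-out test set. A compact sketch on synthetic data, with a deliberately small grid and made-up names, that also prints the winning parameters through `best_params_`:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic one-feature problem: a noisy sine curve
rng = np.random.default_rng(3)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)        # winning parameter combination
print(search.best_score_)         # mean cross-validated R^2 on the training data
print(search.score(X_te, y_te))   # R^2 of the refitted best model on the test set
```
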
 

Ranking features by coefficient weight

In [ ]:
lin_reg.coef_  # coefficients
In [ ]:
np.argsort(lin_reg.coef_)
 

Signature: np.argsort(a, axis=-1, kind='quicksort', order=None) Docstring: Returns the indices that would sort an array.

Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in sorted order.

Parameters

a : array_like Array to sort.

axis : int or None, optional Axis along which to sort. The default is -1 (the last axis). If None, the flattened array is used.

kind : {'quicksort', 'mergesort', 'heapsort'}, optional Sorting algorithm.

order : str or list of str, optional When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.

Returns

index_array : ndarray, int Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.

See Also

sort : Describes sorting algorithms used.

lexsort : Indirect stable sort with multiple keys.

ndarray.sort : Inplace sort.

argpartition : Indirect partial sort.

Notes

See sort for notes on the different sorting algorithms.

As of NumPy 1.4.0 argsort works with real/complex arrays containing nan values. The enhanced sort order is documented in sort.

Examples

One dimensional array:

>>> x = np.array([3, 1, 2])
>>> np.argsort(x)
array([1, 2, 0])

Two-dimensional array:

>>> x = np.array([[0, 3], [2, 2]])
>>> x
array([[0, 3],
       [2, 2]])

>>> np.argsort(x, axis=0)  # sorts along first axis (down)
array([[0, 1],
       [1, 0]])

>>> np.argsort(x, axis=1)  # sorts along last axis (across)
array([[0, 1],
       [0, 1]])

Indices of the sorted elements of a N-dimensional array:

>>> ind = np.unravel_index(np.argsort(x, axis=None), x.shape)
>>> ind
(array([0, 1, 1, 0]), array([0, 0, 1, 1]))
>>> x[ind]  # same as np.sort(x, axis=None)
array([0, 2, 2, 3])

Sorting with keys:

>>> x = np.array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
>>> x
array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])

>>> np.argsort(x, order=('x','y'))
array([1, 0])

>>> np.argsort(x, order=('y','x'))
array([0, 1])

In [ ]:
lin_reg.coef_[np.argsort(lin_reg.coef_)]  # ascending order
In [ ]:
boston.feature_names
In [ ]:
boston.feature_names[np.argsort(lin_reg.coef_)]
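
Two caveats apply when reading this ordering as feature importance. `np.argsort` sorts signed values, so strongly negative coefficients come first rather than being ranked by magnitude, and raw coefficients depend on each feature's units. A hedged sketch of one common alternative, ranking by absolute coefficient after standardizing the features, on synthetic data with hypothetical feature names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: the second feature lives on a much larger scale (all values made up)
rng = np.random.default_rng(4)
names = np.array(["f_small_scale", "f_large_scale", "f_noise"])  # hypothetical names
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=100.0, size=300),
    rng.normal(scale=1.0, size=300),
])
y_demo = 1.0 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

# Rank by absolute coefficient, largest first
order_raw = names[np.argsort(np.abs(raw.coef_))[::-1]]
order_scaled = names[np.argsort(np.abs(scaled.coef_))[::-1]]
print(order_raw)     # ranking from raw coefficients
print(order_scaled)  # ranking after standardizing the features
```

In this toy setup the large-scale feature has a small raw coefficient but the largest effect on y once the features are put on a common scale, so the two rankings disagree.
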
