Stanford Machine Learning - Week 3 (Classification, Logistic Regression, Overfitting and How to Address It)

Posted by 月来客栈



Logistic Regression



1. Classification


The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, $y \in \{0, 1\}$. 0 is also called the negative class, and 1 the positive class.


In short, classification means using a set of feature values to divide a data set into different categories; that is, the final output $y$ is a discrete value. Spam filtering is a typical example.



2. Hypothesis Function

In logistic regression the hypothesis function has the same nature and meaning as in linear regression; only its form changes.


We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples, like the ones below, where this method performs very poorly.


Example 1:

[Figure: a linear regression line fit to a binary classification data set]

In the figure above, we can express the hypothesis with the following rule:

When $h_\theta(x) \geq 0.5$, predict $y = 1$;

when $h_\theta(x) < 0.5$, predict $y = 0$. (As for why the threshold is 0.5: this is covered in the Week 6 material. Briefly, you can set it higher, say 0.9, to demand higher confidence, but that does not mean the model is optimal at that value.)

The problem with this rule is that if we add one more data point (see the figure below), the expression no longer works.

[Figure: the same data set with one additional data point]

Example 2:

In logistic regression, $0 \leq h_\theta(x) \leq 1$ (because $h_\theta(x)$ represents the probability that $y = 1$), whereas in linear regression $h_\theta(x)$ may be greater than 1 or less than 0, and its value does not represent a probability of anything.

Logistic Regression Model:

$h_\theta(x) = g(\theta^T x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$;

so $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$.


$h_\theta(x)$ will give us the probability that our output is 1. For example, $h_\theta(x) = 0.7$ gives us a probability of 70% that our output is 1.


$h_\theta(x) = P(y = 1 \mid x; \theta)$ (the probability that $y = 1$ given $x$, parameterized by $\theta$), and

$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$.

[Figure: graph of the sigmoid function $g(z)$]

Here $g(z)$ is called the sigmoid function, or the logistic function.
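As a small illustration (not part of the original lecture notes; the function names sigmoid and hypothesis are only illustrative), the hypothesis can be coded in Octave as:

function g = sigmoid(z)
  % the sigmoid (logistic) function; works element-wise on vectors and matrices
  g = 1 ./ (1 + exp(-z));
end

function p = hypothesis(theta, X)
  % h_theta(x) = g(theta' * x); X is an m-by-(n+1) design matrix with a leading column of ones
  p = sigmoid(X * theta);
end

For instance, sigmoid(0) returns 0.5, and sigmoid(z) approaches 1 as z grows large and positive.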



3. Decision Boundary


In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:




$h_\theta(x) \geq 0.5 \rightarrow y = 1$
$h_\theta(x) < 0.5 \rightarrow y = 0$


In other words, when $h_\theta(x)$ is at least 0.5 we can take $y$ to be 1, since the probability exceeds one half.



Also, from the graph of $h_\theta(x) = g(\theta^T x) = g(z)$ (with $z = \theta^T x$), we can draw the following conclusions:

When $z \geq 0$, $g(z) \geq 0.5$; that is, $h_\theta(x) \geq 0.5$ and so $y = 1$.

When $z < 0$, $g(z) < 0.5$; that is, $h_\theta(x) < 0.5$ and so $y = 0$.

It follows immediately that:

When $\theta^T x \geq 0$, $y = 1$; when $\theta^T x < 0$, $y = 0$. In other words, $\theta^T x$ splits the data set into two parts, and we call the line (or curve) $\theta^T x = 0$ the decision boundary. Note that the decision boundary is purely a property of the hypothesis function, not of anything else.


The decision boundary is a property of the hypothesis, including the parameters $\theta_0, \theta_1, \theta_2, \cdots$; it is the line that separates the area where y = 0 from the area where y = 1. It is created by our hypothesis function, and the data set is only used to fit the parameters theta.


Let's look at an example:

[Figure: training data plotted in the $(x_1, x_2)$ plane]

Given $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$, we have $h_\theta(x) = g(-3 + x_1 + x_2)$. From the earlier derivation:

$\theta^T x = -3 + x_1 + x_2 \geq 0 \rightarrow y = 1$

$\theta^T x = -3 + x_1 + x_2 < 0 \rightarrow y = 0$

So the original data set is split by the **decision boundary $\theta^T x = -3 + x_1 + x_2 = 0$** into the following two parts: the region above and to the right corresponds to $y = 1$, and the region below and to the left corresponds to $y = 0$.

[Figures: the decision boundary $-3 + x_1 + x_2 = 0$ and the two regions it separates]
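As a quick Octave sketch of this rule (the values below are just the example's $\theta$; the prediction rule itself is the one derived above):

theta = [-3; 1; 1];             % theta_0 = -3, theta_1 = 1, theta_2 = 1
X = [1 1 1;                     % x_1 = 1, x_2 = 1  ->  theta'*x = -1 < 0   ->  y = 0
     1 3 3];                    % x_1 = 3, x_2 = 3  ->  theta'*x =  3 >= 0  ->  y = 1
predictions = (X * theta >= 0)  % 1 wherever theta'*x >= 0, i.e. h_theta(x) >= 0.5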



4. Cost Function

[Figure: a non-convex cost with many local optima (left) versus a convex cost (right)]


We cannot use the same cost function that we use for linear regression, because the logistic function would make the output wavy, as in the figure on the left above, causing many local optima. In other words, it would not be a convex function.
Instead, our cost function for logistic regression looks like:


$J(\theta) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$

$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

Note: $0 \leq h_\theta(x) \leq 1$, and $h_\theta(x)$ represents the probability that $y = 1$.

[Figures: the cost curves $-\log(h_\theta(x))$ for $y = 1$ and $-\log(1 - h_\theta(x))$ for $y = 0$]

Since $y$ only takes the values 0 and 1, the expression can also be written in the following form:

$J(\theta) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\dfrac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$


A vectorized implementation is:

$h = g(X\theta)$

$J(\theta) = \dfrac{1}{m}\left(-y^T\log(h) - (1-y)^T\log(1-h)\right)$
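A minimal Octave sketch of this vectorized cost (assuming the sigmoid helper sketched earlier, an m-by-(n+1) design matrix X, and an m-by-1 label vector y; the name costFunctionVec is only illustrative):

function J = costFunctionVec(theta, X, y)
  % vectorized logistic-regression cost J(theta)
  m = length(y);
  h = sigmoid(X * theta);                                 % h = g(X * theta), m-by-1
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % scalar cost
end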



If our correct answer 'y' is 0:
then the cost function will be 0 if our hypothesis function also outputs 0;
and the cost function will approach infinity if our hypothesis approaches 1.

If our correct answer 'y' is 1:
then the cost function will be 0 if our hypothesis function outputs 1;
and the cost function will approach infinity if our hypothesis approaches 0.
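As a quick numerical check: with $y = 1$, a hypothesis output of $h_\theta(x) = 0.7$ costs $-\log(0.7) \approx 0.36$, while $h_\theta(x) = 0.01$ costs $-\log(0.01) \approx 4.6$, and the cost grows without bound as $h_\theta(x) \to 0$.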

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.




5. Gradient Descent

With the cost function in hand, the next step is to minimize $J(\theta)$ using gradient descent. Whether in the linear regression model or the logistic regression model, the basic form of the gradient descent algorithm is the same; only the form of $J(\theta)$ changes.


Gradient Descent
Remember that the general form of gradient descent is:


Repeat {

  $\theta_j := \theta_j - \alpha\dfrac{\partial}{\partial\theta_j}J(\theta)$

}



In logistic regression:

$J(\theta) = -\dfrac{1}{m}\left[\displaystyle\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$

So, after working out the derivative, the update rule is:


We can work out the derivative part using calculus to get:


Repeat {

  $\theta_j := \theta_j - \dfrac{\alpha}{m}\displaystyle\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

}


Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.


Here $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$, whereas in linear regression $h_\theta(x) = \theta^T x$.


A vectorized implementation is:


$\theta := \theta - \dfrac{\alpha}{m}X^T\left(g(X\theta) - \vec{y}\right)$

For the derivation, see the separate post on vectorizing the gradient descent update (关于梯度下降算法的矢量化过程).
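A minimal Octave sketch of this update loop (illustrative only; alpha and num_iters are assumed hyperparameters, and sigmoid is the helper sketched earlier):

function theta = gradientDescentLogistic(X, y, theta, alpha, num_iters)
  % vectorized gradient descent for logistic regression
  m = length(y);
  for iter = 1:num_iters
    h = sigmoid(X * theta);                          % current predictions, m-by-1
    theta = theta - (alpha / m) * (X' * (h - y));    % simultaneous update of every theta_j
  end
end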



6. Advanced Optimization

[Figure: optimization algorithms beyond gradient descent (conjugate gradient, BFGS, L-BFGS) and their trade-offs]


"Conjugate gradient", “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave provides them.


One way of describing gradient descent is this: the algorithm does two things — it computes $J(\theta)$, and it computes $\dfrac{\partial}{\partial\theta_j}J(\theta)$. Besides gradient descent, three other algorithms can do these two things to optimize the parameters $\theta$; they are faster than gradient descent and do not require manually choosing $\alpha$, but they are more complex. Because they are complex, we do not write these algorithms ourselves; we use open-source libraries instead. All we need to do is write the cost function and tell MATLAB/Octave which algorithm to use to optimize the parameters.

[Figure: using the fminunc function]

As shown in the figure, we can use the MATLAB/Octave function fminunc, together with a cost function that returns $J(\theta)$ and $\dfrac{\partial}{\partial\theta_j}J(\theta)$, to obtain the optimized value of the parameters $\theta$.

[figure]


You set a few options: options is a data structure that stores the options you want. Setting GradObj to on means you are indeed going to provide a gradient to this algorithm. We set the maximum number of iterations to, say, one hundred, and we give it an initial guess for theta, a 2-by-1 vector.


% costFunction.m
function [jVal, gradient] = costFunction(theta)
  % this function returns two values:
  %   jVal     - the value of the cost function J(theta)
  %   gradient - the partial derivative with respect to each parameter
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

>> options = optimset('GradObj', 'on', 'MaxIter', 100);
>> initialTheta = zeros(2, 1);
>> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
% optTheta      - the optimized parameter values
% functionVal   - the value of the cost function at optTheta
% exitFlag      - whether the algorithm converged (1 means converged)
% @costFunction - a handle to the costFunction defined above
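Running this, fminunc should return optTheta close to [5; 5], functionVal near 0, and exitFlag equal to 1, since the example cost $(\theta_1 - 5)^2 + (\theta_2 - 5)^2$ is minimized at $\theta = (5, 5)$.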

So, whether in logistic regression or linear regression, all we need to write ourselves is the content inside the red rectangle in the figure below (the cost function and its gradient).

[Figure: the code to supply yourself, highlighted in a red rectangle]



7. Multi-class Classification: One-vs-all

[figure]

In short, multi-class classification means that the output $y$ is no longer restricted to just 0 and 1. The idea for solving it is to split the training set into two parts each time, which is the one-vs-all approach.


One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.


[figure]

The approach:

[Figure: one-vs-all — training a separate binary classifier for each class]


We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.


To solve this problem, we train three classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$ as in figures 1, 2, and 3, which output the probabilities of $y = \text{class }1$, $y = \text{class }2$, and $y = \text{class }3$ respectively. Given a new input $x$, we evaluate all three classifiers and pick the class whose classifier gives the largest value.
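A minimal Octave sketch of the prediction step (illustrative only: all_theta is assumed to be a K-by-(n+1) matrix whose i-th row holds the parameters of the i-th classifier, and sigmoid is the helper sketched earlier):

function p = predictOneVsAll(all_theta, X)
  % X is m-by-(n+1) with a leading column of ones
  probs = sigmoid(X * all_theta');   % m-by-K: probability of each class for each example
  [~, p] = max(probs, [], 2);        % for every example, pick the class with the largest probability
end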



8. Overfitting

Since there is such a thing as overfitting, there must be a corresponding notion of underfitting. Simply put, overfitting happens when the hypothesis function is too complex: although it may fit the training set perfectly, it fails to predict new data well. This phenomenon is what we call overfitting.
