CS 229 notes Supervised Learning




These notes give the proof of the normal equation and, before that, some linear algebra identities that are used in the proof.

The normal equation

Linear algebra preparation

For two matrices A and B such that AB is square, $\mathrm{tr}\,AB = \mathrm{tr}\,BA$.
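The identity follows from $\mathrm{tr}\,AB = \sum_i\sum_j A_{ij}B_{ji} = \mathrm{tr}\,BA$. A minimal numerical sketch, assuming NumPy is available (the shapes and the random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))  # 3x5
B = rng.standard_normal((5, 3))  # 5x3, so AB is 3x3 and BA is 5x5

tr_AB = np.trace(A @ B)
tr_BA = np.trace(B @ A)

# The two products have different shapes, yet their traces agree
# up to floating-point rounding.
print(np.isclose(tr_AB, tr_BA))
```

Note that A and B need not be square themselves; only the product AB must be.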




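Some properties (A and B square, a a real number):

$\mathrm{tr}\,ABC = \mathrm{tr}\,CAB = \mathrm{tr}\,BCA$
$\mathrm{tr}\,A = \mathrm{tr}\,A^T$
$\mathrm{tr}(A+B) = \mathrm{tr}\,A + \mathrm{tr}\,B$
$\mathrm{tr}\,aA = a\,\mathrm{tr}\,A$
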
Some facts about matrix derivatives:

$\nabla_A\mathrm{tr}\,AB = B^T$ (1)
$\nabla_{A^T}f(A) = (\nabla_Af(A))^T$ (2)
$\nabla_A\mathrm{tr}\,ABA^TC = CAB+C^TAB^T$ (3)

Proof 1:


Proof 2:


$\nabla_A|A| = |A|(A^{-1})^T$ (4)

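Proof: expanding the determinant along row p gives $|A| = \sum_k A_{pk}C_{pk}$, and none of the cofactors $C_{pk}$ depends on $A_{pq}$, so
$(\nabla_A |A|)_{pq} = C_{pq} = A^*_{qp} = ((A^*)^T)_{pq} = |A|((A^{-1})^T)_{pq}$
($C_{pq}$ is the (p, q) cofactor of A, and $A^* = |A|A^{-1}$ is the adjugate of A.)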
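Equations (3) and (4) can be sanity-checked numerically by comparing each formula with a finite-difference estimate of the gradient. The sketch below assumes NumPy; `numerical_gradient` is a hypothetical helper written for this illustration, and the matrix size, seed, and step size are arbitrary.

```python
import numpy as np

def numerical_gradient(f, A, eps=1e-6):
    """Central-difference estimate of the matrix of partials df/dA_ij."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
# Adding n*I makes a singular A unlikely, so the inverse in (4) exists.
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

# Equation (3): gradient of tr(A B A^T C) with respect to A
grad3_numeric = numerical_gradient(lambda M: np.trace(M @ B @ M.T @ C), A)
grad3_formula = C @ A @ B + C.T @ A @ B.T

# Equation (4): gradient of det(A) with respect to A
grad4_numeric = numerical_gradient(np.linalg.det, A)
grad4_formula = np.linalg.det(A) * np.linalg.inv(A).T

print(np.allclose(grad3_numeric, grad3_formula, atol=1e-5))
print(np.allclose(grad4_numeric, grad4_formula, atol=1e-5))
```

A numerical check like this is not a proof, but it catches sign and transpose mistakes quickly.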

Least squares revisited

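Define the design matrix X, whose rows are the training inputs (it is m-by-n if we don’t include the intercept term):

$X = \begin{bmatrix}-(x^{(1)})^T-\\-(x^{(2)})^T-\\\vdots\\-(x^{(m)})^T-\end{bmatrix}$

and let $\vec y$ be the m-dimensional vector of target values:

$\vec y = \begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$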

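Since $h_\theta(x^{(i)}) = (x^{(i)})^T\theta$, the i-th entry of $X\theta-\vec{y}$ is $h_\theta(x^{(i)}) - y^{(i)}$. Using $z^Tz = \sum_i z_i^2$ for any vector z,

$\frac{1}{2}(X\theta-\vec{y})^T(X\theta-\vec{y}) = \frac{1}{2}\displaystyle\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 = J(\theta)$.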
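As an illustration, the matrix form and the summation form of $J(\theta)$ can be compared on random data. This is a minimal sketch assuming NumPy; the sizes m = 6, n = 3 and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
X = rng.standard_normal((m, n))   # rows are the training inputs x^(i)
y = rng.standard_normal(m)        # targets y^(i)
theta = rng.standard_normal(n)

# Matrix form: (1/2) (X theta - y)^T (X theta - y)
residual = X @ theta - y
J_matrix = 0.5 * residual @ residual

# Summation form: (1/2) sum_i (h_theta(x^(i)) - y^(i))^2
J_sum = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

print(np.isclose(J_matrix, J_sum))
```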

Combining equations (2) and (3) gives
$\nabla_{A^T}\mathrm{tr}\,ABA^TC = B^TA^TC^T+BA^TC$ (5)
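One way to see how (5) follows is to spell out the intermediate step. The sketch below takes $f(A) = \mathrm{tr}\,ABA^TC$, applies (2), substitutes (3), and then uses $(PQR)^T = R^TQ^TP^T$ to transpose each term:

```latex
\begin{aligned}
\nabla_{A^T}\,\mathrm{tr}\,ABA^TC
  &= \left(\nabla_A\,\mathrm{tr}\,ABA^TC\right)^T && \text{by (2)} \\
  &= \left(CAB + C^TAB^T\right)^T                && \text{by (3)} \\
  &= B^TA^TC^T + BA^TC .
\end{aligned}
```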


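Expanding the product,

$\begin{aligned}\nabla_\theta J(\theta) &= \frac{1}{2}\nabla_\theta(X\theta-\vec{y})^T(X\theta-\vec{y})\\ &= \frac{1}{2}\nabla_\theta(\theta^TX^TX\theta-\theta^TX^T\vec{y}-\vec{y}^TX\theta+\vec{y}^T\vec{y})\end{aligned}$

Notice that the expression in parentheses is a real number, which can be viewed as a $1\times 1$ matrix and therefore equals its own trace, so

$\begin{aligned}\nabla_\theta J(\theta) &= \frac{1}{2}\nabla_\theta\,\mathrm{tr}(\theta^TX^TX\theta-\theta^TX^T\vec{y}-\vec{y}^TX\theta+\vec{y}^T\vec{y})\\ &= \frac{1}{2}\nabla_\theta(\mathrm{tr}\,\theta^TX^TX\theta-2\,\mathrm{tr}\,\vec{y}^TX\theta)\end{aligned}$

since $\mathrm{tr}\,A = \mathrm{tr}\,A^T$ (so $\mathrm{tr}\,\theta^TX^T\vec{y} = \mathrm{tr}\,\vec{y}^TX\theta$) and $\vec{y}^T\vec{y}$ involves no $\theta$ elements.
Then use equation (5) with $A^T = \theta$, $B = B^T = X^TX$, $C = I$ on the first term, together with $\nabla_\theta\,\mathrm{tr}\,\vec{y}^TX\theta = X^T\vec{y}$ on the second:

$\nabla_\theta J(\theta) = \frac{1}{2}(X^TX\theta+X^TX\theta-2X^T\vec{y}) = X^TX\theta-X^T\vec{y}$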
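The gradient of J works out to $X^TX\theta - X^T\vec{y}$, the expression that is set to zero to obtain the normal equation. A quick way to gain confidence in it is a finite-difference comparison. The sketch below assumes NumPy; the function `J`, the data sizes, and the step size `eps` are choices made for this illustration.

```python
import numpy as np

def J(theta, X, y):
    """Least-squares cost (1/2) * ||X theta - y||^2."""
    r = X @ theta - y
    return 0.5 * r @ r

rng = np.random.default_rng(0)
m, n = 8, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

# Closed-form gradient from the derivation
grad_formula = X.T @ X @ theta - X.T @ y

# Central differences along each coordinate direction e_j
eps = 1e-6
grad_numeric = np.array([
    (J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_numeric, grad_formula, atol=1e-5))
```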


To minimize J, we set its derivative to zero and obtain the normal equation:
$X^TX\theta = X^T\vec{y}$
Provided $X^TX$ is invertible, the minimizing value of $\theta$ is
$\theta = (X^TX)^{-1}X^T\vec{y}$
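In code, the normal equation can be solved as a linear system rather than by forming the inverse explicitly. The following is a minimal sketch assuming NumPy and synthetic data; `true_theta`, the noise level, and the sizes are made up for illustration. It also checks that the gradient vanishes at the solution and that the result agrees with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.standard_normal((m, n))
true_theta = np.array([2.0, -1.0, 0.5])          # hypothetical ground truth
y = X @ true_theta + 0.1 * rng.standard_normal(m)

# Solve X^T X theta = X^T y directly instead of computing (X^T X)^{-1}
theta = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient X^T X theta - X^T y should vanish at the solution
gradient = X.T @ X @ theta - X.T @ y

# Cross-check against NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(gradient, 0.0, atol=1e-8))
print(np.allclose(theta, theta_lstsq))
```

Using `np.linalg.solve` here is a design choice: it avoids the extra cost and rounding error of an explicit inverse. When $X^TX$ is singular or badly conditioned, a least-squares solver such as `np.linalg.lstsq` is generally the safer option.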

