Gaussian Processes regression
Posted by eliker
1. Gaussian processes
Gaussian processes are the extension of multivariate Gaussians to infinite-sized collections of real-valued variables. In particular, this extension allows us to think of Gaussian processes as distributions not just over random vectors but, in fact, distributions over random functions.
A Gaussian process is completely specified by its mean function $m(\cdot)$ and covariance function $k(\cdot,\cdot)$: $f(\cdot)$ is a Gaussian process with mean function $m(\cdot)$ and covariance function $k(\cdot,\cdot)$ if, for any finite set of elements $x_1, \cdots, x_m \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \cdots, f(x_m)$ has distribution
\begin{equation} \left[ \begin{array}{c} f(x_1) \\ \vdots \\ f(x_m) \end{array} \right] \sim \mathcal{N} \left( \left[ \begin{array}{c} m(x_1) \\ \vdots \\ m(x_m) \end{array} \right], \left[ \begin{array}{ccc} k(x_1,x_1) & \cdots & k(x_1,x_m) \\ \vdots & \ddots & \vdots \\ k(x_m,x_1) & \cdots & k(x_m,x_m) \end{array} \right] \right) \end{equation}
We denote this using the notation
$$ f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)) $$
Observe that the mean function and covariance function are aptly named, since the above property implies that
$$ m(x) = \mathbb{E}[f(x)] $$
$$ k(x, x') = \mathbb{E}[(f(x)-m(x))(f(x')-m(x'))] $$
for any $x, x' \in \mathcal{X}$.
In order to get a proper Gaussian process, for any set of elements $x_1, \cdots, x_m \in \mathcal{X}$, the kernel matrix
\begin{equation} K(X,X) = \left[ \begin{array}{ccc} k(x_1,x_1) & \cdots & k(x_1,x_m) \\ \vdots & \ddots & \vdots \\ k(x_m,x_1) & \cdots & k(x_m,x_m) \end{array} \right] \end{equation}
must be a valid covariance matrix corresponding to some multivariate Gaussian distribution. A standard result in probability theory states that this is true provided that $K$ is positive semi-definite.
A simple example of a Gaussian process can be obtained from the Bayesian linear regression model $f(x)=\phi(x)^T w$ with prior $w \sim \mathcal{N}(0, \Sigma_p)$. For the mean and covariance we have
\begin{eqnarray} \mathbb{E}[f(x)] &=& \phi(x)^T \mathbb{E}[w] = 0 \\ \mathbb{E}[(f(x)-0)(f(x')-0)] &=& \phi(x)^T \mathbb{E}[ww^T] \phi(x') = \phi(x)^T \Sigma_p \phi(x') \end{eqnarray}
Thus $f(x)$ and $f(x')$ are jointly Gaussian with zero mean and covariance given by $\phi(x)^T \Sigma_p \phi(x')$.
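As a quick sanity check, here is a small NumPy sketch (not from the original post; the feature map and $\Sigma_p$ below are illustrative choices) that compares the analytic covariance $\phi(x)^T \Sigma_p \phi(x')$ with a Monte Carlo estimate obtained by sampling weights from the prior.

```python
# Minimal sketch: the Bayesian linear regression prior f(x) = phi(x)^T w, w ~ N(0, Sigma_p),
# induces the covariance phi(x)^T Sigma_p phi(x'). phi and Sigma_p are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Toy polynomial feature map [1, x, x^2]."""
    return np.array([1.0, x, x**2])

Sigma_p = np.diag([1.0, 0.5, 0.25])   # prior covariance of the weights
x, x_prime = 0.3, -1.2

# Analytic covariance: phi(x)^T Sigma_p phi(x')
analytic = phi(x) @ Sigma_p @ phi(x_prime)

# Monte Carlo estimate: sample many weight vectors and average f(x) * f(x')
w = rng.multivariate_normal(np.zeros(3), Sigma_p, size=200_000)
f_x = w @ phi(x)
f_xp = w @ phi(x_prime)
empirical = np.mean(f_x * f_xp)       # zero mean, so covariance = E[f(x) f(x')]

print(f"analytic  cov: {analytic:.4f}")
print(f"empirical cov: {empirical:.4f}")
```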
2. The squared exponential kernel
The squared exponential kernel function is defined as
\begin{equation} k_{SE}(x, x') = \exp\left( -\frac{1}{2l^2} \| x - x' \|^2 \right) \label{SE_kernel} \end{equation}
where the parameter $l$ defines the characteristic length-scale.
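Below is a minimal NumPy sketch of this kernel (an illustrative helper, not code from the original post); it is reused in the later snippets. Inputs are assumed to be 2-D arrays of shape (n, d), and `lengthscale` plays the role of $l$.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0):
    """Squared exponential kernel matrix K with K[i, j] = exp(-||x1_i - x2_j||^2 / (2 l^2))."""
    # Pairwise squared Euclidean distances between rows of X1 and X2.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * sq_dists / lengthscale**2)
```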
In our example, since we use a zero-mean Gaussian process, we would expect that the function values from our Gaussian process will tend to be distributed around zero. Furthermore, for any pair $x, x' \in \mathcal{X}$:
a. $f(x)$ and $f(x')$ will tend to have high covariance when $x$ and $x'$ are “nearby” in the input space (i.e. $\|x-x'\| \approx 0$, so $\exp(-\frac{1}{2l^2} \| x-x' \|^2) \approx 1$).
b. $f(x)$ and $f(x')$ will tend to have low covariance when $x$ and $x'$ are “far apart” (i.e. $\|x-x'\| \gg 0$, so $\exp(-\frac{1}{2l^2} \| x-x' \|^2) \approx 0$).
The specification of the covariance function implies a distribution over functions. To see this, we can draw samples from the distribution of functions evaluated at any number of input points $X_*$, and write out the corresponding covariance matrix using (\ref{SE_kernel}). We then generate a random Gaussian vector with this covariance matrix,
$$ f_* \sim \mathcal{N}(0, K(X_*, X_*)) $$
and plot the generated values as a function of the inputs, as in the sketch below.
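The following sketch carries out exactly these steps: build $K(X_*, X_*)$, draw Gaussian vectors with that covariance, and plot them against the inputs. It assumes the `se_kernel` helper above; the grid, jitter, and random seed are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

X_star = np.linspace(-5, 5, 200)[:, None]
K_ss = se_kernel(X_star, X_star, lengthscale=1.0)

# Small jitter on the diagonal keeps the covariance numerically positive definite.
K_ss += 1e-8 * np.eye(len(X_star))

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(len(X_star)), K_ss, size=5)

for f in samples:
    plt.plot(X_star.ravel(), f)
plt.title("Samples from a zero-mean GP prior with SE kernel")
plt.show()
```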
3. Gaussian processes regression
Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ be a training set of i.i.d. examples from some unknown distribution. In the Gaussian process regression model,
$$ y = f(x) + \varepsilon $$
where the $\varepsilon$ are i.i.d. “noise” variables with independent $\mathcal{N}(0, \sigma_n^2)$ distributions. As in Bayesian linear regression, we also assume a prior distribution over functions $f(\cdot)$; in particular, we assume a zero-mean Gaussian process prior,
$$ f(\cdot) \sim \mathcal{GP}(0, k(\cdot,\cdot)) $$
for some valid covariance function $k(\cdot,\cdot)$.
3.1 Prediction with Noise-free Observations
The joint distribution of the training outputs $f$ and the test outputs $f_*$ under the prior is
\begin{equation} \left[ \begin{array}{c} f \\ f_* \end{array} \right] \sim \mathcal{N} \left( 0, \left[ \begin{array}{cc} K(X,X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{array} \right] \right) \end{equation}
Conditioning this joint Gaussian prior distribution on the observations gives the posterior distribution
\begin{eqnarray} f_* \mid X_*, X, f \sim & \mathcal{N} & \big( K(X_*,X) K(X,X)^{-1} f, \nonumber \\ & & K(X_*, X_*) - K(X_*,X) K(X,X)^{-1} K(X, X_*) \big) \end{eqnarray}
https://github.com/elike-ypq/Gaussian_Process/blob/master/Gassian_regression_no_noise.m
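The linked file is a MATLAB implementation; below is a hedged NumPy sketch of the same noise-free conditioning, reusing the `se_kernel` helper defined earlier. The training data and test grid are illustrative, not from the original post.

```python
import numpy as np

def gp_predict_noise_free(X, f, X_star, lengthscale=1.0):
    K = se_kernel(X, X, lengthscale) + 1e-10 * np.eye(len(X))   # jitter for stability
    K_s = se_kernel(X, X_star, lengthscale)                     # K(X, X_*)
    K_ss = se_kernel(X_star, X_star, lengthscale)               # K(X_*, X_*)

    K_inv = np.linalg.inv(K)      # fine for small examples; prefer a Cholesky solve in practice
    mean = K_s.T @ K_inv @ f                                    # K(X_*,X) K(X,X)^{-1} f
    cov = K_ss - K_s.T @ K_inv @ K_s                            # posterior covariance
    return mean, cov

# Example usage with a few noise-free observations of sin(x).
X = np.array([[-4.0], [-2.0], [0.0], [1.5], [3.0]])
f = np.sin(X).ravel()
X_star = np.linspace(-5, 5, 100)[:, None]
mean, cov = gp_predict_noise_free(X, f, X_star)
```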
3.2 Prediction using Noisy Observations
The prior on the noisy observations becomes
$$ \mathrm{cov}(y_p, y_q) = k(x_p, x_q) + \sigma_n^2 \delta_{pq} \qquad \text{or} \qquad \mathrm{cov}(y) = K(X,X) + \sigma_n^2 I $$
where $\delta_{pq}$ is the Kronecker delta, which is one iff $p = q$ and zero otherwise. We can write the joint distribution of the observed target values and the function values at the test locations under the prior as
\begin{equation} \left[ \begin{array}{c} y \\ f_* \end{array} \right] \sim \mathcal{N} \left( 0, \left[ \begin{array}{cc} K(X,X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{array} \right] \right) \end{equation}
Deriving the conditional distribution, we arrive at the key predictive equations for Gaussian process regression,
\begin{equation} f_* \mid X, y, X_* \sim \mathcal{N}(\bar{f}_*, \mathrm{cov}(f_*)) \end{equation}
where
\begin{eqnarray} \bar{f}_* &=& \mathbb{E}[f_* \mid X, y, X_*] = K(X_*,X)\,[K(X,X)+\sigma_n^2 I]^{-1} y \\ \mathrm{cov}(f_*) &=& K(X_*,X_*) - K(X_*,X)\,[K(X,X)+\sigma_n^2 I]^{-1} K(X,X_*) \end{eqnarray}
https://github.com/elike-ypq/Gaussian_Process/blob/master/Gassian_regression_with_noise.m
We can easily compute the predictive distribution of the test targets $y_*$ by adding $\sigma_n^2 I$ to the covariance in the expression for $\mathrm{cov}(f_*)$; see the sketch below.
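Again, the linked file is in MATLAB; here is a minimal NumPy sketch of these predictive equations, using a Cholesky factorization as a numerically stable way to apply $[K(X,X)+\sigma_n^2 I]^{-1}$. The data, length-scale, and $\sigma_n$ are illustrative, and `se_kernel` is the helper from Section 2.

```python
import numpy as np

def gp_predict_noisy(X, y, X_star, lengthscale=1.0, sigma_n=0.1):
    K = se_kernel(X, X, lengthscale) + sigma_n**2 * np.eye(len(X))   # K(X,X) + sigma_n^2 I
    K_s = se_kernel(X, X_star, lengthscale)                          # K(X, X_*)
    K_ss = se_kernel(X_star, X_star, lengthscale)                    # K(X_*, X_*)

    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))              # [K + sigma_n^2 I]^{-1} y
    mean = K_s.T @ alpha                                             # predictive mean of f_*
    v = np.linalg.solve(L, K_s)
    cov_f = K_ss - v.T @ v                                           # cov(f_*)
    cov_y = cov_f + sigma_n**2 * np.eye(len(X_star))                 # predictive cov of y_*
    return mean, cov_f, cov_y

# Example usage with noisy observations of sin(x).
rng = np.random.default_rng(2)
X = np.linspace(-4, 4, 8)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(len(X))
X_star = np.linspace(-5, 5, 100)[:, None]
mean, cov_f, cov_y = gp_predict_noisy(X, y, X_star)
```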
4. Incorporating Explicit Basis Functions
It is common but by no means necessary to consider GPs with a zero mean function. The use of explicit basis functions is a way to specify a non-zero mean over functions.
Using a fixed (deterministic) mean function $m(x)$ is trivial: simply apply the usual zero mean GP to the difference between the observations and the fixed mean function. With
$$ f(x) \sim \mathcal{GP}(m(x), k(x,x')) $$
the predictive mean becomes
$$ \bar{f}_* = \mathbb{E}[f_* \mid X, y, X_*] = m(X_*) + K(X_*,X)[K(X,X)+\sigma_n^2 I]^{-1} (y - m(X)) $$
$$ \mathrm{cov}(f_*) = K(X_*,X_*) - K(X_*,X)[K(X,X)+\sigma_n^2 I]^{-1} K(X,X_*) $$
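A brief sketch of this recipe, continuing with the `gp_predict_noisy` helper and the `X`, `y`, `X_star` arrays from the previous snippet (the mean function $m(x) = 0.5x$ is an illustrative choice): regress on the residuals $y - m(X)$, then add $m(X_*)$ back to the predictive mean; the covariance is unchanged.

```python
import numpy as np

def m(x):
    return 0.5 * x.ravel()   # illustrative fixed mean function

mean_res, cov_f, cov_y = gp_predict_noisy(X, y - m(X), X_star)
mean_star = m(X_star) + mean_res     # add the fixed mean back; covariance is unchanged
```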
However, in practice it can often be difficult to specify a fixed mean function. In many cases it may be more convenient to specify a few fixed basis functions, whose coefficients, $\beta$, are to be inferred from the data. Consider
\begin{equation} g(x) = f(x) + h(x)^T \beta \qquad \text{where} \qquad f(x) \sim \mathcal{GP}(0, k(x,x')) \label{GP_e1} \end{equation}
Here $f(x)$ is a zero-mean GP, $h(x)$ is a set of fixed basis functions, and $\beta$ are additional parameters. When fitting the model, one could optimize over the parameters $\beta$ jointly with the hyperparameters of the covariance function. Alternatively, if we take the prior on $\beta$ to be Gaussian, $\beta \sim \mathcal{N}(b, B)$, we can integrate these parameters out and obtain another GP,
\begin{equation} g(x) \sim \mathcal{GP}\left( h(x)^T b,\; k(x,x') + h(x)^T B\, h(x') \right) \label{GP_e2} \end{equation}
now with an added contribution to the covariance function caused by the uncertainty in the parameters of the mean. Predictions are made by plugging the mean and covariance functions of $g(x)$ in (\ref{GP_e2}) into the prediction equations for a GP with a non-zero mean given above. After rearranging, we obtain
\begin{eqnarray} \bar{g}_* &=& H_*^T \bar{\beta} + K(X_*,X)[K(X,X)+\sigma_n^2 I]^{-1} (y - H^T \bar{\beta}) = \bar{f}_* + R^T \bar{\beta} \\ \mathrm{cov}(g_*) &=& \mathrm{cov}(f_*) + R^T (B^{-1} + H K_y^{-1} H^T)^{-1} R \end{eqnarray}
where the $H$ matrix collects the $h(x)$ vectors for all training cases (and $H_*$ for the test cases), $K_y = K(X,X) + \sigma_n^2 I$, $\bar{\beta} = (B^{-1} + H K_y^{-1} H^T)^{-1}(H K_y^{-1} y + B^{-1} b)$, and $R = H_* - H K_y^{-1} K(X, X_*)$.
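To make the bookkeeping concrete, here is a hedged NumPy sketch of these equations, assuming $h(x) = [1, x]^T$ (a constant plus a linear trend) and a broad Gaussian prior $\beta \sim \mathcal{N}(b, B)$. It reuses `se_kernel` and the `X`, `y`, `X_star` arrays from the earlier snippets; all of these choices are illustrative rather than from the original post.

```python
import numpy as np

def h(X):
    """Fixed basis functions h(x) = [1, x], stacked as a (p, n) matrix H."""
    return np.vstack([np.ones(len(X)), X.ravel()])

sigma_n, lengthscale = 0.1, 1.0
K_y = se_kernel(X, X, lengthscale) + sigma_n**2 * np.eye(len(X))   # K_y = K(X,X) + sigma_n^2 I
K_s = se_kernel(X, X_star, lengthscale)                            # K(X, X_*)
K_ss = se_kernel(X_star, X_star, lengthscale)                      # K(X_*, X_*)

H, H_s = h(X), h(X_star)
b = np.zeros(2)
B = 10.0 * np.eye(2)                                # broad Gaussian prior on beta

K_y_inv = np.linalg.inv(K_y)
A = np.linalg.inv(B) + H @ K_y_inv @ H.T            # B^{-1} + H K_y^{-1} H^T
beta_bar = np.linalg.solve(A, H @ K_y_inv @ y + np.linalg.solve(B, b))

f_bar = K_s.T @ K_y_inv @ y                         # zero-mean GP predictive mean
cov_f = K_ss - K_s.T @ K_y_inv @ K_s                # zero-mean GP predictive covariance
R = H_s - H @ K_y_inv @ K_s                         # R = H_* - H K_y^{-1} K(X, X_*)

g_bar = f_bar + R.T @ beta_bar                      # predictive mean with basis functions
cov_g = cov_f + R.T @ np.linalg.solve(A, R)         # added uncertainty from beta
```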