You can execute the whole notebook at
This tutorial is for readers who want to use Gaussian processes (GP) but do not yet feel comfortable with them. That was my situation after reading the first two chapters of the textbook Gaussian Process for Machine Learning [1]. There is a gap between being able to use GP and feeling comfortable with it, and that gap comes from the difficulty of the theory. When I read the textbook and watched tutorial videos online, I could follow most of the material without much trouble, and the content more or less made sense. But when I tried to explain to myself what a GP is, the big picture stayed blurry. After working through various resources, including textbooks, blog posts, and open-source code, I sorted out my understanding and summarize it here from my perspective.
One thing I realized is that much of the difficulty in understanding GP comes from differences in background: everyone starts with different knowledge. Understanding GP, even at an intuitive level, requires familiarity with the multivariate Gaussian, kernels, and conditional probability. If you are familiar with these, start reading from Math. An entry- or medium-level background in deep learning (at the application level), without a solid grounding in machine learning theory, can make GP even more confusing.
First of all, why use a Gaussian process for regression? Or even, what is regression? Regression is a common machine learning task that can be described as follows: given some observed data points (the training dataset), find a function that represents the dataset as closely as possible, then use that function to make predictions at new data points. Regression can be done with polynomials, and usually more than one function fits the observed data. Besides the predictions themselves, we also want to know how certain they are. Quantifying uncertainty is also valuable for learning efficiently: the areas with the least certainty should be explored more.
In short, a GP can make predictions at new data points and tell us how certain those predictions are.
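As a small illustration of the claim that more than one function can fit the same observations, the sketch below fits polynomials of two different degrees to the same noisy points with `np.polyfit`. The data-generating function, the noise level, and the degrees are assumptions chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: a few noisy observations of sin(x).
x_train = np.linspace(-3, 3, 8)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(x_train.shape)

# Fit two polynomials of different degree to the same points.
coef_deg3 = np.polyfit(x_train, y_train, deg=3)
coef_deg5 = np.polyfit(x_train, y_train, deg=5)

# Sum of squared residuals on the training points for each fit.
sse_deg3 = np.sum((np.polyval(coef_deg3, x_train) - y_train) ** 2)
sse_deg5 = np.sum((np.polyval(coef_deg5, x_train) - y_train) ** 2)
print("training SSE, degree 3:", sse_deg3)
print("training SSE, degree 5:", sse_deg5)

# Predictions at a new input outside the observed range. The two fits
# will typically disagree here, and neither reports how much its
# prediction should be trusted.
x_new = 4.0
pred_deg3 = np.polyval(coef_deg3, x_new)
pred_deg5 = np.polyval(coef_deg5, x_new)
print("prediction at x=4, degree 3:", pred_deg3)
print("prediction at x=4, degree 5:", pred_deg5)
```

Both polynomials are plausible candidates for the same data, which is the motivation for wanting a model that also reports its own uncertainty.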
Let's talk about Gaussian.
A random variable
Here,
A n
number random sample points from a x
axis.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Plot 1-D gaussian
n = 1 # n number of independent 1-D gaussian
m= 1000 # m points in 1-D gaussian
f_random = np.random.normal(size=(n, m))
# more information about 'size': https://www.sharpsightlabs.com/blog/numpy-random-normal/
#print(f_random.shape)
for i in range(n):
    #sns.distplot(f_random[i], hist=True, rug=True, vertical=True, color="orange")
    sns.distplot(f_random[i], hist=True, rug=True)
plt.title('1 random samples from a 1-D Gaussian distribution')
plt.xlabel('x')
plt.ylabel('P(x)')
plt.show()
We generated data points that follow the normal distribution. Conversely, we can take observed data points, assume they are Gaussian, model them with a function, and use that function for regression. As shown above, a kernel density estimate and a histogram were computed from the generated points. The kernel density estimate looks like a normal distribution because there are plenty (m=1000) of observation points to produce this Gaussian-looking PDF. In regression, even when we do not have that many observations, we can still model the data as a function that follows a normal distribution if we assume a Gaussian prior.
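To make the comparison between the sampled points and the Gaussian density concrete, here is a minimal sketch that evaluates the closed-form 1-D Gaussian PDF and compares it with a normalized histogram of 1000 samples. The helper name `gaussian_pdf`, the bin count, and the seed are assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Closed-form density of a 1-D Gaussian N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)  # m = 1000 points, as in the plot above

# Normalized histogram: the bar areas sum to 1, so the heights estimate the PDF.
density, edges = np.histogram(samples, bins=20, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Largest gap between the histogram estimate and the exact density.
max_gap = np.max(np.abs(density - gaussian_pdf(centers)))
print("max |histogram - pdf| over bins:", max_gap)
```

With fewer samples the gap grows, which is why a prior assumption about the shape becomes useful when data are scarce.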
The Gaussian PDF
We have a random generated dataset in
Now, if we project all points
n = 1 # n number of independent 1-D gaussian
m= 1000 # m points in 1-D gaussian
f_random = np.random.normal(size=(n, m))
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
plt.clf()
plt.plot(Xshow, f_random, 'o', linewidth=1, markersize=1, markeredgewidth=2)
plt.xlabel('$X$')
plt.ylabel('$f(X)$')
plt.show()
It's clear that the vector
n = 2
m = 1000
f_random = np.random.normal(size=(n, m))
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
plt.clf()
plt.plot(Xshow, f_random, 'o', linewidth=1, markersize=1, markeredgewidth=2)
plt.xlabel('$X$')
plt.ylabel('$f(X)$')
plt.show()
Keep in mind that both vectors
Let's do something interesting. Let's connect points of
n = 2
m = 10
f_random = np.random.normal(size=(n, m))
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
plt.clf()
plt.plot(Xshow, f_random, '-o', linewidth=2, markersize=4, markeredgewidth=2)
plt.xlabel('$X$')
plt.ylabel('$f(X)$')
plt.show()
Now think back to regression. These lines look like functions through each pair of points. On the other hand, the plot also looks like we are sampling the region
n = 20
m = 10
f_random = np.random.normal(size=(n, m))
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
plt.clf()
plt.plot(Xshow, f_random, '-o', linewidth=1, markersize=3, markeredgewidth=2)
plt.xlabel('$X$')
plt.ylabel('$f(X)$')
plt.show()
Wait a second: what are we trying to do by connecting randomly generated, independent Gaussian points? These lines look like functions, but they are too noisy. If
Thus, functions built by connecting independent Gaussians are not suitable for regression; we need Gaussians that are correlated with each other. How do we describe jointly distributed Gaussians? With the multivariate Gaussian.
In some situations, a system (set of data) has to be described by more than one feature variable
The
Go to Appendix A if you want to generate the image on the left.
Formally, multivariate Gaussian is expressed as [4]
The mean vector
The covariance matrix of
A
When we have an
Play around with the covariance matrix to see the correlations between the two Gaussians.
import pandas as pd
import seaborn as sns
mean, cov = [0., 0.], [(1., -0.6), (-0.6, 1.)]
data = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(data, columns=["x1", "x2"])
g = sns.jointplot(x="x1", y="x2", data=df, kind="kde")  # keyword arguments are required by recent seaborn releases
#(sns.jointplot("x1", "x2", data=df).plot_joint(sns.kdeplot))
g.plot_joint(plt.scatter, c="g", s=30, linewidth=1, marker="+")
#g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$x1$", "$x2$");
g.ax_joint.legend_.remove()
plt.show()
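As a numerical companion to the joint plot, the sketch below draws samples from the same bivariate Gaussian and estimates the covariance matrix back from the samples. The sample size and seed are assumptions; changing the off-diagonal entries of `cov` changes the estimated correlation accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(2)
cov = np.array([[1.0, -0.6],
                [-0.6, 1.0]])  # same covariance as the joint plot above

# Draw many samples and estimate the covariance back from them.
data = rng.multivariate_normal(mean, cov, size=20000)
cov_hat = np.cov(data, rowvar=False)                # rows are samples, columns are x1, x2
corr_hat = np.corrcoef(data, rowvar=False)[0, 1]

print("specified covariance:\n", cov)
print("estimated covariance:\n", cov_hat)
print("estimated correlation between x1 and x2:", corr_hat)
```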
Another good MVN visualization is Multivariante Gaussians and Mixtures of Gaussians (MoG).
We want to smooth the sampled functions by defining a covariance function. Consider the fact that when two vectors are similar, their dot product is high. This is easy to see in the dot product equation
A popular covariance function (aka kernel function) is the squared exponential kernel, also called the radial basis function (RBF) kernel or Gaussian kernel, defined as
Let's re-plot 20 independent Gaussians and connect the points in order with lines. Instead of generating 20 independent Gaussians as before, we plot a
n = 20
m = 10
mean = np.zeros(n)
cov = np.eye(n)
f_prior = np.random.multivariate_normal(mean, cov, m).T
plt.clf()
#plt.plot(Xshow, f_prior, '-o')
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
for i in range(m):
    plt.plot(Xshow, f_prior[:, i], '-o', linewidth=1)  # one line per sample
plt.title('10 samples of the 20-D gaussian prior')
plt.show()
We got exactly the same plot, as expected. Now let's kernelize our functions by using the RBF as our covariance.
# Define the kernel
def kernel(a, b):
    sqdist = np.sum(a**2,axis=1).reshape(-1,1) + np.sum(b**2,1) - 2*np.dot(a, b.T)
    # np.sum(..., axis=1) sums each row; .reshape(-1, 1) turns shape (n,) into (n, 1)
    return np.exp(-.5 * sqdist)
n = 20
m = 10
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
K_ = kernel(Xshow, Xshow) # k(x_star, x_star)
mean = np.zeros(n)
cov = np.eye(n)
f_prior = np.random.multivariate_normal(mean, K_, m).T
plt.clf()
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
for i in range(m):
    plt.plot(Xshow, f_prior[:, i], '-o', linewidth=1)  # one line per sample
plt.title('10 samples of the 20-D gaussian kernelized prior')
plt.show()
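To see why the kernelized samples are smoother, it can help to inspect the kernel matrix itself. The following sketch repeats the `kernel` definition from above so that it runs on its own, then checks a few properties of the resulting covariance matrix on the same 20 inputs.

```python
import numpy as np

def kernel(a, b):
    """Squared exponential kernel with unit length scale, as defined above."""
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * np.dot(a, b.T)
    return np.exp(-0.5 * sqdist)

X = np.linspace(0, 1, 20).reshape(-1, 1)
K = kernel(X, X)

# A covariance matrix must be symmetric with non-negative eigenvalues
# (up to floating-point rounding error).
min_eig = np.linalg.eigvalsh(K).min()
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", min_eig)

# Every point has unit variance, neighbouring points are almost perfectly
# correlated, and the covariance decays with distance.
print("k(x_0, x_0)  =", K[0, 0])
print("k(x_0, x_1)  =", K[0, 1])
print("k(x_0, x_19) =", K[0, -1])
```

Because neighbouring inputs have covariance close to 1, their sampled values cannot differ much, which is what removes the jaggedness seen with the identity covariance.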
We get much smoother lines, and they look even more like functions. When the dimension of the Gaussian gets larger, there is no need to connect the points. When the dimension becomes infinite, there is a point representing every possible input. Let's plot m=200 samples of the n=200 dimensional Gaussian.
n = 200
m = 200
Xshow = np.linspace(0, 1, n).reshape(-1,1)
K_ = kernel(Xshow, Xshow) # k(x_star, x_star)
mean = np.zeros(n)
cov = np.eye(n)
f_prior = np.random.multivariate_normal(mean, K_, m).T
plt.clf()
#plt.plot(Xshow, f_prior, '-o')
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
plt.figure(figsize=(18,9))
for i in range(m):
    plt.plot(Xshow, f_prior[:, i], 'o', linewidth=1, markersize=2, markeredgewidth=1)  # one sample per call
plt.title('200 samples of the 200-D gaussian kernelized prior')
#plt.axis([0, 1, -3, 3])
plt.show()
#plt.savefig('priorT.png', bbox_inches='tight', dpi=300)
As we can see above, as we increase the dimension of the Gaussian toward infinity, we can sample all possible points in our region of interest.
Here we talk a little bit about parametric and nonparametric models. You can skip this section without compromising your understanding of Gaussian processes.
Parametric models assume that the data distribution can be modeled in terms of a finite set of parameters. For regression, we have some data points, and we would like to make predictions of the value of
To generate correlated normally distributed random samples, one can first generate uncorrelated samples and then multiply them by a matrix $L$ such that $L L^T = K$, where $K$ is the desired covariance matrix.
n = 20
m = 10
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
K_ = kernel(Xshow, Xshow)
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n,m)))
plt.clf()
plt.plot(Xshow, f_prior, '-o')
plt.title('10 samples of the 20-D gaussian kernelized prior')
plt.show()
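The Cholesky construction can be checked numerically. The sketch below is self-contained (it repeats the unit-length-scale `kernel` from above) and verifies that the factor reproduces the covariance and that the empirical covariance of the transformed samples approaches it. The sample count and seed are assumptions.

```python
import numpy as np

def kernel(a, b):
    """Squared exponential kernel with unit length scale, as defined above."""
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * np.dot(a, b.T)
    return np.exp(-0.5 * sqdist)

rng = np.random.default_rng(0)
n, m = 20, 50000
X = np.linspace(0, 1, n).reshape(-1, 1)
K = kernel(X, X) + 1e-6 * np.eye(n)   # small jitter keeps the factorization stable

L = np.linalg.cholesky(K)             # lower triangular factor with L @ L.T == K
u = rng.normal(size=(n, m))           # uncorrelated standard normal samples
f = L @ u                             # correlated samples, one per column

factor_err = np.max(np.abs(L @ L.T - K))
cov_err = np.max(np.abs(np.cov(f) - K))   # np.cov treats each row as a variable
print("max |L L^T - K|:", factor_err)
print("max |empirical cov - K|:", cov_err)
```

The reason this works is that if $u$ has identity covariance, then $Lu$ has covariance $L I L^T = K$.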
First, let's go back again to our task, regression. There is a function
Second, let's review what we got from the MVN. We got the feeling that when the dimension of the Gaussian is infinite, we can sample the whole region of interest with random functions. These infinite random functions are MVN because that is our assumption (prior). More formally, the prior distribution of these infinite random functions is MVN. The prior distribution representing the kind of outputs
When we have observation points, instead of infinitely many random functions, we keep only the functions that fit these points. This gives us our posterior, the current belief based on the existing observations. When we get more observation points, we use our previous posterior as the prior and use the new observations to update the posterior.
This is Gaussian process.
A Gaussian process is a probability distribution over possible functions that fit a set of points.
Because we have a probability distribution over all possible functions, we can calculate the mean as the regression function, and calculate the variance to show how confident we are in the predictions made with that function.
Keep in mind,
- The functions (posterior) update with new observations.
- The mean calculated from the posterior distribution of the possible functions is the function used for regression.
I highly recommend reading Appendix A.1 and A.2 [3] before continuing. Basic math.
The function is modeled by a multivariate Gaussian as
where
So, we have observations, and we have estimated functions
The joint distribution of
$$ \begin{pmatrix}\mathbf{f} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}m(\mathbf{X}) \\ m(\mathbf{X}_*)\end{pmatrix}, \begin{pmatrix}\mathbf{K} & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}\right) $$
where
This models the joint distribution $p(\mathbf{f}, \mathbf{f}_* \,\vert\, \mathbf{X}, \mathbf{X}_*)$, but we want the conditional distribution over $\mathbf{f}_*$ only, which is $p(\mathbf{f}_* \,\vert\, \mathbf{f}, \mathbf{X}, \mathbf{X}_*)$. The derivation from the joint distribution $p(\mathbf{f}, \mathbf{f}_* \,\vert\, \mathbf{X}, \mathbf{X}_*)$ to the conditional $p(\mathbf{f}_* \,\vert\, \mathbf{f}, \mathbf{X}, \mathbf{X}_*)$ uses the marginal and conditional distributions of MVN theorem [5].
We get eqn. 2.19 [1]: $$\mathbf{f}_* \,\vert\, \mathbf{f}, \mathbf{X}, \mathbf{X}_* \sim \mathcal{N} \left(\mathbf{K}_*^T \mathbf{K}^{-1} \mathbf{f},\; \mathbf{K}_{**}-\mathbf{K}_*^T \mathbf{K}^{-1} \mathbf{K}_*\right) $$
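A tiny worked instance of the noise-free conditional may help. The sketch below assumes a zero-mean joint Gaussian over a single observed value and a single unknown value with covariance 0.8 between them; all numbers are hypothetical and chosen so the result can be checked by hand.

```python
import numpy as np

# Hypothetical 2-D example: one observed value f and one unknown value f_star,
# jointly Gaussian with zero mean.
K = np.array([[1.0]])      # cov(f, f)
K_s = np.array([[0.8]])    # cov(f, f_star)
K_ss = np.array([[1.0]])   # cov(f_star, f_star)
f = np.array([1.0])        # observed function value

# Conditional mean and covariance in the noise-free case:
#   mean = K_s^T K^{-1} f
#   cov  = K_ss - K_s^T K^{-1} K_s
mean_cond = K_s.T @ np.linalg.solve(K, f)
cov_cond = K_ss - K_s.T @ np.linalg.solve(K, K_s)

print("conditional mean:", mean_cond)       # 0.8 * 1.0
print("conditional variance:", cov_cond)    # 1 - 0.8**2, up to rounding
```

Observing the correlated value pulls the mean toward it and shrinks the variance from 1 to about 0.36; with zero correlation nothing would change.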
In realistic modelling situations, we do not have access to the function values themselves, but only noisy versions thereof
$$ \begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix}\mathbf{K} + \sigma_n^2\mathbf{I} & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}\right) $$ Deriving the conditional distribution corresponding to eqn. 2.19, we get the predictive equations (eqn. 2.22, eqn. 2.23, and eqn. 2.24) [1] for Gaussian process regression as
$$\mathbf{f}_* \,\vert\, \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N} \left(\mathbf{\bar{f}}_*, cov(\mathbf{f}_*)\right) $$
where $$\mathbf{\bar{f}}_* \overset{\Delta}{=} \mathbb{E} [\mathbf{f}_* \,\vert\, \mathbf{X}, \mathbf{y}, \mathbf{X}_*] = \mathbf{K}_*^T [\mathbf{K} + \sigma_n^2\mathbf{I}]^{-1} \mathbf{y} $$
$$cov(\mathbf{f}_*) = \mathbf{K}_{**} - \mathbf{K}_*^T [\mathbf{K} + \sigma_n^2\mathbf{I}]^{-1} \mathbf{K}_* $$
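The predictive mean and covariance can be written almost literally in NumPy. This is a minimal sketch under stated assumptions: a zero-mean prior, a unit-length-scale squared exponential kernel, and hypothetical names (`rbf`, `gp_predict`, `noise_var`) and toy data that are not part of the original code. It uses `np.linalg.solve` rather than an explicit inverse.

```python
import numpy as np

def rbf(a, b):
    """Squared exponential kernel with unit length scale (an assumption here)."""
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * a @ b.T
    return np.exp(-0.5 * sqdist)

def gp_predict(X, y, X_star, noise_var):
    """Predictive mean and covariance of a zero-mean GP at the inputs X_star."""
    K = rbf(X, X) + noise_var * np.eye(len(X))     # K + sigma_n^2 I
    K_s = rbf(X, X_star)                           # K_*
    K_ss = rbf(X_star, X_star)                     # K_**
    mean = K_s.T @ np.linalg.solve(K, y)           # K_*^T [K + sigma_n^2 I]^{-1} y
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)   # K_** - K_*^T [K + sigma_n^2 I]^{-1} K_*
    return mean, cov

# Hypothetical toy data: three observations of sin(x).
X = np.array([[-1.0], [0.0], [1.0]])
y = np.sin(X).ravel()
X_star = np.array([[0.0], [3.0]])   # one test point at an observation, one far away

mean, cov = gp_predict(X, y, X_star, noise_var=1e-4)
std = np.sqrt(np.diag(cov))
print("predictive mean:", mean)
print("predictive std: ", std)
```

The predictive standard deviation is small at the observed input and close to the prior value of 1 far from the data, which is the uncertainty behaviour described earlier.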
We do the regression example between -5 and 5. The observation data points (training dataset) are drawn from a uniform distribution between -5 and 5, so any point within the interval [-5, 5] is equally likely to be drawn. The functions are evaluated at n evenly spaced points between -5 and 5, which lets us show a continuous regression function over our region of interest [-5, 5]. This is a simple example of GP regression. It assumes a zero-mean GP prior. The code borrows heavily from Dr. Nando de Freitas' Gaussian processes for nonlinear regression lecture [6].
The algorithm executed follows:
- The textbook GPML, p. 19 [1].
- Dr. Nando de Freitas, Introduction to Gaussian processes [6].
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten()
#f = lambda x: (0.25*(x**2)).flatten()
x = np.arange(-5, 5, 0.1)
plt.plot(x, f(x))
plt.axis([-5, 5, -3, 3])
plt.show()
# Define the kernel
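def kernel(a, b):
    # As written, kernelParameter_l divides the squared distance directly
    # (it plays the role of a squared length scale), and kernelParameter_sigma
    # multiplies the kernel directly (it plays the role of a signal variance).
    kernelParameter_l = 0.1
    kernelParameter_sigma = 1.0
    sqdist = np.sum(a**2,axis=1).reshape(-1,1) + np.sum(b**2,1) - 2*np.dot(a, b.T)
    # np.sum(..., axis=1) sums each row; .reshape(-1, 1) turns shape (n,) into (n, 1)
    return kernelParameter_sigma*np.exp(-.5 * (1/kernelParameter_l) * sqdist)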
We use a general Squared Exponential Kernel, also called Radial Basis Function Kernel or Gaussian Kernel:
where
# Sample some input points and noisy versions of the function evaluated at
# these points.
N = 20 # number of existing observation points (training points).
n = 200 # number of test points.
s = 0.00005 # noise variance.
X = np.random.uniform(-5, 5, size=(N,1)) # N training points
y = f(X) + np.sqrt(s)*np.random.randn(N) # s is a variance, so the noise is scaled by its square root
K = kernel(X, X)
L = np.linalg.cholesky(K + s*np.eye(N)) # line 1
# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, n).reshape(-1,1)
# compute the mean at our test points.
Lk = np.linalg.solve(L, kernel(X, Xtest)) # k_star = kernel(X, Xtest), calculating v := l\k_star
mu = np.dot(Lk.T, np.linalg.solve(L, y)) # \alpha = np.linalg.solve(L, y)
# compute the variance at our test points.
K_ = kernel(Xtest, Xtest) # k(x_star, x_star)
s2 = np.diag(K_) - np.sum(Lk**2, axis=0)
s = np.sqrt(s2)
# PLOTS:
plt.figure(1)
plt.clf()
plt.plot(X, y, 'k+', ms=18)
plt.plot(Xtest, f(Xtest), 'b-')
plt.gca().fill_between(Xtest.flat, mu-2*s, mu+2*s, color="#dddddd")
plt.plot(Xtest, mu, 'r--', lw=2)
#plt.savefig('predictive.png', bbox_inches='tight', dpi=300)
plt.title('Mean predictions plus 2 st.deviations')
plt.show()
#plt.axis([-5, 5, -3, 3])
# draw samples from the posterior at our test points.
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,40))) # size=(n, m), m shown how many posterior
plt.figure(3)
plt.clf()
plt.figure(figsize=(18,9))
plt.plot(X, y, 'k+', markersize=20, markeredgewidth=3)
plt.plot(Xtest, mu, 'r--', linewidth=3)
plt.plot(Xtest, f_post, linewidth=0.8)
plt.title('40 samples from the GP posterior, mean prediction function and observation points')
plt.show()
#plt.axis([-5, 5, -3, 3])
#plt.savefig('post.png', bbox_inches='tight', dpi=600)
We plotted m=40 samples from the Gaussian process posterior, together with the mean function used for prediction and the observation data points (training dataset). All posterior functions clearly collapse onto the observation points.
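The collapse at the observation points can also be checked numerically. The sketch below re-creates the setup in a self-contained way (the `length_sq` argument mirrors `kernelParameter_l`; the seed and the extra test point at x = 8 are assumptions) and compares the predictive standard deviation at the training inputs with the one far from the data.

```python
import numpy as np

def kernel(a, b, length_sq=0.1):
    """Squared exponential kernel; length_sq mirrors kernelParameter_l above."""
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * a @ b.T
    return np.exp(-0.5 * sqdist / length_sq)

rng = np.random.default_rng(0)
N = 20
noise_var = 5e-5
X = rng.uniform(-5, 5, size=(N, 1))   # training inputs
y = np.sin(0.9 * X).ravel() + np.sqrt(noise_var) * rng.standard_normal(N)

# Test inputs: the training inputs themselves plus one point far from the data.
X_test = np.vstack([X, [[8.0]]])

# Same Cholesky-based computation as in the example above.
L = np.linalg.cholesky(kernel(X, X) + noise_var * np.eye(N))
Lk = np.linalg.solve(L, kernel(X, X_test))
mu = Lk.T @ np.linalg.solve(L, y)
var = np.diag(kernel(X_test, X_test)) - np.sum(Lk**2, axis=0)
std = np.sqrt(np.maximum(var, 0.0))   # clip tiny negative rounding errors

print("largest predictive std at an observation:", std[:-1].max())
print("predictive std far from the data:        ", std[-1])
```

At the observations the standard deviation is bounded by the noise level, so every posterior sample passes very close to the data, while far away the posterior falls back to the prior.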
The general RBF kernel:
where
More complex kernel functions can be selected depending on the specific task. More information about choosing the kernel/covariance function for a Gaussian process can be found in The Kernel Cookbook [8].
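To get a feel for what kernel hyperparameters do, here is a minimal sketch of an RBF kernel with an explicit length scale and signal standard deviation. The function name `rbf_kernel`, the parameter names `length_scale` and `sigma_f`, and this particular parameterization are assumptions for illustration, not the definition used in the code above.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, sigma_f=1.0):
    """sigma_f^2 * exp(-||a - b||^2 / (2 * length_scale^2))."""
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * a @ b.T
    return sigma_f**2 * np.exp(-0.5 * sqdist / length_scale**2)

x0 = np.array([[0.0]])
x1 = np.array([[1.0]])

# Covariance between two inputs one unit apart, for several length scales.
for ell in (0.3, 1.0, 3.0):
    k01 = rbf_kernel(x0, x1, length_scale=ell)[0, 0]
    print(f"length_scale={ell}: k(0, 1) = {k01:.4f}")

# sigma_f only rescales the prior variance k(x, x).
k00_scaled = rbf_kernel(x0, x0, sigma_f=2.0)[0, 0]
print("k(0, 0) with sigma_f=2:", k00_scaled)
```

A short length scale makes distant inputs nearly independent (wiggly functions), while a long one keeps them strongly correlated (slowly varying functions).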
A Gaussian process (GP) is a probability distribution over possible functions that fit a set of points. [1] GPs are nonparametric models that model the function directly. Thus, a GP provides a distribution (with uncertainty) for the predicted value rather than a single value. In robot learning, quantifying uncertainty can be extremely valuable for learning efficiently. The areas with the least certainty should be explored next. This is the main idea behind Bayesian optimization. [9] Moreover, prior knowledge and specifications about the shape of the model can be added by selecting different kernel functions. [1] Priors can be specified based on criteria such as whether the model is smooth, whether it is sparse, whether it can change drastically, and whether it needs to be differentiable.
- For simplicity and ease of understanding, I skip much of the math and technical detail. Read the first two chapters of the textbook Gaussian Process for Machine Learning [1] several times to get a solid understanding of GPR, including points such as **Gaussian process regression is a linear smoother.**
- One of the trickiest parts of understanding GP is the mapping among spaces: from input space to latent (feature) space and back to output space. You can get some feeling for these spaces by reading about autoencoders.
[1] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.
[2] R. Turner, “ML Tutorial: Gaussian Processes - YouTube,” 2017. [Online]. Available: https://www.youtube.com/watch?v=92-98SYOdlY&feature=emb_title.
[3] A. Ng, “Multivariate Gaussian Distribution - Stanford University | Coursera,” 2015. [Online]. Available: https://www.coursera.org/learn/machine-learning/lecture/Cf8DF/multivariate-gaussian-distribution.
[4] D. Lee, “Multivariate Gaussian Distribution - University of Pennsylvania | Coursera,” 2017. [Online]. Available: https://www.coursera.org/learn/robotics-learning/lecture/26CFf/1-3-1-multivariate-gaussian-distribution.
[5] F. Dai, Machine Learning Cheat Sheet: Classical equations and diagrams in machine learning. 2017.
[6] N. de Freitas, “Machine learning - Introduction to Gaussian processes - YouTube,” 2013. [Online]. Available: https://www.youtube.com/watch?v=4vGiHC35j9s&t=1424s.
[7] Y. Shi, “Gaussian Process, not quite for dummies,” 2019. [Online]. Available: https://yugeten.github.io/posts/2019/09/GP/.
[8] D. Duvenaud, “Kernel Cookbook,” 2014. [Online]. Available: https://www.cs.toronto.edu/~duvenaud/cookbook/.
[9] Y. Gal, “What my deep model doesn’t know.,” 2015. [Online]. Available: http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html.
Visualizing 3D plots of a
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
# Our 2-dimensional distribution will be over variables X and Y
N = 60
X = np.linspace(-3, 3, N)
Y = np.linspace(-3, 4, N)
X, Y = np.meshgrid(X, Y)
# Mean vector and covariance matrix
mu = np.array([0., 1.])
Sigma = np.array([[ 1. , 0.8], [0.8, 1.]])
# Pack X and Y into a single 3-dimensional array
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
def multivariate_gaussian(pos, mu, Sigma):
    """Return the multivariate Gaussian distribution on array pos.
    pos is an array constructed by packing the meshed arrays of variables
    x_1, x_2, x_3, ..., x_k into its _last_ dimension.
    """
    n = mu.shape[0]
    Sigma_det = np.linalg.det(Sigma)
    Sigma_inv = np.linalg.inv(Sigma)
    N = np.sqrt((2*np.pi)**n * Sigma_det)
    # This einsum call calculates (x-mu)T.Sigma-1.(x-mu) in a vectorized
    # way across all the input variables.
    fac = np.einsum('...k,kl,...l->...', pos-mu, Sigma_inv, pos-mu)
    return np.exp(-fac / 2) / N
# The distribution on the variables X, Y packed into pos.
Z = multivariate_gaussian(pos, mu, Sigma)
# Create a surface plot and projected filled contour plot under it.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection=...) is not supported by recent Matplotlib releases
ax.plot_surface(X, Y, Z, rstride=3, cstride=3, linewidth=1, antialiased=True,
                cmap=cm.viridis)
cset = ax.contourf(X, Y, Z, zdir='z', offset=-0.2, cmap=cm.viridis)
# Adjust the limits, ticks and view angle
ax.set_zlim(-0.2,0.2)
ax.set_zticks(np.linspace(0,0.2,5))
ax.view_init(30, -100)
ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$x_2$')
ax.set_zlabel(r'$P(x_1, x_2)$')
plt.title('mean, cov = [0., 1.], [(1., 0.8), (0.8, 1.)]')
plt.savefig('2d_gaussian3D_0.8.png', dpi=600)
plt.show()