When applying different kernels through scikit-learn's Gaussian process regression, I observe certain instances with a positive log-marginal-likelihood output, which indicates a (marginal) likelihood greater than 1. The relevant section where scikit-learn computes the marginal likelihood is in lines 389 - 460 of the code here, which I have reimplemented in the code below for clarity. With both a standard Matern kernel and a custom heteroscedastic kernel I can observe a positive log-likelihood; for the latter kernel it depends on my training data.
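To make explicit what is being computed: the log marginal likelihood is the log of a Gaussian density, $\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\mathbf{y}^{\top} K^{-1}\mathbf{y} - \tfrac{1}{2}\log\lvert K\rvert - \tfrac{n}{2}\log 2\pi$, and a density value can exceed 1, so its log can be positive. A minimal one-dimensional illustration (not part of the script below):

from scipy.stats import norm
# A Gaussian density with a small standard deviation exceeds 1 near its mean,
# so the corresponding log-density is positive.
print(norm(loc=0.0, scale=0.1).pdf(0.0))     # ~3.99
print(norm(loc=0.0, scale=0.1).logpdf(0.0))  # ~1.38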
Overall, if the marginal likelihood can be greater than 1, is it still valid/sensible to compare likelihood estimates for models that use different kernels (e.g. Matern, RBF, heteroscedastic)? The reason I ask is that one author states: "the best kernel can be selected by picking the one which maximizes the likelihood." Yet if the likelihoods under different kernels have different upper bounds, why is it appropriate to compare them across kernels?
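To illustrate the kind of comparison I have in mind, here is a minimal self-contained sketch (toy data, not my actual problem) that fits GPs with two different kernels to the same data and compares their fitted log marginal likelihoods via the log_marginal_likelihood_value_ attribute:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel as C

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.randn(30)

for k in [C(1.0) * Matern(length_scale=1.0, nu=1.5), C(1.0) * RBF(length_scale=1.0)]:
    gp_k = GaussianProcessRegressor(kernel=k, n_restarts_optimizer=5).fit(X_demo, y_demo)
    print(gp_k.kernel_, gp_k.log_marginal_likelihood_value_)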
One can of course apply k-fold cross-validation to compare different models sensibly, but I am trying to understand the validity of comparing marginal likelihoods directly. This is touched upon here, but I am failing to see why it is appropriate to compare likelihoods (i.e. $a$ and $b$) from different kernels if they are not each individually normalized.
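For completeness, the cross-validation alternative I am referring to would look roughly like this (again just a sketch, reusing the toy data and kernels from the snippet above):

from sklearn.model_selection import cross_val_score
# Compare kernels by 5-fold CV predictive error instead of by marginal likelihood
for k in [C(1.0) * Matern(length_scale=1.0, nu=1.5), C(1.0) * RBF(length_scale=1.0)]:
    scores = cross_val_score(GaussianProcessRegressor(kernel=k), X_demo, y_demo,
                             cv=5, scoring="neg_mean_squared_error")
    print(k, scores.mean())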
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel as C
from mpl_toolkits.mplot3d import Axes3D
from scipy.linalg import cholesky, cho_solve, solve_triangular
from gp_extras.kernels import HeteroscedasticKernel
from sklearn.cluster import KMeans
# Example independent variable (observations)
X = np.array([[0.,0.], [1.,0.], [2.,0.], [3.,0.], [4.,0.],
[5.,0.], [6.,0.], [7.,0.], [8.,0.], [9.,0.], [10.,0.],
[11.,0.], [12.,0.], [13.,0.], [14.,0.],
[0.,1.], [1.,1.], [2.,1.], [3.,1.], [4.,1.],
[5.,1.], [6.,1.], [7.,1.], [8.,1.], [9.,1.], [10.,1.],
[11.,1.], [12.,1.], [13.,1.], [14.,1.],
[0.,2.], [1.,2.], [2.,2.], [3.,2.], [4.,2.],
[5.,2.], [6.,2.], [7.,2.], [8.,2.], [9.,2.], [10.,2.],
[11.,2.], [12.,2.], [13.,2.], [14.,2.]])#.T
# Example dependent variable (observations) - noiseless case
y = np.array([4.0, 3.98, 4.01, 3.95, 3.9, 3.84,3.8,
3.73, 2.7, 1.64, 0.62, 0.59, 0.3,
0.1, 0.1,
4.4, 3.9, 4.05, 3.9, 3.5, 3.4,3.3,
3.23, 2.6, 1.6, 0.6, 0.5, 0.32,
0.05, 0.02,
4.0, 3.86, 3.88, 3.76, 3.6, 3.4,3.2,
3.13, 2.5, 1.6, 0.55, 0.51, 0.23,
0.11, 0.01])
len_x1 = 20
len_x2 = 100
x1_min = 0
x2_min = 0
x1_max = 14
x2_max = 5
x1 = np.linspace(x1_min, x1_max, len_x1)
x2 = np.linspace(x2_min, x2_max, len_x2)
# Build a dense grid of points at which to evaluate the GP prediction
i = 0
inputs_x = []
while i < len(x1):
    j = 0
    while j < len(x2):
        inputs_x.append([x1[i], x2[j]])
        j = j + 1
    i = i + 1
inputs_x_array = np.array(inputs_x)
prototypes = KMeans(n_clusters=8).fit(X).cluster_centers_
kernel = C(1.0, (1e-10, 1000)) * RBF(length_scale=[10., 100.], length_scale_bounds=[(1e-3, 1e3), (1e-4, 1e4)]) \
    + HeteroscedasticKernel.construct(prototypes, 1e-3, (1e-10, 50.0),
                                      gamma=1.0, gamma_bounds="fixed")
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=100)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y.reshape(-1,1)) #removing reshape results in a different error
# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(inputs_x_array, return_std=True)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(inputs_x_array[:,0],inputs_x_array[:,1],y_pred)
ax.scatter(X[:,0],X[:,1],y,color='orange')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()
print(gp.kernel_) #gives optimized hyperparameters
print(gp.log_marginal_likelihood(gp.kernel_.theta)) #log-likelihood, matches with below
#Manually compute log-likelihood
alpha = 1e-10  # jitter added to the diagonal (GaussianProcessRegressor's default alpha)
input_prediction = gp.predict(X, return_std=True)
K, K_gradient = gp.kernel_(X, eval_gradient=True)
K[np.diag_indices_from(K)] += alpha
L = cholesky(K, lower=True)  # Line 2 of Alg. 2.1 in Rasmussen & Williams (GPML)
# Support multi-dimensional output of self.y_train_
if y.ndim == 1:
    y = y[:, np.newaxis]
alpha = cho_solve((L, True), y)  # solves K alpha = y, i.e. alpha = K^{-1} y
# log p(y|X) = -0.5 y^T K^{-1} y - 0.5 log|K| - (n/2) log(2*pi)
log_likelihood_dims = -0.5 * np.einsum("ik,ik->k", y, alpha)
log_likelihood_dims -= np.log(np.diag(L)).sum()       # 0.5 log|K| via the Cholesky factor
log_likelihood_dims -= (K.shape[0] / 2.) * np.log(2 * np.pi)
log_likelihood = log_likelihood_dims.sum(-1)
print(log_likelihood)
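As a follow-up diagnostic (not part of the original script), the three contributions to the log marginal likelihood can be printed separately, continuing from the variables computed above:

# Split the log marginal likelihood into its three terms
data_fit = -0.5 * np.einsum("ik,ik->k", y, alpha).sum()    # -0.5 y^T K^{-1} y
complexity = -np.log(np.diag(L)).sum()                     # -0.5 log|K|
normalization = -(K.shape[0] / 2.) * np.log(2 * np.pi)     # -(n/2) log(2*pi)
print(data_fit, complexity, normalization, data_fit + complexity + normalization)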
In[3]: np.linalg.eig(K)
Out[3]:
(array([8.78443701e+01, 6.76108563e+01, 4.38415906e+01, 2.40626735e+01,
1.12470581e+01, 4.50837332e+00, 1.56231380e+00, 4.74477243e-01,
1.41372308e-01, 1.32231682e-01, 1.13233619e-01, 7.84739722e-02,
4.92567420e-02, 4.15550455e-02, 3.09648758e-02, 2.93121622e-02,
2.76453813e-02, 2.72588783e-02, 2.48037415e-02, 2.31885809e-02,
2.23526761e-02, 2.02375375e-02, 1.83453095e-02, 3.33699549e-03,
1.70944227e-02, 1.60381437e-02, 4.10656774e-03, 4.62208725e-03,
5.41646496e-03, 6.50004488e-03, 1.44250326e-02, 1.36901158e-02,
7.83641802e-03, 7.90280520e-03, 7.98359054e-03, 8.38240899e-03,
9.17251925e-03, 9.69314998e-03, 1.02701587e-02, 1.07483982e-02,
1.25997364e-02, 1.21642399e-02, 1.18388345e-02, 1.16249569e-02,
1.17011001e-02]),
array([[ 0.07023553, -0.13435774, -0.18677223, ..., -0.0014377 ,
0.00174079, -0.00230936],
[ 0.09682778, -0.16904703, -0.19771417, ..., 0.00619927,
-0.01687944, 0.04077999],
[ 0.12326951, -0.18963877, -0.16757871, ..., -0.01141089,
0.0046866 , 0.00704309],
...,
[ 0.12319186, 0.18963063, -0.16763488, ..., 0.00271217,
-0.06832765, 0.17895407],
[ 0.09675482, 0.16900838, -0.19770022, ..., -0.00564892,
-0.04467018, 0.12976537],
[ 0.07017636, 0.13431011, -0.18671644, ..., 0.00503131,
0.07055682, -0.20096559]]))
In[4]: K
Out[4]:
array([[5.39391645e+00, 4.91377455e+00, 3.76948685e+00, ...,
1.59029257e-05, 1.74596451e-06, 1.60634366e-07],
[4.91377455e+00, 5.39140942e+00, 4.91377455e+00, ...,
1.21384523e-04, 1.59029257e-05, 1.74596451e-06],
[3.76948685e+00, 4.91377455e+00, 5.38696083e+00, ...,
7.76415642e-04, 1.21384523e-04, 1.59029257e-05],
...,
[1.59029257e-05, 1.21384523e-04, 7.76415642e-04, ...,
5.37552310e+00, 4.91377455e+00, 3.76948685e+00],
[1.74596451e-06, 1.59029257e-05, 1.21384523e-04, ...,
4.91377455e+00, 5.37547535e+00, 4.91377455e+00],
[1.60634366e-07, 1.74596451e-06, 1.59029257e-05, ...,
3.76948685e+00, 4.91377455e+00, 5.37541417e+00]])
(The covariance matrix, $K$, is symmetric with all positive elements and all positive eigenvalues for either kernel, i.e. it is positive definite.)
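A quick check of that claim, continuing from the K computed above:

print(np.allclose(K, K.T))                 # True -> symmetric
print(np.all(np.linalg.eigvalsh(K) > 0))   # True -> all eigenvalues positive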