When applying different kernels through scikit-learn's Gaussian process regression, I observe certain instances with a positive log-marginal-likelihood output, which indicates a (marginal) likelihood greater than 1. The relevant section where scikit-learn computes the marginal likelihood is in lines 389 - 460 of the code here, which I have reimplemented in the code below for clarity. With both a standard Matern kernel and a custom heteroscedastic kernel I can observe a positive log-likelihood; for the latter kernel it depends on my training data.
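To make explicit what is being computed: the log marginal likelihood is the log of a Gaussian density, $\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\mathbf{y}^{\top} K^{-1}\mathbf{y} - \tfrac{1}{2}\log\lvert K\rvert - \tfrac{n}{2}\log 2\pi$, and a density value can exceed 1, so its log can be positive. A minimal one-dimensional illustration (not part of the script below):

from scipy.stats import norm
# A Gaussian density with a small standard deviation exceeds 1 near its mean,
# so the corresponding log-density is positive.
print(norm(loc=0.0, scale=0.1).pdf(0.0))     # ~3.99
print(norm(loc=0.0, scale=0.1).logpdf(0.0))  # ~1.38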
Overall, if the marginal likelihood can be greater than 1, is it still valid/sensible to compare likelihood estimates for models that use different kernels (e.g. Matern, RBF, heteroscedastic)? The reason I ask is that one author states: "the best kernel can be selected by picking the one which maximizes the likelihood." Yet if the likelihoods under different kernels have different upper bounds, why is it appropriate to compare them across kernels?
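To illustrate the kind of comparison I have in mind, here is a minimal self-contained sketch (toy data, not my actual problem) that fits GPs with two different kernels to the same data and compares their fitted log marginal likelihoods via the log_marginal_likelihood_value_ attribute:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel as C

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.randn(30)

for k in [C(1.0) * Matern(length_scale=1.0, nu=1.5), C(1.0) * RBF(length_scale=1.0)]:
    gp_k = GaussianProcessRegressor(kernel=k, n_restarts_optimizer=5).fit(X_demo, y_demo)
    print(gp_k.kernel_, gp_k.log_marginal_likelihood_value_)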
One can of course apply k-fold cross-validation to compare different models sensibly, but I am trying to understand the validity of comparing marginal likelihoods directly. This is touched upon here, but I am failing to see why it is appropriate to compare likelihoods (i.e. $a$ and $b$) from different kernels if they are not each individually normalized.
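For completeness, the cross-validation alternative I am referring to would look roughly like this (again just a sketch, reusing the toy data and kernels from the snippet above):

from sklearn.model_selection import cross_val_score
# Compare kernels by 5-fold CV predictive error instead of by marginal likelihood
for k in [C(1.0) * Matern(length_scale=1.0, nu=1.5), C(1.0) * RBF(length_scale=1.0)]:
    scores = cross_val_score(GaussianProcessRegressor(kernel=k), X_demo, y_demo,
                             cv=5, scoring="neg_mean_squared_error")
    print(k, scores.mean())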
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel as C
from mpl_toolkits.mplot3d import Axes3D
from scipy.linalg import cholesky, cho_solve, solve_triangular
from gp_extras.kernels import HeteroscedasticKernel
from sklearn.cluster import KMeans
# Example independent variable (observations)
X = np.array([[0.,0.], [1.,0.], [2.,0.], [3.,0.], [4.,0.],
[5.,0.], [6.,0.], [7.,0.], [8.,0.], [9.,0.], [10.,0.],
[11.,0.], [12.,0.], [13.,0.], [14.,0.],
[0.,1.], [1.,1.], [2.,1.], [3.,1.], [4.,1.],
[5.,1.], [6.,1.], [7.,1.], [8.,1.], [9.,1.], [10.,1.],
[11.,1.], [12.,1.], [13.,1.], [14.,1.],
[0.,2.], [1.,2.], [2.,2.], [3.,2.], [4.,2.],
[5.,2.], [6.,2.], [7.,2.], [8.,2.], [9.,2.], [10.,2.],
[11.,2.], [12.,2.], [13.,2.], [14.,2.]])#.T
# Example dependent variable (observations) - noiseless case
y = np.array([4.0, 3.98, 4.01, 3.95, 3.9, 3.84,3.8,
3.73, 2.7, 1.64, 0.62, 0.59, 0.3,
0.1, 0.1,
4.4, 3.9, 4.05, 3.9, 3.5, 3.4,3.3,
3.23, 2.6, 1.6, 0.6, 0.5, 0.32,
0.05, 0.02,
4.0, 3.86, 3.88, 3.76, 3.6, 3.4,3.2,
3.13, 2.5, 1.6, 0.55, 0.51, 0.23,
0.11, 0.01])
len_x1 = 20
len_x2 = 100
x1_min = 0
x2_min = 0
x1_max = 14
x2_max = 5
x1 = np.linspace(x1_min, x1_max, len_x1)
x2 = np.linspace(x2_min, x2_max, len_x2)
# Build a dense grid of points at which to evaluate the GP prediction
i = 0
inputs_x = []
while i < len(x1):
    j = 0
    while j < len(x2):
        inputs_x.append([x1[i], x2[j]])
        j = j + 1
    i = i + 1
inputs_x_array = np.array(inputs_x)
prototypes = KMeans(n_clusters=8).fit(X).cluster_centers_
kernel = C(1.0, (1e-10, 1000)) * RBF(length_scale=[10., 100.], length_scale_bounds=[(1e-3, 1e3), (1e-4, 1e4)]) \
    + HeteroscedasticKernel.construct(prototypes, 1e-3, (1e-10, 50.0),
                                      gamma=1.0, gamma_bounds="fixed")
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=100)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y.reshape(-1,1)) #removing reshape results in a different error
# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(inputs_x_array, return_std=True)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(inputs_x_array[:,0],inputs_x_array[:,1],y_pred)
ax.scatter(X[:,0],X[:,1],y,color='orange')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()
print(gp.kernel_) #gives optimized hyperparameters
print(gp.log_marginal_likelihood(gp.kernel_.theta)) #log-likelihood, matches with below
#Manually compute log-likelihood
alpha = 1e-10  # jitter added to the diagonal (GaussianProcessRegressor's default alpha)
input_prediction = gp.predict(X, return_std=True)
K, K_gradient = gp.kernel_(X, eval_gradient=True)
K[np.diag_indices_from(K)] += alpha
L = cholesky(K, lower=True)  # Line 2 of Alg. 2.1 in Rasmussen & Williams (GPML)
# Support multi-dimensional output of self.y_train_
if y.ndim == 1:
    y = y[:, np.newaxis]
alpha = cho_solve((L, True), y)  # solves K alpha = y, i.e. alpha = K^{-1} y
# log p(y|X) = -0.5 y^T K^{-1} y - 0.5 log|K| - (n/2) log(2*pi)
log_likelihood_dims = -0.5 * np.einsum("ik,ik->k", y, alpha)
log_likelihood_dims -= np.log(np.diag(L)).sum()       # 0.5 log|K| via the Cholesky factor
log_likelihood_dims -= (K.shape[0] / 2.) * np.log(2 * np.pi)
log_likelihood = log_likelihood_dims.sum(-1)
print(log_likelihood)
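As a follow-up diagnostic (not part of the original script), the three contributions to the log marginal likelihood can be printed separately, continuing from the variables computed above:

# Split the log marginal likelihood into its three terms
data_fit = -0.5 * np.einsum("ik,ik->k", y, alpha).sum()    # -0.5 y^T K^{-1} y
complexity = -np.log(np.diag(L)).sum()                     # -0.5 log|K|
normalization = -(K.shape[0] / 2.) * np.log(2 * np.pi)     # -(n/2) log(2*pi)
print(data_fit, complexity, normalization, data_fit + complexity + normalization)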
In[3]: np.linalg.eig(K)
Out[3]:
(array([8.78443701e+01, 6.76108563e+01, 4.38415906e+01, 2.40626735e+01,
1.12470581e+01, 4.50837332e+00, 1.56231380e+00, 4.74477243e-01,
1.41372308e-01, 1.32231682e-01, 1.13233619e-01, 7.84739722e-02,
4.92567420e-02, 4.15550455e-02, 3.09648758e-02, 2.93121622e-02,
2.76453813e-02, 2.72588783e-02, 2.48037415e-02, 2.31885809e-02,
2.23526761e-02, 2.02375375e-02, 1.83453095e-02, 3.33699549e-03,
1.70944227e-02, 1.60381437e-02, 4.10656774e-03, 4.62208725e-03,
5.41646496e-03, 6.50004488e-03, 1.44250326e-02, 1.36901158e-02,
7.83641802e-03, 7.90280520e-03, 7.98359054e-03, 8.38240899e-03,
9.17251925e-03, 9.69314998e-03, 1.02701587e-02, 1.07483982e-02,
1.25997364e-02, 1.21642399e-02, 1.18388345e-02, 1.16249569e-02,
1.17011001e-02]),
array([[ 0.07023553, -0.13435774, -0.18677223, ..., -0.0014377 ,
0.00174079, -0.00230936],
[ 0.09682778, -0.16904703, -0.19771417, ..., 0.00619927,
-0.01687944, 0.04077999],
[ 0.12326951, -0.18963877, -0.16757871, ..., -0.01141089,
0.0046866 , 0.00704309],
...,
[ 0.12319186, 0.18963063, -0.16763488, ..., 0.00271217,
-0.06832765, 0.17895407],
[ 0.09675482, 0.16900838, -0.19770022, ..., -0.00564892,
-0.04467018, 0.12976537],
[ 0.07017636, 0.13431011, -0.18671644, ..., 0.00503131,
0.07055682, -0.20096559]]))
In[4]: K
Out[4]:
array([[5.39391645e+00, 4.91377455e+00, 3.76948685e+00, ...,
1.59029257e-05, 1.74596451e-06, 1.60634366e-07],
[4.91377455e+00, 5.39140942e+00, 4.91377455e+00, ...,
1.21384523e-04, 1.59029257e-05, 1.74596451e-06],
[3.76948685e+00, 4.91377455e+00, 5.38696083e+00, ...,
7.76415642e-04, 1.21384523e-04, 1.59029257e-05],
...,
[1.59029257e-05, 1.21384523e-04, 7.76415642e-04, ...,
5.37552310e+00, 4.91377455e+00, 3.76948685e+00],
[1.74596451e-06, 1.59029257e-05, 1.21384523e-04, ...,
4.91377455e+00, 5.37547535e+00, 4.91377455e+00],
[1.60634366e-07, 1.74596451e-06, 1.59029257e-05, ...,
3.76948685e+00, 4.91377455e+00, 5.37541417e+00]])
(The covariance matrix, $K$, is symmetric with all positive elements and all positive eigenvalues for either kernel, i.e. it is positive definite.)
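A quick check of that claim, continuing from the K computed above:

print(np.allclose(K, K.T))                 # True -> symmetric
print(np.all(np.linalg.eigvalsh(K) > 0))   # True -> all eigenvalues positive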