
The variance of the derivative of a Gaussian process, $f$, is given by (9.1):

$$ \operatorname{Var}\!\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 k(x,x)}{\partial x^2}, $$

where $k(\cdot, \cdot)$ is the covariance function of $f$ and is positive definite. But when evaluating the error corresponding to $\frac{\partial f}{\partial x}$, we find that $\frac{\partial^2 k(x,x)}{\partial x^2}$ is not necessarily positive everywhere. Is the above definition of $\operatorname{Var}\!\left(\frac{\partial f}{\partial x}\right)$ actually correct? Is it valid to simply take the absolute value of this quantity when computing the error, or should the variance be handled differently?
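
To be explicit about the notation (this is my reading of (9.1); see also the comments below): by $\frac{\partial^2 k(x,x)}{\partial x^2}$ I mean the second derivative of the diagonal $g(x) := k(x,x)$. Writing $x'$ for the second argument of the kernel, the chain rule gives

$$ \frac{\mathrm{d}^2}{\mathrm{d}x^2}\, k(x,x) = \left[ \frac{\partial^2 k(x,x')}{\partial x^2} + 2\,\frac{\partial^2 k(x,x')}{\partial x\, \partial x'} + \frac{\partial^2 k(x,x')}{\partial x'^2} \right]_{x'=x}, $$

which in general is not the same quantity as the mixed partial $\left.\frac{\partial^2 k(x,x')}{\partial x\, \partial x'}\right|_{x'=x}$ on its own.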

Example:

import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

def f(x):
    """The function to predict."""
    return 1.5*(1. - np.tanh(100.*(x-0.96))) + 1.5*x*(x-0.95) + 0.4 + 1.5*(1.-x)* np.random.random(x.shape)

# Define the kernel: constant amplitude times a squared-exponential (RBF)
kernel = C(10.0, (1e-5, 1e5)) * RBF(10.0, (1e-5, 1e5))

X = np.array([0.803,0.827,0.861,0.875,0.892,0.905,
                0.91,0.92,0.925,0.935,0.941,0.947,0.96,
                0.974,0.985,0.995,1.0])
X = np.atleast_2d(X).T

# Observations and noise
y = f(X).ravel() 
noise = np.linspace(0.4,0.3,len(X))
y += noise

# Instantiate a Gaussian Process model
gp = GaussianProcessRegressor(kernel=kernel, alpha=noise ** 2,
                              n_restarts_optimizer=10)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on a dense grid over the x-axis (ask for the predictive standard deviation as well)
x = np.atleast_2d(np.linspace(0.8, 1.02, 1000)).T
y_pred, sigma = gp.predict(x, return_std=True)

plt.figure() 
plt.errorbar(X.ravel(), y, noise, fmt='k.', markersize=10, label=u'Observations')
plt.plot(x, y_pred, 'k-', label=u'Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.1, fc='k', ec='None', label='95% confidence interval')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(0.8, 1.02)
plt.ylim(0, 5)
plt.legend(loc='lower left')
plt.show()

# Finite-difference derivatives of the predictive variance k(x, x) = sigma**2
dkdx = np.gradient(sigma**2, x[:, 0])
d2kdx2 = np.gradient(dkdx, x[:, 0])

plt.figure()
plt.plot(x, sigma**2., 'k-') 
plt.ylabel('k')
plt.xlabel('x') 
plt.show()

plt.figure()
plt.plot(x, dkdx, 'r-')
plt.ylabel('dk/dx')
plt.xlabel('x') 
plt.show()

plt.figure()
plt.plot(x, d2kdx2, 'g-')
plt.ylabel('d2k/dx2')
plt.xlabel('x') 
plt.show()

[Plots: the GP fit with observations and the 95% confidence interval, followed by the predictive variance $k(x,x)=\sigma^2(x)$, $dk/dx$, and $d^2k/dx^2$ versus $x$.]

As a simple analytic case, consider a modified squared-exponential kernel centered at $x_a$ and $x_b$: $k(x_i,x_j) = \exp(-(x_i-x_a)^2 - (x_j-x_b)^2)$. This is positive definite. But $\frac{\partial k(x_i,x_j)}{\partial x_i} = -2(x_i - x_a)\, k(x_i, x_j)$ and $\frac{\partial^2 k(x_i,x_j)}{\partial x_i\, \partial x_j} = 4(x_i - x_a)(x_j - x_b)\, k(x_i, x_j)$, both of which can be negative.

As you may note, $k$ itself is always greater than zero, but ${\partial k(x,x)}/{\partial x}$ and ${\partial^2 k(x,x)}/{\partial x^2}$ can still be negative.
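
As a quick symbolic check of these derivative expressions, here is a minimal sketch using sympy (not part of the example above; the symbol names are mine):

import sympy as sp

# Symbols for the two inputs and the two fixed centers x_a, x_b
xi, xj, xa, xb = sp.symbols('x_i x_j x_a x_b', real=True)
k = sp.exp(-(xi - xa)**2 - (xj - xb)**2)

# First derivative in x_i and mixed second derivative in x_i, x_j
dk_dxi = sp.diff(k, xi)
d2k_dxi_dxj = sp.diff(k, xi, xj)

# Dividing by the (strictly positive) kernel isolates the prefactors
print(sp.simplify(dk_dxi / k))       # equals -2*(x_i - x_a)
print(sp.simplify(d2k_dxi_dxj / k))  # equals 4*(x_i - x_a)*(x_j - x_b)

Both prefactors change sign with $(x_i - x_a)$ and $(x_j - x_b)$, so neither derivative is sign-definite.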

Mathews24
    Could you clarify what the ambiguous notation "$\frac {\partial ^2 k(x,x)}{\partial x^2}$" means? – whuber Apr 10 '19 at 20:42
  • In general, the covariance is given by $Cov(f_i,f_j) = k(x_i,x_j)$. What I am simply trying to express above is the second derivative of the diagonal of the covariance matrix, i.e. a point on the diagonal is given by $k(x_i,x_i)$ and the second derivative of this quantity would be $\frac {\partial ^2 k(x_i,x_i)}{\partial x_i^2}$. I simply took off the subscript above, but this notation should be consistent with that given by (9.1). – Mathews24 Apr 11 '19 at 03:40
  • That misinterpretation of the formula could be the source of your problem. Regardless, what is your basis for asserting this derivative is "not necessarily positive everywhere"? Positive-definiteness implies that the terms actually defined by (9.1) *are* positive. – whuber Apr 11 '19 at 13:04
  • @whuber I have added example code and plots to demonstrate that the derivatives are not necessarily positive everywhere. Positive definiteness applies to $k(x_i,x_j)$, but it does not appear that this extends directly to its derivatives, hence I am wondering if there is a modification to (9.1) in practice (e.g. take its absolute value). – Mathews24 Apr 11 '19 at 14:16
  • Thank you for working to clarify the situation. Unfortunately, all kinds of things could be going wrong with this code. For instance, as far as I can tell, your plots are fitting smoothing splines through fairly coarsely sampled data. How can we tell that where the second derivative plot dips below zero isn't just an artifact of the plotting mechanism? – whuber Apr 11 '19 at 14:23
  • @whuber The coarseness of the data is immaterial. It's the coarseness of my predictions that matter (which is currently set to 1000 intervals...but you could change this to a million or billion in the code and still get negative $dk/dx$ and $d^2k/dx^2$). You could simply change the coarseness of the data, which I have also tested, and negative values for the derivatives are once again possible to be observed. It's not simply an artifact--there is no constraint on $k(x,x)$ such that it must be monotonically increasing or void of changes in concavity, which is why I am questioning (9.1). – Mathews24 Apr 11 '19 at 14:47
  • @whuber I have now also added a simple analytic example. – Mathews24 Apr 11 '19 at 19:32
  • If you're going to ask the same question simultaneously on two different forums, you should give the associated link: https://math.stackexchange.com/questions/3184165/handling-negative-variances-on-the-derivative-of-gaussian-processes. Otherwise you can be wasting folks' time if hints or answers appear on the other site. – JimB Apr 11 '19 at 21:42
  • Maybe I'm misinterpreting your analytic example $k$ but are you sure it describes a legitimate covariance structure? I get a correlation of 1 for all $i,j$ pairs. – JimB Apr 11 '19 at 22:49
  • The variance has to be positive! Your definition of the kernel to be differentiated is incorrect. You must differentiate the kernel as a function of two variables, not centred versions. My answer to the same question should clarify your error: https://stats.stackexchange.com/a/188027/8298 – g g Apr 11 '19 at 22:50
  • @JimB Thank you, I should've included the link. [These](https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties) are the necessary properties. – Mathews24 Apr 11 '19 at 23:59
  • I agree it has to be positive. The terms $x_a$ and $x_b$ are simply constants, and this would be an example of a non-stationary kernel (e.g. akin to the local Gaussian kernel [here](https://george.readthedocs.io/en/latest/user/kernels/)), no? Are non-stationary kernels not permitted/valid? @gg – Mathews24 Apr 12 '19 at 00:00
  • How did you check that $k$ is positive definite? Your $k$ is not a kernel. If $x_a\neq x_b$ it is not symmetric. And if $x_a=x_b$ it is a rank one matrix. I am not sure if this makes sense for regression/interpolation. You want positive definite! – g g Apr 12 '19 at 09:03

0 Answers