Your question is closely related to inducing point approximations. Below is a brief recap of standard GP regression, followed by the most popular approximation techniques.
Basic GP
You are given a training set $\mathcal{D} = \{(\mathbf{x}_i,y_i)\}_{1\leq i\leq n}$ with inputs $\mathbf{x}_i$ being (possibly high-dimensional) vectors and noisy scalar outputs $y_i$. The task is to predict new function values $\mathbf{f}_\star$ (or noisy outputs $\mathbf{y}_\star$) at new inputs $\mathbf{X}_\star = [\mathbf{x}_{1\star}, \dots, \mathbf{x}_{n\star}]$. The model is
$$ \forall i\,,\quad y_i = f(\mathbf{x}_i) + \varepsilon_i\;,\quad \varepsilon_i\sim\mathcal{N}(0,\sigma_\mathrm{n}^2)$$
A GP prior over a function $f$ implies that given a sequence of inputs $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ and the corresponding function values $\mathbf{f} = [\mathrm{f}_1, \dots, \mathrm{f}_n] = [f(\mathbf{x}_1),\dots,\, f(\mathbf{x}_n)]$ we have
$$\mathbb{P}(\mathbf{f}\mid \mathbf{X}) = \mathcal{N}(\mathbf{0},\mathbf{K})$$
with a kernel matrix $\mathbf{K}$ which characterizes the covariance as $\mathbf{K}_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$. Skipping the usual derivations in the GP inference process, we usually obtain that the predictive distribution is
$$\mathbb{P}(\mathbf{f}_\star\mid \mathbf{X}_\star, \mathbf{X},\mathbf{y}) = \mathcal{N}(\pmb{\mu}_\star, \pmb{\Sigma}_\star)$$
with explicit form
$$\begin{cases}
\pmb{\mu}_\star =& \mathbf{K}_{\star,\mathbf{f}}\left(\mathbf{K}_{\mathbf{f},\mathbf{f}}+\sigma_\mathrm{n}^2\mathbf{I}\right)^{-1}\mathbf{y}\\
\pmb{\Sigma}_\star =&\mathbf{K}_{\star,\star} - \mathbf{K}_{\star,\mathbf{f}}\left(\mathbf{K}_{\mathbf{f},\mathbf{f}}+\sigma_\mathrm{n}^2\mathbf{I}\right)^{-1}\mathbf{K}_{\mathbf{f},\star}
\end{cases}$$
But, as you know, inverting the $n\times n$ matrix $\mathbf{K}_{\mathbf{f},\mathbf{f}}+\sigma_\mathrm{n}^2\mathbf{I}$ has $O(n^3)$ complexity, which becomes prohibitive for $n \gtrsim 10^3$.
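To make the formulas concrete, here is a minimal NumPy sketch of the exact predictive equations above (the squared-exponential kernel, lengthscale, and noise level are illustrative choices on my part, not something fixed by the problem):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, z) evaluated between the rows of A and B."""
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    """Exact GP predictive mean and covariance; O(n^3) because of the Cholesky of K + sigma_n^2 I."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_star = rbf_kernel(X_star, X)                       # K_{*,f}
    L = np.linalg.cholesky(K)                            # more stable than a direct inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma_n^2 I)^{-1} y
    mu = K_star @ alpha
    V = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_star, X_star) - V.T @ V           # K_{*,*} - K_{*,f} (K + sigma_n^2 I)^{-1} K_{f,*}
    return mu, cov
```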
Inducing points
Denote by $\mathbf{u} = [\mathrm{u}_1,\dots,\mathrm{u}_m]$ a set of latent variables called inducing variables. They are the function values at a set of $m$ input locations $\mathbf{X}_\mathbf{u}$, often called "pseudo-inputs" or the "active set". The idea is to decompose the joint GP prior as
$$\mathbb{P}(\mathbf{f},\mathbf{f}_\star) = \int\mathbb{P}(\mathbf{f}_\star, \mathbf{f}, \mathbf{u})\,\mathrm{d}\mathbf{u} = \int\mathbb{P}(\mathbf{f}_\star, \mathbf{f}\mid \mathbf{u})\,\mathbb{P}(\mathbf{u})\mathrm{d}\mathbf{u}$$
with $\mathbb{P}(\mathbf{u}) = \mathcal{N}(\mathbf{0},\mathbf{K}_{\mathbf{u},\mathbf{u}})$ and impose the simplification of conditional independence $\mathbf{f}_\star\perp\mathbf{f}\mid \mathbf{u}$. This simplification means that
$$ \mathbb{P}(\mathbf{f}_\star, \mathbf{f}, \mathbf{u}) \approx \mathbb{Q}(\mathbf{f}_\star \!\mid\mathbf{u})\,\mathbb{Q}(\mathbf{f} \!\mid\mathbf{u})\,\mathbb{P}(\mathbf{u})$$
The essence of this trick is that, instead of inverting an $n\times n$ matrix, you only invert the matrix on the inducing points, $\mathbf{K}_{\mathbf{u},\mathbf{u}}$, which is much smaller ($m\times m$ with $m \ll n$).
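To see where the saving comes from, the resulting low-rank approximation $\mathbf{Q}_{\mathbf{f},\mathbf{f}} = \mathbf{K}_{\mathbf{f},\mathbf{u}}\mathbf{K}_{\mathbf{u},\mathbf{u}}^{-1}\mathbf{K}_{\mathbf{u},\mathbf{f}}$ can be inverted with the matrix inversion lemma:
$$\left(\mathbf{Q}_{\mathbf{f},\mathbf{f}}+\sigma_\mathrm{n}^2\mathbf{I}\right)^{-1} = \sigma_\mathrm{n}^{-2}\mathbf{I} - \sigma_\mathrm{n}^{-2}\mathbf{K}_{\mathbf{f},\mathbf{u}}\left(\sigma_\mathrm{n}^2\mathbf{K}_{\mathbf{u},\mathbf{u}} + \mathbf{K}_{\mathbf{u},\mathbf{f}}\mathbf{K}_{\mathbf{f},\mathbf{u}}\right)^{-1}\mathbf{K}_{\mathbf{u},\mathbf{f}}$$
so the only linear system that has to be solved is $m\times m$, and the overall cost drops from $O(n^3)$ to $O(nm^2)$.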
There are two important questions to answer in order to construct a good approximation using only $m$ points:
- How do we pick the inducing points $\mathbf{u}$?
- How do we pick the approximating distribution $\mathbb{Q}$?
Another question, perhaps equivalent to the second one, is:
- How do we approximate the covariance function $k$?
In the Candela & Rasmussen paper there's an explanation for SoR and FITC.
Subset of Regressors
The SoR method can be viewed as the following kernel approximation:
$$ {k}(\mathbf{x}, \mathbf{z}) \approx \tilde{k}(\mathbf{x}, \mathbf{z}) = k(\mathbf{x},\mathbf{u})\mathbf{K}_{\mathbf{u},\mathbf{u}}^{-1}k(\mathbf{u},\mathbf{z})$$
To get started, you can place the inducing points on a uniform grid (or use a random subset of the training inputs). Do spend time experimenting with each method to decide which one works best in your situation.
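As an illustration, here is a minimal NumPy sketch of SoR prediction under this kernel approximation. The RBF kernel, hyperparameter values, and helper names (`rbf_kernel`, `sor_predict`) are my own choices; the matrix inversion lemma is used so that only an $m\times m$ matrix is inverted:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sor_predict(X, y, X_star, X_u, noise_var=0.1, jitter=1e-6):
    """SoR prediction: the exact GP formulas with k replaced by k~(x,z) = k(x,u) K_uu^{-1} k(u,z).

    Only an m x m matrix is inverted, so the cost is O(n m^2) instead of O(n^3).
    """
    m = len(X_u)
    K_uu = rbf_kernel(X_u, X_u) + jitter * np.eye(m)  # jitter for numerical stability
    K_uf = rbf_kernel(X_u, X)                         # m x n
    K_us = rbf_kernel(X_u, X_star)                    # m x n_star
    # Sigma = (sigma_n^2 K_uu + K_uf K_fu)^{-1}: the only inversion, and it is m x m
    Sigma = np.linalg.inv(noise_var * K_uu + K_uf @ K_uf.T)
    mu = K_us.T @ Sigma @ (K_uf @ y)
    cov = noise_var * K_us.T @ Sigma @ K_us
    return mu, cov
```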
Implementation
In Python you can use GPyTorch, which also supports GP inference on the GPU. I am not yet aware of a similar package for R.
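As a rough illustration of what an inducing-point (SGPR-style) model looks like in GPyTorch, here is a sketch built around its `InducingPointKernel`; treat the exact class names and arguments as assumptions based on my reading of the documentation and double-check them against the version you install:

```python
import torch
import gpytorch

class SparseGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, inducing_points):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_kernel = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # Wraps the base kernel with the low-rank inducing-point approximation
        self.covar_module = gpytorch.kernels.InducingPointKernel(
            base_kernel, inducing_points=inducing_points, likelihood=likelihood
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# Usage sketch: pick, e.g., m = 50 inducing points from the training inputs,
# then optimize the marginal likelihood as in any GPyTorch regression example.
# likelihood = gpytorch.likelihoods.GaussianLikelihood()
# model = SparseGPModel(train_x, train_y, likelihood, train_x[:50].clone())
# model, likelihood = model.cuda(), likelihood.cuda()  # optional: move to GPU
```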