
I have been reading about the Nyström method for low-rank kernel approximation. It is implemented in scikit-learn [1] as a transformer that projects data samples onto a low-rank approximation of the kernel feature mapping.

To the best of my knowledge, given a training set $\{x_i\}_{i=1}^n$ and a kernel function, it generates a low-rank approximation of the $n \times n$ kernel matrix $K$ by applying the SVD to $W$ and $C$:

$K = \begin{bmatrix} W & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix}$, $\quad C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}$, $\quad W \in \mathbb{R}^{l\times l}$

However, I do not understand how the low-rank approximation of the kernel matrix can be used to project new samples into the approximated kernel feature space. The papers I have found (e.g. [2]) are not of much help, since they are not very didactic.
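For concreteness, this is roughly how I am calling the scikit-learn transformer from [1] (a minimal sketch; the data and parameter values are just placeholders). I can fit and transform, but I do not see what transform is computing mathematically:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

rng = np.random.RandomState(0)
X_train = rng.randn(500, 10)   # n = 500 training samples (placeholder data)
X_new = rng.randn(50, 10)      # samples unseen at fit time

# Fit on the training set, then project both old and new samples.
nys = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
Z_train = nys.fit_transform(X_train)   # shape (500, 100)
Z_new = nys.transform(X_new)           # shape (50, 100)

# Z_train @ Z_train.T is meant to approximate the full 500 x 500 kernel matrix.
```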

Also, I am curious about the computational complexity of this method, in both the training and test phases.

[1] http://scikit-learn.org/stable/modules/kernel_approximation.html#nystroem-kernel-approx

[2] http://www.jmlr.org/papers/volume13/kumar12a/kumar12a.pdf

Daniel López

1 Answer


Let's derive the Nyström approximation in a way that should make the answers to your questions clearer.

The key assumption in Nyström is that the kernel function is of rank $m$. (Really we assume that it's approximately of rank $m$, but for simplicity let's just pretend it's exactly rank $m$ for now.) That means that any kernel matrix is going to have rank at most $m$, and in particular $$ K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{bmatrix} ,$$ is rank $m$. Therefore there are $m$ nonzero eigenvalues, and we can write the eigendecomposition of $K$ as $$K = U \Lambda U^T$$ with eigenvectors stored in $U$, of shape $n \times m$, and eigenvalues arranged in $\Lambda$, an $m \times m$ diagonal matrix.
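As a quick numerical illustration (a small sketch with made-up data; a linear kernel on $m$-dimensional data makes $K$ exactly rank $m$, so the eigendecomposition really does have only $m$ nonzero eigenvalues):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m = 200, 10
X = rng.randn(n, m)          # m-dimensional data
K = X @ X.T                  # linear kernel, so rank(K) <= m

print(np.linalg.matrix_rank(K))          # 10

# Only m nonzero eigenvalues: K = U diag(lam) U^T with U of shape (n, m).
lam, U = np.linalg.eigh(K)               # eigenvalues in ascending order
lam, U = lam[-m:], U[:, -m:]             # keep the m nonzero eigenpairs
print(np.allclose(K, U @ np.diag(lam) @ U.T))   # True
```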

So, let's pick $m$ elements, usually uniformly at random but possibly according to other schemes – all that matters in this simplified version is that $K_{11}$ be of full rank. Once we do, just relabel the points so that we end up with the kernel matrix in blocks: $$ K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix} ,$$ where we evaluate each entry in $K_{11}$ (which is $m \times m$) and $K_{21}$ ($(n-m) \times m$), but don't want to evaluate any entries in $K_{22}$.

Now, we can split up the eigendecomposition according to this block structure too: \begin{align} K &= U \Lambda U^T \\&= \begin{bmatrix}U_1 \\ U_2\end{bmatrix} \Lambda \begin{bmatrix}U_1 \\ U_2\end{bmatrix}^T \\&= \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix} ,\end{align} where $U_1$ is $m \times m$ and $U_2$ is $(n-m) \times m$. But note that now we have $K_{11} = U_1 \Lambda U_1^T$. So we can find $U_1$ and $\Lambda$ by eigendecomposing the known matrix $K_{11}$.

We also know that $K_{21} = U_2 \Lambda U_1^T$. Here, we know everything in this equation except $U_2$, so we can solve for it: right-multiply both sides by $(\Lambda U_1^T)^{-1} = U_1 \Lambda^{-1}$ (using that $U_1$ is orthogonal, since it comes from the eigendecomposition of $K_{11}$) to get $$ U_2 = K_{21} U_1 \Lambda^{-1} .$$ Now we have everything we need to evaluate $K_{22}$: \begin{align} K_{22} &= U_2 \Lambda U_2^T \\&= \left(K_{21} U_1 \Lambda^{-1}\right) \Lambda \left(K_{21} U_1 \Lambda^{-1}\right)^T \\&= K_{21} U_1 (\Lambda^{-1} \Lambda) \Lambda^{-1} U_1^T K_{21}^T \\&= K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T \\&= K_{21} K_{11}^{-1} K_{21}^T \tag{*} \\&= \left( K_{21} K_{11}^{-\frac12} \right) \left( K_{21} K_{11}^{-\frac12} \right)^T \tag{**} .\end{align}
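Here's a small numerical sanity check of (*) (a sketch with made-up data, again using a linear kernel so that the exact-rank-$m$ assumption holds):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m = 200, 10
X = rng.randn(n, m)
K = X @ X.T                       # exactly rank m (linear kernel)

K11 = K[:m, :m]                   # m x m block we evaluate
K21 = K[m:, :m]                   # (n-m) x m block we evaluate
K22 = K[m:, m:]                   # block we pretend not to evaluate

K22_hat = K21 @ np.linalg.inv(K11) @ K21.T   # equation (*)
print(np.allclose(K22, K22_hat))             # True
```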

In (*), we've arrived at the Nyström approximation that you might have seen simply given as the definition: it tells us the effective kernel values we're imputing for the block $K_{22}$.

In (**), we see that the feature matrix $K_{21} K_{11}^{-\frac12}$, which is shape $(n-m) \times m$, corresponds to these imputed kernel values. If we use $K_{11}^{\frac12}$ for the $m$ points, we have a set of $m$-dimensional features $$ \Phi = \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} .$$ We can just quickly verify that $\Phi$ corresponds to the correct kernel matrix: \begin{align} \Phi \Phi^T &= \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix}^T \\&=\begin{bmatrix} K_{11}^{\frac12} K_{11}^{\frac12} & K_{11}^{\frac12} K_{11}^{-\frac12} K_{21}^T \\ K_{21} K_{11}^{-\frac12} K_{11}^{\frac12} & K_{21} K_{11}^{-\frac12} K_{11}^{-\frac12} K_{21}^T \end{bmatrix} \\&=\begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{21} K_{11}^{-1} K_{21}^T \end{bmatrix} \\&= K .\end{align}
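We can check this numerically too (a sketch under the same exact-rank setup; the symmetric square roots are computed from the eigendecomposition of $K_{11}$):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m = 200, 10
X = rng.randn(n, m)
K = X @ X.T
K11, K21 = K[:m, :m], K[m:, :m]

# Symmetric square root of K11 (and its inverse square root) from its
# eigendecomposition.
lam, V = np.linalg.eigh(K11)
K11_sqrt = V @ np.diag(np.sqrt(lam)) @ V.T
K11_inv_sqrt = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T

Phi = np.vstack([K11_sqrt, K21 @ K11_inv_sqrt])   # (n, m) feature matrix
print(np.allclose(Phi @ Phi.T, K))                # True
```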

So, all we need to do is train our regular learning model with the $m$-dimensional features $\Phi$. This will be exactly the same (under the assumptions we've made) as the kernelized version of the learning problem with $K$.

Now, for an individual data point $x$, the features in $\Phi$ correspond to $$ \phi(x) = \begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix} K_{11}^{-\frac12} .$$ For a point $x$ in partition 2, the vector $\begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix}$ is just the relevant row of $K_{21}$, so stacking these rows up gives $K_{21} K_{11}^{-\frac12}$; that is, $\phi(x)$ agrees with $\Phi$ for points in partition 2. It also works in partition 1: there, the vector is a row of $K_{11}$, so stacking the rows up gives $K_{11} K_{11}^{-\frac12} = K_{11}^{\frac12}$, again agreeing with $\Phi$. The same recipe therefore works for a test point $x_\text{new}$ that was unseen at training time; you just do the same thing: $$ \Phi_\text{test} = K_{\text{test},1} K_{11}^{-\frac12} .$$ Because we assumed the kernel is rank $m$, the matrix $\begin{bmatrix}K_{\text{train}} & K_{\text{train,test}} \\ K_{\text{test,train}} & K_{\text{test}} \end{bmatrix}$ is also of rank $m$, and the reconstruction of $K_\text{test}$ is still exact by exactly the same logic as for $K_{22}$.
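A small sketch of the test-time projection (same made-up linear-kernel setup, so the cross-kernel block is recovered exactly):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m, n_new = 200, 10, 30
X = rng.randn(n, m)              # training data; linear kernel => rank m
X_new = rng.randn(n_new, m)      # test points unseen at training time

X1 = X[:m]                                           # the m basis points
lam, V = np.linalg.eigh(X1 @ X1.T)                   # eigendecompose K11
K11_inv_sqrt = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T

# Training features, as above.
Phi = np.vstack([V @ np.diag(np.sqrt(lam)) @ V.T,
                 (X[m:] @ X1.T) @ K11_inv_sqrt])

# Test features: k(x_new, basis points) times K11^{-1/2}.
Phi_new = (X_new @ X1.T) @ K11_inv_sqrt

# Inner products with the training features recover the true cross-kernel
# block exactly (under the rank-m assumption).
print(np.allclose(Phi_new @ Phi.T, X_new @ X.T))     # True
```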


Above, we assumed that the kernel matrix $K$ was exactly rank $m$. This is not usually going to be the case; for a Gaussian kernel, for example, $K$ always has rank $n$. But the trailing eigenvalues typically drop off quite quickly, so $K$ is close to a matrix of rank $m$, and our reconstructions of $K_{22}$ or $K_{\text{test}}$ will be close to the true values, but not exactly the same. They'll be better reconstructions the closer the subspace spanned by the chosen $m$ points (in feature space) is to the top rank-$m$ eigenspace of $K$ overall, which is why choosing the right $m$ points is important in practice.
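To see the approximate case, here's a sketch with a Gaussian kernel (the data, bandwidth, and landmark counts are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(500, 2)
K = rbf_kernel(X, gamma=0.2)      # full rank, but with a fast-decaying spectrum

for m in (10, 50, 200):
    idx = rng.choice(len(X), m, replace=False)    # uniform random landmarks
    K11 = rbf_kernel(X[idx], gamma=0.2)
    C = rbf_kernel(X, X[idx], gamma=0.2)
    K_hat = C @ np.linalg.pinv(K11) @ C.T         # Nystroem reconstruction
    rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
    print(m, rel_err)     # relative error typically shrinks as m grows
```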

Note also that if $K_{11}$ has any zero eigenvalues, you can replace inverses with pseudoinverses and everything still works; you just replace $K_{21}$ in the reconstruction with $K_{21} K_{11}^\dagger K_{11}$.

You can use the SVD instead of the eigendecomposition if you'd like; since $K_{11}$ is psd, the two coincide, but the SVD can be a little more robust to slight numerical error in the kernel matrix. That's what scikit-learn's implementation does, though it uses $\max(\lambda_i, 10^{-12})$ in the inverse instead of the pseudoinverse.
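Here's a sketch of that SVD-based computation, mirroring the expressions quoted in the comments below (the kernel, data, and bandwidth are placeholders; this is not scikit-learn's code verbatim):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
basis = X[:50]                               # the m landmark points

K11 = rbf_kernel(basis, gamma=0.5)
U, S, V = np.linalg.svd(K11)                 # note: this V is really V^T
S = np.maximum(S, 1e-12)                     # floor tiny/negative singular values
# Broadcasting: U / sqrt(S) scales the columns of U, i.e. U @ diag(1/sqrt(S)),
# so this is U Sigma^{-1/2} V^T, playing the role of K11^{-1/2}.
normalization = np.dot(U / np.sqrt(S), V)

# Embedding of any points: k(x, basis) times the normalization (transposed).
Phi = rbf_kernel(X, basis, gamma=0.5) @ normalization.T

# Sanity check: the basis points' features reproduce K11 (up to numerical error).
print(np.allclose(Phi[:50] @ Phi[:50].T, K11, atol=1e-6))   # True
```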

Danica
  • Thank you for your great answer; it was of great help. However, I still do not fully understand how the computation of $K_{11}^{-1/2}$ is carried out in scikit-learn. Can you provide some reference for how this is achieved through the SVD? – Daniel López Feb 12 '17 at 11:44
  • When $A$ is positive semidefinite, the eigendecomposition $U \Lambda U^T$ coincides with the SVD. Because $A$ might be slightly non-psd due to numerical error, scikit-learn instead computes $U \Sigma V^T$ and uses $A^{-\frac12} = V \Sigma^{-\frac12} V^T$, so that $A$'s features become $A V \Sigma^{-\frac12} V^T = U \Sigma V^T V \Sigma^{-\frac12} V^T = U \Sigma^{\frac12} V^T = A^{\frac12}$. It's the same thing, basically. – Danica Feb 12 '17 at 12:07
  • Alright, so $K_{11}^{-1/2} = V \Sigma^{-1/2} V^T$, but the computation performed by scikit-learn is not that: `self.normalization_ = np.dot(U * 1. / np.sqrt(S), V)`. Is `normalization_` equal to $K_{11}^{1/2}$ or $K_{11}^{-1/2}$? I assume it must be $K_{11}^{-1/2}$, because the `normalization_` matrix is later used to project test samples as `np.dot(embedded, self.normalization_.T)` – Daniel López Feb 12 '17 at 12:39
  • Whoops, sorry, yeah they use $U \Sigma^{-\frac12} V^T = K_{11}^{-\frac12}$. It doesn't really matter much since $U \approx V$, but since they do the transpose, the features for $K_{11}$ end up as $U\Sigma V^T V \Sigma^{-\frac12} U^T = U \Sigma^{\frac12} U^T$. – Danica Feb 12 '17 at 12:43
  • Last question: I do not see how `S = np.maximum(S, 1e-12)`, `normalization_ = np.dot(U * 1. / np.sqrt(S), V)` equals $U\Sigma^{-1/2} V^T$. Specifically, how is $\Sigma^{-1/2}$ computed? I don't see the $-1$. – Daniel López Feb 12 '17 at 12:49
  • Raising a diagonal matrix to a power is the same as raising each element to a power, and $x^{-\frac12} = 1 / \sqrt x$. In numpy broadcasting notation, elementwise multiplication by a vector is the same as right-multiplying by a diagonal matrix. Also, that code uses $V$ to mean what I was calling $V^T$. – Danica Feb 12 '17 at 12:54
  • @Dougal thanks much for a great answer. I think I spotted a typo: in the description of the blocks, $K_{21}$ is of dimension $(n-m) \times m$, not $m \times (n-m)$ (to be consistent with the row $\times$ col notation used elsewhere in the post) – Antoine Oct 09 '17 at 10:45
  • Thanks for your explanation @Dougal! Something still bothers me though: how did you compute the formula $\phi(x) = [ k(x, x_1) \dots k(x, x_n)] K_{11}^{-\frac{1}{2}}$? – Dragos Mar 27 '18 at 21:08
  • So doing that for each point in partition 2 and stacking it up, it's $K_{21} K_{11}^{-\frac12}$, which agrees with the component of $\Phi$ for those points. For points in partition 1, it's $K_{11} K_{11}^{-\frac12} = K_{11}^{\frac12}$ which also agrees with the relevant part of the matrix $\Phi$. – Danica Mar 27 '18 at 21:31
  • Probably I didn't explain my problem properly...It seems that you're multiplying a vector of size $n$: $[k(x,x_1) \dots k(x, x_n)]$ with the matrix $K_{11}^{-\frac{1}{2}}$ of size $m \times m$. Am I missing something? – Dragos Mar 27 '18 at 22:50
  • Whoops, sorry, that should only be up to $x_m$ (in the re-labeled ordering, so that those are the Nyström basis points). Will fix. – Danica Mar 27 '18 at 22:51
  • That makes more sense. Still, something I fail to understand is the fact that your data point $x$ should have $\dim(x) = n$. But in your case the mapping $\phi$ works with $m$ inputs. Does this mean that for $k(x, x_i)$ the element $x_i$ has dimension $n$ and each $x_i$ is actually a column from the first $m$ columns of the kernel matrix? – Dragos Mar 27 '18 at 23:10
  • $x$ is a data point; its dimension isn't specified here. $x$ might be in $\mathbb R^d$, or it might be a string or something; just say that $x \in \mathcal X$, so that $k : \mathcal X \times \mathcal X \to \mathbb R$. Then $\phi : \mathcal X \to \mathbb R^m$ just stacks up $k(x, x_i)$ for $m$ different inputs. – Danica Mar 27 '18 at 23:13