
I am studying kernel ridge regression and, please bear with me, there are a couple of points that have left me quite confused.

Here is the equation for Kernel ridge regression: $f^{ridge}(x) = \phi(x)^T\beta^{ridge} = \phi(x)^TX^T(XX^T+\lambda I)^{-1}y $

The text then defines the kernel matrix:

$K_{ij} = k(x_i, x_j) := \phi(x_i)^T\phi(x_j)$

And a definition for a $\kappa$ vector:

$\kappa(x)^T = \phi(x)^T X^T = k(x, x_{1:n})$

The regression equation is then: $f^{ridge}(x) = \kappa(x)^T(K + \lambda I)^{-1}y$

It seems to be useful because:

→ we never actually need to compute the parameters $\beta$

→ we never actually need to compute the features $\phi(x_i)$

→ we only need to be able to compute $k(x, x')$ for any $x, x'$ (for concreteness, see the sketch right after this list)
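
To make sure I am reading that last formula correctly, here is a small numpy sketch of $f^{ridge}(x) = \kappa(x)^T(K + \lambda I)^{-1}y$ (the names `kernel_ridge_predict` and the toy quadratic feature map are just mine for illustration). Note that my kernel below still builds the features $\phi$ explicitly, which is exactly the part I am unsure about:

```python
import numpy as np

def phi(x):
    # Toy explicit feature map: the input itself plus all pairwise products.
    return np.concatenate([x, np.outer(x, x).ravel()])

def k(x, y):
    # Kernel computed the "naive" way, by actually forming the features.
    return phi(x) @ phi(y)

def kernel_ridge_predict(X_train, y_train, X_test, k, lam=1.0):
    """f(x) = kappa(x)^T (K + lam*I)^{-1} y, using only kernel evaluations."""
    n = len(X_train)
    K = np.array([[k(a, b) for b in X_train] for a in X_train])    # K_ij = k(x_i, x_j)
    alpha = np.linalg.solve(K + lam * np.eye(n), y_train)          # dual weights
    kappa = np.array([[k(x, a) for a in X_train] for x in X_test]) # kappa(x)_i = k(x, x_i)
    return kappa @ alpha

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
y_train = X_train[:, 0] ** 2 + 0.1 * rng.normal(size=30)
X_test = rng.normal(size=(3, 2))
print(kernel_ridge_predict(X_train, y_train, X_test, k, lam=0.1))
```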

I don't really understand why we don't have to compute the features $\phi(x_i)$: as far as I can see, we do have to compute the features for every training point, since they enter both the kernel matrix $K$ and the vector $\kappa(x)$. Is there a way to compute the dot product without computing the features? Or what is this statement referring to?

user412953

1 Answer


Is there a way to compute the dot product without computing the features?

Yes, exactly; this is the point of the "kernel trick." For example, the very popular exponentiated quadratic (also called Gaussian) kernel is $$ k(x, y) = \exp\left( - \frac12 \lVert x - y \rVert^2 \right) ,$$ which can be computed straightforwardly from the squared distance $\lVert x - y \rVert^2$ alone, while the corresponding features $\varphi$ are rather complicated and in fact infinite-dimensional, so we never form them in practice.
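
To make this concrete, here is a small numpy sketch (the helper `phi` and the toy quadratic kernel are just for illustration). The Gaussian kernel is evaluated directly from the squared distance, with no feature vector in sight; for a kernel whose feature map happens to be finite, you can check numerically that $k(x, y) = \varphi(x)^T \varphi(y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# Gaussian / exponentiated quadratic kernel: evaluated directly from the
# squared distance -- the (infinite-dimensional) feature map never appears.
k_rbf = np.exp(-0.5 * np.sum((x - y) ** 2))
print(k_rbf)

# For intuition, a kernel whose feature map *is* finite: the homogeneous
# quadratic kernel k(x, y) = (x^T y)^2, with phi(x) = all products x_i * x_j.
def phi(v):
    return np.outer(v, v).ravel()

k_via_kernel   = (x @ y) ** 2        # one inner product in d dimensions
k_via_features = phi(x) @ phi(y)     # explicit features, d^2 dimensions
print(np.isclose(k_via_kernel, k_via_features))  # True
```

The two quantities agree, but only the left-hand computation scales to feature maps that are huge or infinite-dimensional, which is exactly why the kernel form of ridge regression never needs $\varphi$ explicitly.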

Danica