I am studying kernel ridge regression, so please bear with me; there are a couple of points that have left me quite confused.
Here is the equation for kernel ridge regression, where $X$ is the matrix whose rows are the feature vectors $\phi(x_i)^T$: $f^{ridge}(x) = \phi(x)^T\beta^{ridge} = \phi(x)^TX^T(XX^T+\lambda I)^{-1}y$
The kernel matrix is then defined as:
$K_{ij} = k(x_i, x_j) := \phi(x_i)^T\phi(x_j)$
And the vector $\kappa(x)$ is defined as:
$\kappa(x)^T = \phi(x)^TX^T = k(x, x_{1:n})$
The regression equation is then: $f^{ridge}(x) = \kappa(x)^T(K + \lambda I)^{-1}y$
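Just to check that I am reading the formula correctly, here is a small NumPy sketch of how I would compute $f^{ridge}(x)$ in this kernel form; the Gaussian kernel and the names `fit`/`predict` are my own choice for illustration, not from the lecture notes:

```python
import numpy as np

# A minimal sketch of the kernel-form predictor f(x) = kappa(x)^T (K + lambda*I)^{-1} y.
# The Gaussian kernel and all names here are just my own choice for illustration.

def k(x, xp, gamma=1.0):
    """Kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def fit(X, y, lam):
    """Build K from the n training points (rows of X) and solve (K + lam*I) alpha = y."""
    n = X.shape[0]
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict(x, X, alpha):
    """f(x) = kappa(x)^T alpha, with kappa(x)_i = k(x, x_i)."""
    kappa = np.array([k(x, xi) for xi in X])
    return kappa @ alpha

# Usage on random data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = rng.normal(size=20)
alpha = fit(X_train, y_train, lam=0.1)
print(predict(X_train[0], X_train, alpha))
```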
It seems to be useful because:
→ at no point do we actually need to compute the parameters $\beta$
→ at no point do we actually need to compute the features $\phi(x_i)$
→ we only need to be able to compute $k(x, x')$ for any $x, x'$
I don't really understand why we don't have to compute the features $\phi(x_i)$. As far as I can see, we do have to compute the features for all training points, since they enter both the kernel matrix $K$ and the vector $\kappa(x)$. Is there a way to compute the dot products without computing the features, or what is this claim referring to?
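To make concrete what I mean, this is how I currently picture filling $K$ and $\kappa(x)$, namely by first computing every $\phi(x_i)$ explicitly; the quadratic feature map and all names below are just an example I made up, not from the notes:

```python
import numpy as np

# How I currently picture the computation: build phi(x_i) explicitly for every
# training point, then take dot products to fill K and kappa(x). The quadratic
# feature map for 2-dimensional inputs is just an example I made up.

def phi(x):
    """Explicit feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

X = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]])    # n = 3 training points
Phi = np.array([phi(xi) for xi in X])                 # features of every training point

K = Phi @ Phi.T                                       # K_ij = phi(x_i)^T phi(x_j)

x_new = np.array([2.0, 1.0])
kappa = Phi @ phi(x_new)                              # kappa(x)_i = phi(x_i)^T phi(x)
```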