Questions tagged [kernel-trick]

Kernel methods are used in machine learning to generalize linear techniques to nonlinear situations, most notably in SVMs, PCA, and GPs. Not to be confused with [kernel-smoothing], which covers kernel density estimation (KDE) and kernel regression.

In machine learning, the kernel trick is a widely applied method to generalize linear techniques to non-linear cases. The most widely used applications include support vector machines (for classification, regression, and anomaly detection), Gaussian processes (for classification and regression), and principal components analysis (for dimensionality reduction). Such uses are also known as kernel methods.

A kernel is a function $k : \mathcal X \times \mathcal X \to \mathbb R$ that can be thought of roughly as a similarity function on the domain $\mathcal X$. Kernel functions exist for many domains, including $\mathbb R^n$ (in which case they can allow more complicated nonlinear relationships) as well as sets, graphs, strings, probability distributions, and other complicated objects. César R. Souza has cataloged many common kernel functions in Kernel Functions for Machine Learning Applications.
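For concreteness, here is a minimal sketch in Python/NumPy of a few such kernels on $\mathbb R^n$ evaluated as similarity functions; the function names and parameter defaults are illustrative choices, not taken from any particular library:

```python
# Illustrative kernel functions on R^n; names and defaults are ad hoc.
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = x^T y: the ordinary inner product.
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3, c=1.0):
    # k(x, y) = (x^T y + c)^degree: the implicit feature map contains
    # all monomials in the coordinates of x up to the given degree.
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): close to 1 for nearby
    # points and close to 0 for distant ones.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([1.5, 1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```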

The kernel trick works because if $k$ is a positive semidefinite function, then there is a corresponding Hilbert space $\mathcal H$, known as the reproducing kernel Hilbert space (RKHS) of $k$, and a "feature map" $\varphi : \mathcal X \to \mathcal H$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$. Thus, if an algorithm accesses the data only in the form of inner products $x^T y$, it can be "kernelized" by simply replacing those inner products with $k(x, y)$, in which case it corresponds to performing the algorithm in the Hilbert space $\mathcal H$. For many common kernels, $\mathcal H$ is very high- or even infinite-dimensional, so that actually representing the data in that space would be impossible, but by using pairwise kernel evaluations the algorithm can still be run.
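As a small illustration of this replacement, the sketch below writes ridge regression in its dual form so that the data appear only through pairwise kernel evaluations $k(x_i, x_j)$, never through explicit features $\varphi(x)$. It assumes an RBF kernel, and the helper names and parameter values are illustrative rather than a reference implementation:

```python
# Minimal sketch of kernel ridge regression: the data enter only via
# the Gram matrix K[i, j] = k(x_i, x_j). Names and parameters are ad hoc.
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    # K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2)) for all pairs.
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma**2))

def fit_kernel_ridge(X, y, lam=1e-2, sigma=1.0):
    # Dual solution alpha = (K + lam * I)^{-1} y: one weight per training
    # point, regardless of the (possibly infinite) feature-space dimension.
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x): again only kernel evaluations.
    return rbf_gram(X_new, X_train, sigma) @ alpha

# Toy usage on a nonlinear target that a purely linear fit cannot capture.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y)
print(predict_kernel_ridge(X, alpha, np.array([[0.5]])))  # roughly sin(0.5)
```

With the linear kernel $k(x, y) = x^\top y$ the same code reduces to ordinary ridge regression; swapping in the RBF kernel fits a nonlinear function without ever constructing $\varphi(x)$ explicitly.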

For large datasets, pairwise evaluations can be too computationally expensive to be practical. In these cases approximations such as the Nyström method (which approximates the kernel function based on kernel evaluations to landmark points) or approximate embeddings (which give a function $z : \mathcal X \to \mathbb R^D$ such that $z(x)^T z(y) \approx k(x, y)$) can be used.
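Random Fourier features (Rahimi & Recht, 2007) are one such approximate embedding for the Gaussian (RBF) kernel. The sketch below is only a minimal illustration; the embedding dimension D and the names are assumptions made for the example:

```python
# Minimal sketch of random Fourier features: an explicit map z with
# z(x)^T z(y) approximately exp(-||x - y||^2 / (2 sigma^2)).
import numpy as np

def random_fourier_features(X, D=2000, sigma=1.0, seed=0):
    # Draw W with i.i.d. N(0, 1/sigma^2) entries and b ~ Uniform(0, 2*pi),
    # then map x -> sqrt(2/D) * cos(x W + b), a D-dimensional vector.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

x = np.array([[1.0, 2.0]])
y = np.array([[1.5, 1.0]])
Z = random_fourier_features(np.vstack([x, y]))
approx = Z[0] @ Z[1]                              # z(x)^T z(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)       # RBF kernel, sigma = 1
print(approx, exact)  # the two numbers should be close
```

Fitting a linear model on $z(x)$ then approximates the corresponding kernel method at a cost that grows with $D$ rather than with the number of pairwise kernel evaluations.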


Note that the word "kernel" is also used to refer to the local similarity functions of kernel smoothing techniques like kernel density estimation and Nadaraya-Watson kernel regression. See [kernel-smoothing] for this usage.

690 questions
166
votes
5 answers

How to intuitively explain what a kernel is?

Many machine learning classifiers (e.g. support vector machines) allow one to specify a kernel. What would be an intuitive way of explaining what a kernel is? One aspect I have been thinking of is the distinction between linear and non-linear…
hashkey
  • 1,661
  • 3
  • 9
  • 3
107
votes
4 answers

How to select kernel for SVM?

When using SVM, we need to select a kernel. I wonder how to select a kernel. Any criteria on kernel selection?
xiaohan2012
  • 6,819
  • 5
  • 18
  • 18
80
votes
2 answers

What is a "kernel" in plain English?

There are several distinct usages: kernel density estimation, the kernel trick, and kernel smoothing. Please explain what the "kernel" in them means, in plain English, in your own words.
Neil McGuigan
  • 9,292
  • 13
  • 54
  • 62
74
votes
4 answers

What makes the Gaussian kernel so magical for PCA, and also in general?

I was reading about kernel PCA (1, 2, 3) with Gaussian and polynomial kernels. How does the Gaussian kernel separate seemingly any sort of nonlinear data exceptionally well? Please give an intuitive analysis, as well as a mathematically involved…
Simon Kuang
  • 2,051
  • 3
  • 17
  • 18
53
votes
2 answers

Linear kernel and non-linear kernel for support vector machine?

When using a support vector machine, are there any guidelines on choosing a linear kernel vs. a nonlinear kernel, like RBF? I once heard that non-linear kernels tend not to perform well once the number of features is large. Are there any references on…
user3269
  • 4,622
  • 8
  • 43
  • 53
41
votes
4 answers

How can SVM 'find' an infinite feature space where linear separation is always possible?

What is the intuition behind the fact that an SVM with a Gaussian Kernel has infinite dimensional feature space?
user36162
  • 551
  • 1
  • 5
  • 4
40
votes
3 answers

What is the rationale of the Matérn covariance function?

The Matérn covariance function is commonly used as a kernel function in Gaussian processes. It is defined like this $$C_{\nu}(d)=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\Bigg(\sqrt{2\nu}\,\frac{d}{\rho}\Bigg)^{\nu…
39
votes
2 answers

Which search range for determining SVM optimal C and gamma parameters?

I am using SVM for classification and I am trying to determine the optimal parameters for linear and RBF kernels. For the linear kernel I use cross-validated parameter selection to determine C and for the RBF kernel I use grid search to determine C…
Kywia
  • 391
  • 1
  • 3
  • 3
38
votes
3 answers

How to prove that the radial basis function is a kernel?

How to prove that the radial basis function $k(x, y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right)$ is a kernel? As far as I understand, in order to prove this we have to prove either of the following: For any set of vectors $x_1, x_2, ..., x_n$ matrix…
Leo
  • 2,484
  • 3
  • 22
  • 29
38
votes
4 answers

Is there any supervised-learning problem that (deep) neural networks obviously couldn't outperform any other methods?

I have seen people put a lot of effort into SVMs and kernels, and they look pretty interesting as a starter in machine learning. But if we expect that we could almost always find an outperforming solution in terms of (deep) neural networks, what is…
Robin
  • 585
  • 1
  • 6
  • 9
37
votes
3 answers

Difference between a SVM and a perceptron

I am a bit confused with the difference between an SVM and a perceptron. Let me try to summarize my understanding here, and please feel free to correct where I am wrong and fill in what I have missed. The Perceptron does not try to optimize the…
CuriousMind
  • 2,133
  • 5
  • 24
  • 32
35
votes
1 answer

What are the advantages of kernel PCA over standard PCA?

I want to implement an algorithm in a paper which uses kernel SVD to decompose a data matrix. So I have been reading materials about kernel methods and kernel PCA etc. But it is still very obscure to me, especially when it comes to mathematical…
CyberPlayerOne
  • 2,009
  • 3
  • 22
  • 30
35
votes
4 answers

Feature map for the Gaussian kernel

In SVM, the Gaussian kernel is defined as: $$K(x,y)=\exp\left({-\frac{\|x-y\|_2^2}{2\sigma^2}}\right)=\phi(x)^T\phi(y)$$ where $x, y\in \mathbb{R}^n$. I do not know the explicit equation of $\phi$. I want to know it. I also want to know…
Vivian
  • 715
  • 2
  • 7
  • 12
32
votes
4 answers

The difference of kernels in SVM?

Can someone please tell me the difference between the kernels in SVM: Linear, Polynomial, Gaussian (RBF), Sigmoid? Because, as we know, a kernel is used to map our input space into a high-dimensional feature space. And in that feature…
user3378327
  • 951
  • 2
  • 8
  • 11
29
votes
3 answers

Is Kernel PCA with linear kernel equivalent to standard PCA?

If in kernel PCA I choose a linear kernel $K(\mathbf{x},\mathbf{y}) = \mathbf x^\top \mathbf y$, is the result going to be different from the ordinary linear PCA? Are the solutions fundamentally different or does some well defined relation exist?
tgoossens
  • 549
  • 1
  • 4
  • 8