Questions tagged [kernel-trick]

Kernel methods are used in machine learning to generalize linear techniques to nonlinear situations, most notably in SVMs, PCA, and GPs. Not to be confused with [kernel-smoothing], which covers kernel density estimation (KDE) and kernel regression.

In machine learning, the kernel trick is a widely applied method to generalize linear techniques to non-linear cases. The most widely used applications include support vector machines (for classification, regression, and anomaly detection), Gaussian processes (for classification and regression), and principal components analysis (for dimensionality reduction). Such uses are also known as kernel methods.

A kernel is a function $k : \mathcal X \times \mathcal X \to \mathbb R$ that can be thought of roughly as a similarity function on the domain $\mathcal X$. Kernel functions exist for many domains, including $\mathbb R^n$ (in which case they can allow more complicated nonlinear relationships) as well as sets, graphs, strings, probability distributions, and other complicated objects. César R. Souza has cataloged many common kernel functions in Kernel Functions for Machine Learning Applications.
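For concreteness, here is a minimal sketch in Python/NumPy of a few such kernels on $\mathbb R^n$ evaluated as similarity functions; the function names and parameter defaults are illustrative choices, not taken from any particular library:

```python
# Illustrative kernel functions on R^n; names and defaults are ad hoc.
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = x^T y: the ordinary inner product.
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3, c=1.0):
    # k(x, y) = (x^T y + c)^degree: the implicit feature map contains
    # all monomials in the coordinates of x up to the given degree.
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): close to 1 for nearby
    # points and close to 0 for distant ones.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([1.5, 1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```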

The kernel trick works because if $k$ is a positive semidefinite function, then there is a corresponding Hilbert space $\mathcal H$, known as the reproducing kernel Hilbert space (RKHS) of $k$, and a "feature map" $\varphi : \mathcal X \to \mathcal H$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$. Thus, if an algorithm accesses the data only in the form of inner products $x^T y$, it can be "kernelized" by simply replacing those inner products with $k(x, y)$, in which case it corresponds to performing the algorithm in the Hilbert space $\mathcal H$. For many common kernels, $\mathcal H$ is very high- or even infinite-dimensional, so that actually representing the data in that space would be impossible, but by using pairwise kernel evaluations the algorithm can still be run.
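As a small illustration of this replacement, the sketch below writes ridge regression in its dual form so that the data appear only through pairwise kernel evaluations $k(x_i, x_j)$, never through explicit features $\varphi(x)$. It assumes an RBF kernel, and the helper names and parameter values are illustrative rather than a reference implementation:

```python
# Minimal sketch of kernel ridge regression: the data enter only via
# the Gram matrix K[i, j] = k(x_i, x_j). Names and parameters are ad hoc.
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    # K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2)) for all pairs.
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma**2))

def fit_kernel_ridge(X, y, lam=1e-2, sigma=1.0):
    # Dual solution alpha = (K + lam * I)^{-1} y: one weight per training
    # point, regardless of the (possibly infinite) feature-space dimension.
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x): again only kernel evaluations.
    return rbf_gram(X_new, X_train, sigma) @ alpha

# Toy usage on a nonlinear target that a purely linear fit cannot capture.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y)
print(predict_kernel_ridge(X, alpha, np.array([[0.5]])))  # roughly sin(0.5)
```

With the linear kernel $k(x, y) = x^\top y$ the same code reduces to ordinary ridge regression; swapping in the RBF kernel fits a nonlinear function without ever constructing $\varphi(x)$ explicitly.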

For large datasets, pairwise evaluations can be too computationally expensive to be practical. In these cases approximations such as the Nyström method (which approximates the kernel function based on kernel evaluations to landmark points) or approximate embeddings (which give a function $z : \mathcal X \to \mathbb R^D$ such that $z(x)^T z(y) \approx k(x, y)$) can be used.
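Random Fourier features (Rahimi & Recht, 2007) are one such approximate embedding for the Gaussian (RBF) kernel. The sketch below is only a minimal illustration; the embedding dimension D and the names are assumptions made for the example:

```python
# Minimal sketch of random Fourier features: an explicit map z with
# z(x)^T z(y) approximately exp(-||x - y||^2 / (2 sigma^2)).
import numpy as np

def random_fourier_features(X, D=2000, sigma=1.0, seed=0):
    # Draw W with i.i.d. N(0, 1/sigma^2) entries and b ~ Uniform(0, 2*pi),
    # then map x -> sqrt(2/D) * cos(x W + b), a D-dimensional vector.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

x = np.array([[1.0, 2.0]])
y = np.array([[1.5, 1.0]])
Z = random_fourier_features(np.vstack([x, y]))
approx = Z[0] @ Z[1]                              # z(x)^T z(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)       # RBF kernel, sigma = 1
print(approx, exact)  # the two numbers should be close
```

Fitting a linear model on $z(x)$ then approximates the corresponding kernel method at a cost that grows with $D$ rather than with the number of pairwise kernel evaluations.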


Note that the word "kernel" is also used to refer to the local similarity functions of kernel smoothing techniques like kernel density estimation and Nadaraya-Watson kernel regression. See [kernel-smoothing] for this usage.

690 questions
166
votes
5 answers

How to intuitively explain what a kernel is?

Many machine learning classifiers (e.g. support vector machines) allow one to specify a kernel. What would be an intuitive way of explaining what a kernel is? One aspect I have been thinking of is the distinction between linear and non-linear…
hashkey
  • 1,661
  • 3
  • 9
  • 3
107
votes
4 answers

How to select kernel for SVM?

When using SVM, we need to select a kernel. I wonder how to select a kernel. Any criteria on kernel selection?
xiaohan2012
  • 6,819
  • 5
  • 18
  • 18
80
votes
2 answers

What is a "kernel" in plain English?

There are several distinct usages: kernel density estimation, the kernel trick, and kernel smoothing. Please explain what the "kernel" in them means, in plain English, in your own words.
Neil McGuigan
  • 9,292
  • 13
  • 54
  • 62
74
votes
4 answers

What makes the Gaussian kernel so magical for PCA, and also in general?

I was reading about kernel PCA (1, 2, 3) with Gaussian and polynomial kernels. How does the Gaussian kernel separate seemingly any sort of nonlinear data exceptionally well? Please give an intuitive analysis, as well as a mathematically involved…
Simon Kuang
  • 2,051
  • 3
  • 17
  • 18
53
votes
2 answers

Linear kernel and non-linear kernel for support vector machine?

When using a support vector machine, are there any guidelines on choosing a linear kernel vs. a nonlinear kernel, like RBF? I once heard that non-linear kernels tend not to perform well once the number of features is large. Are there any references on…
user3269
  • 4,622
  • 8
  • 43
  • 53
41
votes
4 answers

How can SVM 'find' an infinite feature space where linear separation is always possible?

What is the intuition behind the fact that an SVM with a Gaussian Kernel has infinite dimensional feature space?
user36162
  • 551
  • 1
  • 5
  • 4
40
votes
3 answers

What is the rationale of the Matérn covariance function?

The Matérn covariance function is commonly used as a kernel function in Gaussian processes. It is defined like this $$C_{\nu}(d)=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\Bigg(\sqrt{2\nu}\,\frac{d}{\rho}\Bigg)^{\nu…
39
votes
2 answers

Which search range for determining SVM optimal C and gamma parameters?

I am using SVM for classification and I am trying to determine the optimal parameters for linear and RBF kernels. For the linear kernel I use cross-validated parameter selection to determine C and for the RBF kernel I use grid search to determine C…
Kywia
  • 391
  • 1
  • 3
  • 3
38
votes
3 answers

How to prove that the radial basis function is a kernel?

How to prove that the radial basis function $k(x, y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right)$ is a kernel? As far as I understand, in order to prove this we have to prove either of the following: For any set of vectors $x_1, x_2, ..., x_n$ matrix…
Leo
  • 2,484
  • 3
  • 22
  • 29
38
votes
4 answers

Is there any supervised-learning problem that (deep) neural networks obviously couldn't outperform any other methods?

I have seen people put a lot of effort into SVMs and kernels, and they look pretty interesting as a starter in machine learning. But if we expect that we could almost always find an outperforming solution in terms of (deep) neural networks, what is…
Robin
  • 585
  • 1
  • 6
  • 9
37
votes
3 answers

Difference between a SVM and a perceptron

I am a bit confused with the difference between an SVM and a perceptron. Let me try to summarize my understanding here, and please feel free to correct where I am wrong and fill in what I have missed. The Perceptron does not try to optimize the…
CuriousMind
  • 2,133
  • 5
  • 24
  • 32
35
votes
1 answer

What are the advantages of kernel PCA over standard PCA?

I want to implement an algorithm in a paper which uses kernel SVD to decompose a data matrix. So I have been reading materials about kernel methods and kernel PCA etc. But it is still very obscure to me, especially when it comes to mathematical…
CyberPlayerOne
  • 2,009
  • 3
  • 22
  • 30
35
votes
4 answers

Feature map for the Gaussian kernel

In SVM, the Gaussian kernel is defined as: $$K(x,y)=\exp\left({-\frac{\|x-y\|_2^2}{2\sigma^2}}\right)=\phi(x)^T\phi(y)$$ where $x, y\in \mathbb{R}^n$. I do not know the explicit equation of $\phi$. I want to know it. I also want to know…
Vivian
  • 715
  • 2
  • 7
  • 12
32
votes
4 answers

The difference of kernels in SVM?

Can someone please tell me the difference between the kernels in SVM: Linear, Polynomial, Gaussian (RBF), Sigmoid? Because, as we know, a kernel is used to map our input space into a high-dimensional feature space. And in that feature…
user3378327
  • 951
  • 2
  • 8
  • 11
29
votes
3 answers

Is Kernel PCA with linear kernel equivalent to standard PCA?

If in kernel PCA I choose a linear kernel $K(\mathbf{x},\mathbf{y}) = \mathbf x^\top \mathbf y$, is the result going to be different from the ordinary linear PCA? Are the solutions fundamentally different or does some well defined relation exist?
tgoossens
  • 549
  • 1
  • 4
  • 8