
I posted this originally on Stack Overflow but realized it might be more of a statistics question. I am using scikit-learn to run SVC on my data.

from sklearn import svm

svc = svm.SVC(kernel='linear', C=C).fit(X, y)

I want to know how I can get the distance of each data point in X from the decision boundary. Essentially, I want to create a subset of my data that only includes points within one standard deviation of the decision boundary. What is the best way to do this?

  • what do you think about this? http://stats.stackexchange.com/questions/168051/what-is-the-relationship-between-a-kernel-and-projection-to-a-higher-dimensional/168082#168082 – Aug 20 '15 at 18:07

2 Answers


I refer to my answer to the question How to calculate decision boundary from support vectors? In the case of a linear kernel, $K(x,y)= x \cdot y$, where '$\cdot$' is the inner product, so if you have $n$ features for each observation then $x \cdot y=\sum_{i=1}^n x_i y_i$.

In the answer referred to above, you can see that the equation of the boundary (the separating hyperplane) is $f(x)=\sum_{k \in SV} \alpha_k y_k s_k \cdot x + b$. To compute $b$, take one observation for which the Lagrange multiplier is strictly positive and strictly smaller than $C$. Assume this is observation $m$ and use it to compute $b$ as $b =\frac{1}{y_m} - \sum_{k \in SV} \alpha_k y_k x_m \cdot s_k$. (Your software may already compute $b$.)
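As a concrete sketch of this computation in scikit-learn: `dual_coef_` stores $\alpha_k y_k$ for the support vectors, so $\alpha_k = |\text{dual\_coef\_}_k|$ and the sign gives $y_k$. The data below comes from `make_blobs` purely for illustration; the original X and y would work the same way.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Illustrative two-class data (assumption: any linearly separable X, y works the same).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)          # SVM class labels in {-1, +1}

C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X, y)

# dual_coef_ holds alpha_k * y_k, so alpha_k is its absolute value
alpha = np.abs(svc.dual_coef_[0])

# pick a "free" support vector m with 0 < alpha_m < C
free = (alpha > 1e-8) & (alpha < C - 1e-8)
m = np.flatnonzero(free)[0]
x_m = svc.support_vectors_[m]
y_m = np.sign(svc.dual_coef_[0, m])  # class label of that support vector

# b = 1/y_m - sum_k alpha_k y_k (x_m . s_k)
b = 1.0 / y_m - svc.dual_coef_[0] @ (svc.support_vectors_ @ x_m)
```

Here scikit-learn indeed already computes $b$: it is stored (up to solver tolerance) in `svc.intercept_`.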

These are the same equations as in the answer that I referred to, with $K(x,y)=x \cdot y$.

From the equation $f(x)$ of the hyperplane, it follows that the normal vector of the separating hyperplane is $w=\sum_{k \in SV} \alpha_k y_k s_k$ (as $y_k$ and $\alpha_k$ are numbers and $s_k$ are vectors, the result is a vector). (Your software may already compute $w$.)
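In scikit-learn this sum is a single matrix product, because `dual_coef_` already holds the products $\alpha_k y_k$; for a linear kernel the library also exposes the same vector as `coef_`. The `make_blobs` data is illustrative only.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Illustrative data (assumption: the original X, y would work the same way).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# w = sum_k alpha_k y_k s_k; dual_coef_ stores alpha_k * y_k,
# so the sum collapses to one matrix product over the support vectors.
w = (svc.dual_coef_ @ svc.support_vectors_).ravel()
```

For a linear kernel, `w` should coincide with `svc.coef_`, confirming that the software does compute $w$ for you.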

(Note that, for a hyperplane with equation $n \cdot x + c = 0$ (where $n$ and $x$ are vectors and $c$ is a scalar), $n$ is the normal vector to the plane.)

If you want to compute the distance of a point $x_0$ (the feature vector of an observation) from the hyperplane, this distance $D(x_0)$ is given by $D(x_0)=\frac{|f(x_0)|}{\sqrt{w \cdot w}}$ (the vertical bars denote the absolute value).

You can compute $D(x_0)$ for every observation $x_0$ that you have: just take the absolute value of $f(x_0)$, with $f$ as above, and divide it by the norm of the normal vector.
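Putting the pieces together in scikit-learn: `decision_function` already evaluates $f(x_0)$ for every row, so the distances reduce to one division. The `make_blobs` data is a stand-in for the asker's X, y.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Illustrative data (assumption: any binary X, y works the same way).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# decision_function evaluates f(x0) = sum_k alpha_k y_k s_k . x0 + b per row
f = svc.decision_function(X)

# normal vector w and its Euclidean norm sqrt(w . w)
w = (svc.dual_coef_ @ svc.support_vectors_).ravel()

# D(x0) = |f(x0)| / sqrt(w . w) for every observation
D = np.abs(f) / np.linalg.norm(w)
```

`D` then holds one non-negative distance per observation, ready for thresholding.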

Alternatively, scikit-learn can compute the (signed) decision-function values for you directly:

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)

distance_to_decision_boundary = ovr.decision_function(X)  # shape (n_samples, n_classes)

The decision_function of OneVsRestClassifier returns, for each sample, the signed decision value with respect to each class's boundary. Note that for LinearSVC these are the raw values $w \cdot x + b$; to get geometric distances, divide each column by the norm of the corresponding row of coef_ and take the absolute value. (https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier)
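Coming back to the original goal, a subset of points close to the boundary can then be selected with a boolean mask. This sketch assumes a binary SVC and interprets "one standard deviation" as the standard deviation of the distance distribution itself (an assumption, since the question does not say which quantity's deviation is meant); the `make_blobs` data is illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Illustrative binary data (assumption: the original X, y works the same way).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# geometric distance of each observation from the separating hyperplane
dist = np.abs(svc.decision_function(X)) / np.linalg.norm(svc.coef_)

# keep only the points within one standard deviation of the distances
mask = dist <= dist.std()
X_near, y_near = X[mask], y[mask]
```

`X_near` and `y_near` are the requested subset; any other threshold (e.g. a fixed margin) can be swapped into the mask in the same way.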