As noted in the question and comments, a dataset of $m$ points $(\boldsymbol{x}_i,y_i)$ with $\boldsymbol{x}_i\in\mathbb{R}^n$ and $y_i\in\{-1,+1\}$ is linearly separable if we can find a normal vector $\boldsymbol{a}$ and scalar bias $b$ such that the linear inequality
$$y_i(\boldsymbol{a}^T\boldsymbol{x}_i+b)\geq 1$$
is satisfied for all $i=1,\ldots,m$. Physically this says that for each point $\boldsymbol{x}_i$, the signed distance $d_i$ to the separating plane has sign $y_i$ and magnitude at least $\|\boldsymbol{a}\|$.
This can be written also as
$$\boldsymbol{w}^T\boldsymbol{z}_i \geq 1$$
where
$$\boldsymbol{z}_i = \begin{bmatrix}\boldsymbol{x}_i \\ 1\end{bmatrix} y_i
\text{ , } \boldsymbol{w}=\begin{bmatrix}\boldsymbol{a} \\ b\end{bmatrix}$$
As noted in the question, this is essentially a linear program with a "0" objective function.
The proposed "degree of separability" measure $S_\min$ can be expressed as
$$ \min_{\boldsymbol{\sigma},\boldsymbol{w}} S[\boldsymbol{\sigma}] $$
where $\boldsymbol{\sigma}\in\{0,1\}^m$, the function $S$ is defined by
$$ S[\boldsymbol{\sigma}] = \sum_i\sigma_i = \boldsymbol{1}^T\boldsymbol{\sigma} $$
and the minimization is subject to the constraint
$$ \sigma_i(\boldsymbol{w}^T\boldsymbol{z}_i - 1) \geq 0 \text{ , } i=1,\ldots,m$$
This is a (slightly nonlinear) "mixed integer programming" problem, a problem class which is NP hard (even in the linear & binary case).
Now a more feasible alternative could be to solve a "soft-margin" classification problem using a standard approach like SVM*. The tolerance and/or lagrange multiplier variables could then be used to quantify "degree of separability".
(*For $\lambda\approx 0$ the SVM will essentially reduce to a "softened" version of the linear program above. However to get meaningful distances, you will have to normalize tolerances by $\|\boldsymbol{a}\|$.)
Update: Using something like soft-margin SVM can give you an idea of which points are "hard" to separate, but it would not necessarily address your question of "how many points must be removed to allow a hard-margin split?"
However, I think that if you pre-process the $\boldsymbol{X}$ matrix by whitening, then it should be possible to get some more consistent results. (In terms of numerical offset-scales and directional clustering of the problem points.)