
I have a dataset that contains 20 predictor variables (both categorical and numeric) and one target variable (2 classes - Good and Bad). But, there are only 23 observations in the dataset. While I wait to receive significantly more observations, what tests / models can I perform on the available dataset to understand the variance between the good and bad cases, and to understand the variance within the cases classified as 'good'?

Ideally, for the data to make sense, I would want the variance within the good cases to be low, and the variance between the good and bad cases to be high.

Would multivariate analysis of variance (MANOVA) work in this case?

Minu

1 Answer


Be careful how you state differences in within-group and between-group variance, since these are ANOVA terms; variance explained by predictor variables is a regression concept.

What if your classes were not linearly separable, and the $X,Y$ scatter plot looked like the image below, where yellow objects were bad and blue were good, and the red and white regions were the ground truth for bad and good? How would variance be approached in that setting? A key issue is that your question is really a 2-class classification problem, and as a classification problem, the solution space can take on a chameleon-like behavior.

For your continuous predictors, start by performing univariate regressions with a new artificial $y$-variable, set to $y_i=-1$ for bad cases and $y_i=+1$ for good cases. Regress $y$ on each univariate $x$. For categorical variables with $k$ levels, recode into $k-1$ dummy indicator variables ($x_i=0,1$) and then regress the same $y$ on each of them univariately.
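A minimal sketch of that setup, assuming pandas/numpy and a hypothetical toy frame (`x_num`, `x_cat`, `label` are illustrative names, not from the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric predictor, one 3-level categorical,
# and a Good/Bad label
df = pd.DataFrame({
    "x_num": [1.2, 3.4, 0.5, 2.8],
    "x_cat": ["a", "b", "a", "c"],
    "label": ["good", "bad", "good", "bad"],
})

# Artificial regression target: +1 for good, -1 for bad
y = np.where(df["label"] == "good", 1.0, -1.0)

# Recode a k-level categorical into k-1 dummy indicators (0/1)
dummies = pd.get_dummies(df["x_cat"], prefix="x_cat", drop_first=True)
print(list(dummies.columns))  # ['x_cat_b', 'x_cat_c']
```

Each resulting column (the numeric predictor, and each dummy indicator) would then be used as the single $x$ in its own univariate regression on $y$.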

During each univariate run, predict the $\hat{y}_i$ values and assign $\hat{y}_i>0$ to good and $\hat{y}_i \leq 0$ to bad. Then count the correctly classified objects out of 23, and you have the classification accuracy. (Actually, sensitivity/specificity is what you want: with $n=100$ objects of which 95 are normal and 5 are tumor, a classifier that does nothing and assigns normal to everything still achieves 95% accuracy, which cannot happen with sensitivity/specificity.)
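One univariate run can be sketched as follows, assuming numpy and a made-up, linearly separable toy predictor (the data here are illustrative, not the OP's 23 observations):

```python
import numpy as np

# Hypothetical toy data: one continuous predictor, labels +1 = good, -1 = bad
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# Univariate least-squares fit: y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Threshold the fitted values at 0 to get class assignments
pred = np.where(y_hat > 0, 1.0, -1.0)

accuracy = np.mean(pred == y)
sensitivity = np.mean(pred[y == 1.0] == 1.0)    # fraction of good cases caught
specificity = np.mean(pred[y == -1.0] == -1.0)  # fraction of bad cases caught
print(accuracy, sensitivity, specificity)       # 1.0 1.0 1.0 on this separable toy set
```

Repeating this loop over each predictor (or dummy indicator) gives a per-predictor sensitivity/specificity, which is more informative than raw accuracy when the two classes are unbalanced.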

What is described above will work if the data are linearly separable. There are numerous other options, but you first need to find out whether the two classes are linearly separable based on single predictors.

  • Love the diagram, and you raise some excellent points. Note that perfect separation on a particular data set is not always desirable in terms of application to new data or coefficient estimation; see [this thread](http://stats.stackexchange.com/q/11109/28500) and its links. Also, in the case presented by the OP, with only 23 cases to fit a binary classification, evaluating any more than 1 unpenalized predictor will be prone to substantial over-fitting. – EdM Dec 13 '16 at 01:43
  • Correct. (Never said anything about more than one predictor). –  Dec 13 '16 at 02:07