Consider a binary classification problem with a small dataset: 15 instances in class 0, 15 instances in class 1, and four features, so the data matrix has size 30 × 4.
I used a simple logistic regression with 10-fold stratified cross-validation to learn a classifier, and the resulting accuracy is about 70% (F1 ≈ 0.72).
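For concreteness, here is a minimal sketch of that setup using scikit-learn. The feature matrix is synthetic random data standing in for the real (unavailable) 30 × 4 matrix, so the resulting score is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data: 30 samples, 4 features, 15 per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = np.repeat([0, 1], 15)

# 10-fold stratified CV, as described in the question.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print(f"mean CV accuracy = {scores.mean():.2f}")
```

Note that with 30 samples and 10 folds, each test fold holds only 3 instances, so individual fold scores are extremely coarse (0, 1/3, 2/3, or 1).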
I was told that my classification results do not make any sense, because the sample size (N = 30) is too small to find any statistically significant difference between the two groups. The argument was based on a simple computation of the binomial standard error, sqrt(p(1-p)/n): with p = 0.5 and n = 30 this gives 1/(2 sqrt(30)) ≈ 9%, which at the 5% significance level yields a confidence interval of roughly ±18%, i.e. about 36–40% wide.
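The numbers in that argument can be checked directly. This uses only the normal approximation to the binomial, with p = 0.5 (chance level) and n = 30, exactly as in the quoted computation:

```python
import math

# Binomial standard error of an accuracy estimate at chance level.
p, n = 0.5, 30
se = math.sqrt(p * (1 - p) / n)   # = 1 / (2 * sqrt(30)) ~ 0.091

# 95% confidence half-width via the normal approximation (z ~ 1.96).
half_width = 1.96 * se            # ~ 0.179, so the full interval is ~36% wide
print(f"SE = {se:.3f}, 95% CI = +/-{half_width:.3f} (width {2 * half_width:.3f})")
```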
QUESTION
I am quite confused, because I do not see how to reconcile the classifier, which is trained on the full feature set, with an assessment of statistical significance via confidence bounds based on the binomial distribution.
UPDATE
I understand that a small sample size may affect the generalisation error, but I can still assess the error bound of the classification results by performing a nested cross-validation and computing the mean error and the standard error arising from different CV splits.
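One way to quantify the split-to-split spread is repeated stratified cross-validation (a simpler sketch than full nested CV, which would add an inner loop for model selection; the data below is again synthetic). Note the caveat: scores from overlapping folds are correlated, so this standard error tends to understate the true variability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the real 30 x 4 matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = np.repeat([0, 1], 15)

# Repeat 10-fold stratified CV 20 times with different shufflings.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")

mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))  # naive standard error of the mean
print(f"accuracy = {mean:.2f} +/- {sem:.3f} over {len(scores)} fold scores")
```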
UPDATE 2
I found this post, which is closely related and contains quite interesting discussions and answers. It might be of interest to readers.