I need a little bit of guidance. I have to compare the classification performance of multiple algorithms using a simple or paired t-test.

Let's say I have four datasets (A, B, C, D) with training and test samples. I am running 3 algorithms (SIFT, SURF, ORB) and computing the classification accuracy, where 0.9 means that 90% of the images from the test dataset are correctly matched.

Let's say I get the following table:

Datasets (A, B, C, D):

  • SIFT: (0.90, 0.84, 0.90, 0.45)
  • SURF: (0.84, 0.67, 0.45, 0.34)
  • ORB: (0.34, 0.45, 0.45, 0.23)

Can you please guide me on how to compare the performance of these algorithms using some statistical analysis, such as a simple t-test?

Any guidance will be really appreciated. Thanks.

user1388142

3 Answers

The t-test is for comparing 2 groups (or one group to a theoretical value). With 3 groups (tests) you would need ANOVA, and since there is blocking (the generalization of pairing) due to the different datasets, you would use a randomized block ANOVA or a mixed-effects model.
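
For illustration, here is a minimal sketch of a randomized block ANOVA in R, using the accuracies from the question (the variable names are my own; as noted below, the assumptions behind this approach are questionable here):

# Accuracy table from the question: algorithm is the treatment,
# dataset is the block
accuracy  <- c(0.90, 0.84, 0.90, 0.45,   # SIFT
               0.84, 0.67, 0.45, 0.34,   # SURF
               0.34, 0.45, 0.45, 0.23)   # ORB
algorithm <- factor(rep(c("SIFT", "SURF", "ORB"), each = 4))
dataset   <- factor(rep(c("A", "B", "C", "D"), times = 3))

# Treatment effect is tested against the residual variation left
# after removing the block (dataset) effect
summary(aov(accuracy ~ algorithm + dataset))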

However, these methods depend on approximate normality; given the nature of your data it is not likely to be approximately normal, and your sample size is not large enough to invoke the CLT. A permutation test is probably your best option here.

Here is R code for one possible way to do a permutation test:

# Observed accuracies for each algorithm on datasets A-D
SIFT <- c(0.90, 0.84, 0.90, 0.45)
SURF <- c(0.84, 0.67, 0.45, 0.34)
ORB  <- c(0.34, 0.45, 0.45, 0.23)

# Rows are algorithms, columns are datasets (the blocks)
tmpdat <- rbind(SIFT, SURF, ORB)

# Test statistic: gap between the best and worst mean accuracy
tmpfun <- function(m) diff(range(rowMeans(m)))

# First element is the observed statistic; the rest come from shuffling
# the algorithm labels within each dataset (column), which respects the
# blocking
out <- c(tmpfun(tmpdat),
         replicate(9999, tmpfun(apply(tmpdat, 2, sample))))
hist(out)            # permutation distribution of the statistic
abline(v = out[1])   # mark the observed value
mean(out >= out[1])  # permutation p-value
Greg Snow
  • Do you know of any software I can use to do that, say a snippet in Excel or some other tool? – user1388142 Jan 23 '14 at 23:05
  • Hi, thanks a lot. Just one question, as I am not a statistics guy: I have seen the histogram, but how can I infer which algorithm is best from that kind of histogram? Your guidance would be really appreciated. – user1388142 Jan 24 '14 at 01:42
  • I haven't worked it out for this case, but with few observations and the existence of several ties, the achievable significance levels on a permutation test may be somewhat limited. – Glen_b Jan 24 '14 at 05:56
  • Further, if data are dependent (akin to "paired") the number of available combinations is further reduced. – Glen_b Jan 24 '14 at 06:29
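
To make the previous comment concrete, here is a quick back-of-envelope sketch in R (my own illustration, not from the thread): with 3 algorithms shuffled independently within each of 4 datasets, only (3!)^4 distinct relabelings exist, which bounds how small the permutation p-value can get.

# Number of distinct blocked relabelings: each of the 4 datasets can
# order the 3 algorithm labels in 3! ways
n_perms <- factorial(3)^4   # 1296
1 / n_perms                 # smallest achievable p-value, ignoring ties
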
I suggest using a paired t-test, because accuracies on different datasets should not be compared directly. Each dataset you test on should form a pair in your t-test.

Based on your example, you would be doing something like this in R to compare SIFT and SURF:

SIFT <- c(0.90, 0.84, 0.90, 0.45)
SURF <- c(0.84, 0.67, 0.45, 0.34)

# One-sided paired test; H1: SIFT's mean accuracy is greater than SURF's
SIFT_v_SURF <- t.test(SIFT, SURF, paired = TRUE, alternative = "greater")
SIFT_v_SURF  # print the result

Note: by using a t-test you are assuming normality, which may not be the case here.

Marc Claesen
  • What is R, please? – user1388142 Jan 23 '14 at 22:59
  • R is a (free) statistical programming package. You can find more information here: http://www.r-project.org/. – Marc Claesen Jan 23 '14 at 23:01
  • Can you please also tell me how to interpret the values, i.e. how I can claim SURF is better than SIFT? The output is: data: SIFT and SURF, t = -2.9785, df = 3, p-value = 0.9707, alternative hypothesis: true difference in means is greater than 0, 95 percent confidence interval: (-0.6310159, Inf), sample estimates: mean of the differences -0.3525 – user1388142 Jan 24 '14 at 00:33
  • Hi Marc, still waiting for your reply. – user1388142 Jan 28 '14 at 04:06
  • A large $p$-value indicates the null hypothesis is not rejected. Based on your results you tested whether the mean of `SIFT` was not smaller than the mean of `SURF`. You get a large $p$-value, indicating that your results are explained well by the null hypothesis (i.e. they are likely to occur when the null hypothesis is true). Testing the other way around would have been a better choice (since the values for `SIFT` appear to be larger than those of `SURF`). Another solution would have been to use a two-tailed t-test (see the sketch after these comments). – Marc Claesen Jan 28 '14 at 09:12
  • Hi Marc, some more guidance please. I now have 22 observations each from SIFT and SURF. I expect them to be equal, so my null hypothesis is that they are different, and otherwise that they are similar. I used Excel and got a t-value of 3.493, beyond the two-tailed critical value of t (2.345). Is this information sufficient to reject the null hypothesis and say both have almost similar means? – user1388142 Feb 03 '14 at 07:49
  • The null hypothesis of a two-tailed t-test is that the means are the same. If you get a significant result, this signifies the null hypothesis is rejected, i.e. the means are not the same. – Marc Claesen Feb 03 '14 at 09:15
  • I am using the two-sample paired t-test in MATLAB... – user1388142 Feb 03 '14 at 09:43
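
Following up on the comments above, here is a minimal sketch of the two-tailed paired t-test suggested by Marc, using the four accuracies from the question:

SIFT <- c(0.90, 0.84, 0.90, 0.45)
SURF <- c(0.84, 0.67, 0.45, 0.34)
# Two-sided alternative: the mean accuracies differ in either direction
t.test(SIFT, SURF, paired = TRUE, alternative = "two.sided")
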
Usually you don't summarise, since the performance of a specific algorithm is related to the characteristics of the specific dataset. In the literature you see phrases like "Algorithm X won on 5 out of 10 datasets".

However, if the numbers are correct, in your case there is a clear winner, and that is SIFT: it beats all the other algorithms on all datasets.

iliasfl
  • You are right, but I have been asked to use it anyway, due to examiner comments on my paper. – user1388142 Jan 23 '14 at 23:09
  • I raised a question [here](http://stats.stackexchange.com/questions/83186/can-you-do-statistics-with-3-data-points) because I really want to know if this a correct approach from a statistical point of view. – iliasfl Jan 23 '14 at 23:57