
Problem Description

I want to enhance a photo and I have three software packages to do that, two of which are my own. I show users two photos at a time and ask them to choose the better one. I then compare software 1 vs. 2 and 1 vs. 3 (software 2 and 3 are mine). From the votes I can clearly see that people choose 2 and 3 over 1, in both sessions.

Task 1: Show that 2 is better than 1, and that 3 is better than 1.

A published paper (not on statistics) handled the same task and used a paired t-test. I can do the same and show that 2 is the better method. I intend to do that in MATLAB as follows:

    [h, p] = ttest(votes_for_software1, votes_for_software2)  % similarly for 1 vs. 3

I do the above and I get,

h = 1 and p = 7.2372e-04

This confirms (at least I think so) that, under the null hypothesis, a mean difference of the observed magnitude or greater would occur with probability p (interpretation taken from here).

So different people are looking at the same set of photos and voting on them; am I right in using a paired t-test? And am I doing the overall process right (including the implementation)?

Task 2: Is there any way I can compare software 2 and 3 (both of which are mine)?

The data from the software 1 vs. 2 and 1 vs. 3 comparisons stay the same. So I compute the mean rank obtained by each of the three software packages (lower is better), and I can see that rank_3 < rank_2 < rank_1, i.e. my software 3 is the best.
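In MATLAB I would compute these mean ranks roughly as follows (`votes12` and `votes13` are hypothetical names for the two sessions' N-by-2 0/1 vote matrices, with a 1 marking the photo the subject preferred and software 1 in column 1):

    % In each trial the preferred software gets rank 1, the other rank 2.
    ranks12 = 2 - votes12;
    ranks13 = 2 - votes13;

    mean_rank_1 = mean([ranks12(:,1); ranks13(:,1)]);  % software 1 appears in both sessions
    mean_rank_2 = mean(ranks12(:,2));
    mean_rank_3 = mean(ranks13(:,2));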

How could I show this statistically?

Autonomous

1 Answer


In general, you are getting it right. I would suggest two modifications:

  1. If I understand you correctly, the two values the variables can take are 1 (better picture) and 0 (worse picture). This means that the data type is not really interval/ratio/absolute but rather ordinal. In that case I would use a sign test, which tests whether the subjects choose one software over the other more often than not (see the sketch after this list). It has the additional advantage that it makes fewer assumptions about the distribution of the data. For the t-test you either need normally distributed data (not true in your case) or a lot of data points (how many depends on how many subjects you have).

  2. You are performing two tests on the same data. For that reason you should account for multiple comparisons. The simplest way (Bonferroni correction) is to multiply each p-value by two.
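A minimal MATLAB sketch of both suggestions, assuming the paired 0/1 votes are stored in vectors named like those in the question (`votes_for_software3` being the analogous, hypothetical vector for the 1 vs. 3 session; `signtest` is part of the Statistics and Machine Learning Toolbox and returns `[p, h]`, the reverse of `ttest`'s `[h, p]`):

    % Paired sign tests: do subjects prefer one software more often than not?
    [p12, h12] = signtest(votes_for_software2, votes_for_software1);
    [p13, h13] = signtest(votes_for_software3, votes_for_software1);

    % Bonferroni correction for the two comparisons:
    % multiply each p-value by the number of tests, capping at 1.
    p_corrected = min([p12, p13] * 2, 1);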

One additional thought: what do you do if the subjects rank the software inconsistently, e.g. 1 < 2, 2 < 3, but 3 < 1?

Concerning the second task: why don't you show the subjects photos from 2 and 3 and let them vote on which one is better? I think this would be the cleanest way.

fabee
  • Interesting. Actually I use the `t-test` on the number of votes given by `n` people, so the input to `ttest` would be two columns of the voting matrix, hence integers. But if the sign test does not make that assumption, I can easily carry it out on a `0/1` voting matrix. I have a total of 1000 votes (i.e. a `1000x2` matrix), if that adds any information. 2. Can you elaborate on how I can do the multiple-comparison correction (regarding implementation)? 3. I cannot repeat the experiments, sorry. 4. The sign test lacks statistical power compared to the t-test (wiki). What does that mean? – Autonomous Dec 06 '14 at 21:56
  • I don't understand what you mean by `1 < 2 < 3 < 1`. Users are shown two photos at a time and have to vote for one of them. Also, 1 vs. 2 and 1 vs. 3 are disjoint sessions. – Autonomous Dec 06 '14 at 21:59
  • How about the [Wilcoxon signed-rank test](http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test)? – Autonomous Dec 06 '14 at 23:55
  • Inconsistent rating: people are shown two photos at a time. The content is the same, but the processing has been done with different software. Now a single user could rate software 1 better than software 2, software 2 better than software 3, and software 3 better than software 1. This is inconsistent, since software 3 should be worse than software 1. There is nothing in the experimental paradigm that excludes this situation. – fabee Dec 07 '14 at 15:09
  • Wilcoxon signed-rank test would work as well. – fabee Dec 07 '14 at 15:10
  • I don't understand how you feed that data into the test at the moment; I think the 0/1 matrix would be better. You are already doing multiple comparisons, since you do one test for software 1 vs. 2 and another for software 1 vs. 3. If I understand correctly, you would like to demonstrate that both are better than 1. You show that with two tests, i.e. multiple comparisons. – fabee Dec 07 '14 at 15:12
  • Power: this quantifies how likely it is that a test correctly rejects the null hypothesis, i.e. how likely your test is to detect a difference if there is one. – fabee Dec 07 '14 at 15:18
  • The 1000 measurements that you have: are these from 1000 different people? Or do you have, e.g., 5 people with 200 pictures each? In that case you don't have 1000 independent measurements. Here I would recommend: 1) compute the difference between the ratings for each picture, e.g. software 1 - software 2 (the value would then be 1 or -1); 2) compute the mean difference for each person; 3) compute a one-sample t-test of those means against the value 0 (see the sketch after these comments). – fabee Dec 07 '14 at 15:22
  • 1. About inconsistency: a single user is asked about 1 vs. 2 once, then 1 vs. 3 once, therefore inconsistency is ruled out. 2. Yes, I primarily want to show that 2 is better than 1 and 3 is better than 1. However, I also want to draw some statistical conclusion about 2 vs. 3. I know from the experiment statistics that the vote margin of 1 vs. 3 is larger than that of 1 vs. 2, hence 3 is better than 2. But can I show this statistically? 3. I have 13 people with 77 pairs each, hence 1001 measurements for 1 vs. 2, and another 1001 measurements for 1 vs. 3. – Autonomous Dec 07 '14 at 20:29
  • To draw conclusions about 2 vs. 3, you could form the hypothesis that 3 is favoured over 1 more often than 2 is favoured over 1. You could test this again with a paired t-test. I wouldn't draw the conclusion that 3 is favoured over 2: people do not behave metrically, and it could happen that they favour 2 over 3 if you tested it directly. – fabee Dec 08 '14 at 10:29
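A minimal MATLAB sketch of the per-subject analysis and the 2 vs. 3 comparison suggested in the comments above. All names are hypothetical: `votes12_sw2` is assumed to be a 13-by-77 0/1 matrix (one row per person, one column per picture pair) with a 1 where software 2 was preferred over software 1, and `votes13_sw3` likewise for software 3 over 1:

    % Convert 0/1 votes to +1/-1 preference scores, then average per person,
    % giving one mean difference per subject (13 values each).
    d12 = mean(2*votes12_sw2 - 1, 2);   % per-person mean preference for 2 over 1
    d13 = mean(2*votes13_sw3 - 1, 2);   % per-person mean preference for 3 over 1

    [h2, p2]   = ttest(d12, 0);    % one-sample t-test: is 2 preferred over 1?
    [h3, p3]   = ttest(d13, 0);    % one-sample t-test: is 3 preferred over 1?
    [h23, p23] = ttest(d13, d12);  % paired t-test: is 3 favoured over 1 more than 2 is?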