I'm testing the robustness of an image processing algorithm that detects a certain feature in images. The output of the algorithm is simple: 1 if the feature is found, 0 if it is not. An image may or may not truly contain the feature, and the ground truth for each image is recorded.
I apply the algorithm to 10,000 images, both with and without added noise. Ideally, if the algorithm is robust against the noise, its output for the same image should remain identical.
The resulting algorithm output for the 10,000 images looks like this:
          alg on rawImage   alg on (rawImage+noise)   ground truth
image#1          1                    1                    1
image#2          0                    1                    1
image#3          0                    0                    0
image#4          1                    0                    0
...
As seen above, the algorithm extracts the feature in some images but fails in others, and its output can also disagree with the ground truth.
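To make this concrete, here is a minimal Python sketch of how I summarize each output column against the ground truth (the arrays raw_out, noisy_out, and truth are placeholders filled with the four example rows above; in practice they hold all 10,000 values):

```
import numpy as np

raw_out   = np.array([1, 0, 0, 1])  # algorithm output on the raw images
noisy_out = np.array([1, 1, 0, 0])  # algorithm output on the noisy images
truth     = np.array([1, 1, 0, 0])  # ground truth per image

# Plain accuracy of each column against the ground truth
acc_raw   = np.mean(raw_out == truth)
acc_noisy = np.mean(noisy_out == truth)
print(acc_raw, acc_noisy)
```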
Could anyone suggest a measure of the algorithm's robustness or accuracy for my case?
I would also like to compare the algorithm's performance on images with and without added noise, i.e., the first and second columns of the results shown above. Apart from a confusion matrix, is there any other statistical measure I could use?
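By confusion matrix I mean the 2x2 cross-tabulation of the two output columns, roughly like this (a sketch using the same placeholder arrays as above):

```
import numpy as np

raw_out   = np.array([1, 0, 0, 1])  # placeholder for the full 10,000-image column
noisy_out = np.array([1, 1, 0, 0])

# 2x2 agreement table: rows = output on raw image, columns = output on noisy image
table = np.zeros((2, 2), dtype=int)
for r, n in zip(raw_out, noisy_out):
    table[r, n] += 1
print(table)

# Fraction of images whose output is unchanged by the noise
agreement = np.trace(table) / table.sum()
print(agreement)
```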