I tried to reproduce the experiments described in this paper and want to compare the output of my system with the one reported in the article. I am looking for a statistical test to compare the two systems (I'm not very confident when it comes to hypothesis testing).
To be more precise, the dataset consists of images of certain categories of rooms (bathroom, living room, etc.) acquired in 6 different houses. In the experiment, we use 5 houses to train a classifier and the 6th to test it (the author calls it leave-one-out, but strictly speaking I'm not sure it's exactly that). The result is the correct classification rate for each room in each "fold".
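To make the setup concrete, here is a minimal sketch of how I understand the protocol: each house is held out once as the test set while the other five are used for training. The house names and the `leave_one_house_out` helper are placeholders I made up for illustration, not anything taken from the paper.

```python
# Hypothetical house identifiers (assumption: the real dataset labels differ).
houses = ["house1", "house2", "house3", "house4", "house5", "house6"]


def leave_one_house_out(house_ids):
    """Yield (train_houses, test_house) pairs, one pair per fold."""
    for test_house in house_ids:
        # Train on every house except the one currently held out.
        train_houses = [h for h in house_ids if h != test_house]
        yield train_houses, test_house


folds = list(leave_one_house_out(houses))

for train_houses, test_house in folds:
    print(f"train on {train_houses}, test on {test_house}")
```

This gives 6 folds, so each system produces 6 classification rates per room, and the rates from the two systems are paired by fold.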
Do you have any idea which statistical test would be appropriate here?