
I have a dataset with relatively few data points (20,000) and many features (100), each ranging from 0 to 1. The dataset is divided into two classes with an even distribution. I'm doing a classification task on it and want to compare the result against a reasonable baseline.

One option would be a random-guessing baseline that ignores the features entirely and only considers the class distribution. That would give an accuracy of 0.5, because the classes are evenly distributed.

As the feature-to-data-point ratio increases, spurious correlations make it easier to classify the data points. I therefore take another baseline, where I use the same data but randomly redistribute the class labels over the data points. Then I build the same classifier on that data (B) and compare it to the classifier built on the actual data (A). The difference between (A) and (B) tells me how much I improve over 'seeing patterns in randomness'; the difference between (B) and the random-guessing baseline tells me how easy it is to find spurious correlations in the dataset.
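
To make the comparison concrete, here is a minimal sketch of what I mean, assuming scikit-learn and a synthetic stand-in for the real dataset (logistic regression is just a placeholder classifier):

```python
# Compare (A) real labels vs (B) permuted labels vs the 0.5 random-guessing baseline.
# The data here is a synthetic stand-in for the actual dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((20000, 100))           # 20,000 points, 100 features in [0, 1]
y = rng.integers(0, 2, size=20000)     # two evenly distributed classes (placeholder)

clf = LogisticRegression(max_iter=1000)

# (A): the classifier trained and evaluated on the actual labels
acc_a = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# (B): the same classifier, but with the labels randomly permuted
y_perm = rng.permutation(y)
acc_b = cross_val_score(clf, X, y_perm, cv=5, scoring="accuracy").mean()

print(f"(A) real labels:     {acc_a:.3f}")
print(f"(B) permuted labels: {acc_b:.3f}")
print("random guessing:     0.500")
```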

Is there a name for a baseline like (B) described here?

  • You are asking about a *permutation test*; is this the term you're looking for? – Tim Nov 29 '17 at 15:22
  • @Tim thank you! That is exactly what I'm looking for. I needed to know the name to discuss and reference it in my report. –  Nov 29 '17 at 15:26

1 Answer


What you are describing is a permutation test. Permutation tests are one of the resampling methods used in statistics. If you want to read more on such methods, you can check the introductory book *Introduction to Statistics Through Resampling Methods and R* by Phillip I. Good. You can also find an example in this thread.
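
As a rough illustration, scikit-learn's `permutation_test_score` automates exactly this: it repeats baseline (B) over many label permutations and reports a p-value for the score on the real labels. A minimal sketch, using synthetic placeholder data rather than your actual dataset:

```python
# Permutation test with scikit-learn; the data below is a synthetic placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
X = rng.random((2000, 100))            # smaller placeholder so the sketch runs quickly
y = rng.integers(0, 2, size=2000)

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    scoring="accuracy", cv=5, n_permutations=100, random_state=0,
)

print(f"score on real labels:          {score:.3f}")      # your classifier (A)
print(f"mean score on permuted labels: {perm_scores.mean():.3f}")  # baseline (B)
print(f"p-value:                       {p_value:.3f}")
```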

Tim