
I have built machine learning models for a classification problem with four classes. They achieve around 70% accuracy under nested cross validation.

I am looking to do further testing to check for overfitting. If I gave my models a dataset of randomly generated numbers with random labels between 1 and 4, could I use this to check whether they are overfitting to the data?

My assumption would be that, at random with 4 labels, they should only reach ~25% accuracy by chance. If a model achieves notably higher accuracy on such data, could that suggest overfitting?

I have been trying to find papers that explore this, but I have found none. I am new to machine learning; am I missing something important about why using randomized data in this way would not work as a check for overfitting?
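For concreteness, a minimal sketch of the check I have in mind (scikit-learn is assumed, and RandomForestClassifier is only a placeholder for my actual models):

    # Feed the model pure noise with random labels 1-4 and see whether
    # cross-validated accuracy stays near the 25% chance level.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X_noise = rng.normal(size=(500, 20))      # random features, no signal
    y_noise = rng.integers(1, 5, size=500)    # random labels 1-4

    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X_noise, y_noise, cv=5, scoring="accuracy")
    print(scores.mean())                      # ~0.25 expected by chance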

DN1

2 Answers


The usual way is to compare your success metric on the training data and on held-out test data; a large gap between the two generally signals overfitting. Also, accuracy isn't a good metric; check here for a good explanation.

Moreover, testing with random data has problems of its own. Even if your model overfits the data, it can still easily reach 25% accuracy on random data. For example, suppose your classifier does its job very badly and assigns all samples to a single class (due to overfitting or underfitting; it might even do this while being a very good classifier, if the data are generated accordingly). Now distribute random class labels to those samples: you will again end up with roughly 25% accuracy. This tells you nothing about overfitting.
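A minimal sketch of that train-vs-test comparison, assuming a scikit-learn setting; the synthetic dataset and the random forest are stand-ins for whatever data and model you actually use:

    # Compare the metric on training data vs. held-out data; a large gap
    # suggests overfitting.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_classes=4,
                               n_informative=6, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("train:", accuracy_score(y_tr, model.predict(X_tr)))
    print("test: ", accuracy_score(y_te, model.predict(X_te)))
    # a training score near 1.0 with a clearly lower test score is the
    # classic overfitting signature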

gunes

Yes, you can do this: it is closely related to so-called permutation tests (you would be doing a permutation test under the null hypothesis of indistinguishable classes).
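A minimal sketch of such a test with scikit-learn's permutation_test_score; the estimator and the synthetic data are only stand-ins for your own setup:

    # Refit the model on many random shufflings of y and compare the real
    # cross-validated score against that null distribution.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import permutation_test_score

    X, y = make_classification(n_samples=500, n_classes=4,
                               n_informative=6, random_state=0)
    score, perm_scores, p_value = permutation_test_score(
        RandomForestClassifier(random_state=0), X, y, cv=5,
        n_permutations=30,                 # kept small for speed
        scoring="accuracy", random_state=0)

    print(score, perm_scores.mean(), p_value)
    # the permuted scores should sit near 0.25; a real score far above
    # them speaks against the null hypothesis of indistinguishable classes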

But: there are types of overfitting that you can only detect if the permutation test incorporates the correct assumptions about the data structure.

Consider the following situation:

  • Your input data has clusters of similar rows that truly belong to the same class. But you do not know this, and you model each row as independent of all other rows.
  • The model overfits: it learns the clusters "by heart", i.e. while it has good predictive ability for unknown rows of known clusters, its predictions for rows of unknown clusters are bad.
  • The nested cross validation won't detect this: you would need to implement splitting-by-cluster in order to detect this overfitting (see the sketch after this list). But you cannot do that, as you do not know about the clusters.
  • Neither will the permutation test: it can only detect the overoptimism in the cross validation results if the labels are randomized by cluster (which, again, you cannot do without knowing the clusters...).
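A minimal sketch of this cluster problem, assuming the cluster membership were known (the groups array below is hypothetical; in the situation described above you would not have it):

    # Rows within a cluster are near-duplicates and share one class label,
    # so row-wise CV leaks cluster information while cluster-wise CV does not.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, KFold, cross_val_score

    rng = np.random.default_rng(0)
    n_clusters, rows_per_cluster = 80, 5
    centers = rng.normal(size=(n_clusters, 20))
    X = (np.repeat(centers, rows_per_cluster, axis=0)
         + 0.05 * rng.normal(size=(n_clusters * rows_per_cluster, 20)))
    y = np.repeat(rng.integers(0, 4, size=n_clusters), rows_per_cluster)
    groups = np.repeat(np.arange(n_clusters), rows_per_cluster)

    clf = RandomForestClassifier(random_state=0)
    print(cross_val_score(clf, X, y,
                          cv=KFold(5, shuffle=True, random_state=0)).mean())
    print(cross_val_score(clf, X, y, groups=groups,
                          cv=GroupKFold(5)).mean())
    # row-wise CV looks excellent because each test row has near-twins in
    # training; cluster-wise CV reveals nothing generalizes to new clusters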

Some more thoughts:

  • There are situations where you should get worse-than-guessing performance in the permutation test of a cross validation: this happens if, in the absence of useful information, the classifier more or less guesses the majority class; because the row to be tested was removed from the training fold, the test row on average belongs to that fold's minority class. (A small simulation of this effect follows the list.)

  • Permutation tests are useful for detecting data leakage in the cross validation that occurs even when the overall data structure (clustering) is modeled correctly. A typical example of such a leak is preprocessing that includes information from the test data (see the second sketch below).
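A small simulation of the worse-than-guessing effect under assumed conditions (two perfectly balanced classes, leave-one-out CV, and a dummy classifier standing in for a model that falls back to the majority class):

    # Removing the test row always makes its class the minority of the
    # training fold, so the majority-class guess is always wrong.
    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))        # pure noise features
    y = np.repeat([0, 1], 20)           # exactly 20 samples per class

    acc = cross_val_score(DummyClassifier(strategy="most_frequent"),
                          X, y, cv=LeaveOneOut()).mean()
    print(acc)                          # 0.0, not the naively expected 0.5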
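And a sketch of the kind of leak meant in the last point, with a hypothetical preprocessing step (univariate feature selection) fitted once on the full data instead of inside the cross validation. The labels here are random, so this is exactly the over-optimism a permutation test would expose:

    # Selecting features on the *full* dataset before CV leaks test
    # information: pure noise then scores well above the 25% chance level.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2000))    # noise only, no real class signal
    y = rng.integers(0, 4, size=100)

    # leaky: the selection has already seen the test folds
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    print(cross_val_score(LogisticRegression(max_iter=1000),
                          X_leaky, y, cv=5).mean())      # well above 0.25

    # correct: selection happens inside each training fold only
    pipe = make_pipeline(SelectKBest(f_classif, k=20),
                         LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y, cv=5).mean())      # back near chance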

cbeleites unhappy with SX
  • Thank you for this; it is now much clearer to me. I have also been looking for papers on this topic to understand it further. If you happen to know of any, that would be really helpful. – DN1 Jul 19 '19 at 09:53