
I'm working on a binary classification task, and I've been running into problems generalizing from my cross-validation to my test set (model does great on cross validation, very poorly on test set).

I decided to try using some synthetic data to see if my tuning and validation procedure was reasonable. I thought that I should generate some fake, completely random data and if my methodology is sound, the cross-validation should NOT find any good models because, well, the data is random. So, I:

  1. Created a set of randomly generated data: 1 binary target variable and 10 features (all drawn from a normal distribution)
  2. Ran cross-validation using the k-NN (k-nearest neighbors) algorithm. I did grid-search hyperparameter tuning and feature selection simultaneously within cross-validation (again, all columns of data are randomly generated from a normal distribution)
  3. Compared the best model found in CV to the baseline model (which predicts the same class every time)

What I found is that even though the data is totally randomly generated, the "best" model gives an accuracy of 60% in cross-validation, while the baseline accuracy (on CV) is 55%.
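
For concreteness, here is a minimal R sketch of the kind of sanity check I mean; the fold count, the grid over k, the single-feature "feature selection" and all the names are illustrative, not my exact code:

```r
# Minimal sketch: fully random data, k-NN tuned by 5-fold CV over k and over
# single features, compared with a majority-class baseline. Illustrative only.
library(class)   # provides knn()

set.seed(1)
n <- 700
X <- as.data.frame(matrix(rnorm(n * 10), nrow = n))   # 10 independent N(0,1) features
y <- factor(rbinom(n, 1, 0.5))                        # random binary target

folds <- sample(rep(1:5, length.out = n))             # 5-fold CV assignment

# CV accuracy of k-NN restricted to a given set of feature columns
cv_accuracy <- function(feature_cols, k) {
  correct <- 0
  for (f in 1:5) {
    tr <- folds != f
    pred <- knn(train = X[tr, feature_cols, drop = FALSE],
                test  = X[!tr, feature_cols, drop = FALSE],
                cl    = y[tr], k = k)
    correct <- correct + sum(pred == y[!tr])
  }
  correct / n
}

# Toy "grid search": every single feature crossed with a few values of k
grid <- expand.grid(feature = 1:10, k = c(1, 5, 15, 31))
grid$acc <- mapply(cv_accuracy, grid$feature, grid$k)

baseline <- max(table(y)) / n        # always predict the majority class
best <- grid[which.max(grid$acc), ]
cat("baseline CV accuracy:", round(baseline, 3),
    "| best tuned CV accuracy:", round(best$acc, 3), "\n")
```

Even though nothing in X is related to y, the best tuned CV accuracy usually lands a few points above the baseline simply because the grid search gets to pick the luckiest configuration.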

How can I avoid this? I can't possibly rely on cross-validation if the model finds "patterns" in completely random data.

I would greatly appreciate any ideas or advice on this issue!

EDIT: My real dataset is about 700 rows, so I created my synthetic data to also have 700 rows. I decided to increase this dramatically to see if the problem persists, so I created a synthetic dataset of 10,000 rows and tried the above procedure again. The cross validation error was much closer to the baseline (52% vs. 51%). So it looks like initially, I really did try too many things on too little data and got lucky.
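
As a rough back-of-the-envelope check on why the extra rows helped (the candidate count below is just an illustrative assumption): if every candidate configuration is really chance-level, the best CV accuracy found is the maximum of many Binomial draws, and its excess over 50% shrinks as the number of rows grows.

```r
# Rough sketch: if every candidate model is chance-level, the best observed CV
# accuracy is the max of many Binomial(n, 0.5)/n draws, and its excess over
# 50% shrinks as the number of rows n grows.
set.seed(1)
lucky_gap <- function(n, n_candidates = 200) {
  max(rbinom(n_candidates, n, 0.5)) / n - 0.5
}
sapply(c(700, 10000), lucky_gap)   # the "lucky best" gap is far smaller at n = 10000
```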

My question then becomes this: on my smaller dataset, what could I do to keep cross-validation useful? I mean, if it's going to overfit every time, how can I combat that to give a more reasonable model or error estimation?

Is the answer nested cross validation?

EDIT: I tried nested cross-validation, and that's exactly what I needed to do. I forgot about the separation of model selection and model evaluation. K-fold cross-validation provides a best model (model selection), but can't give an accurate estimate of that model's out-of-sample performance. So, with nested cross-validation, I saw that no model selected through cross-validation on this dataset was able to generalize to the test sets, which is exactly what I expected.
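
For reference, this is roughly what I mean by nested cross-validation, as a self-contained sketch on the same kind of random data (the grid, fold counts and variable names are illustrative, not my exact code):

```r
# Nested CV sketch: the inner loop does the tuning, the outer loop estimates
# how the *selected* model performs on data it never touched. Illustrative only.
library(class)   # provides knn()

set.seed(1)
n <- 700
X <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
y <- factor(rbinom(n, 1, 0.5))
grid <- expand.grid(feature = 1:10, k = c(1, 5, 15, 31))

outer <- sample(rep(1:5, length.out = n))   # outer folds: evaluation only
outer_acc <- numeric(5)

for (o in 1:5) {
  tr    <- outer != o                                 # outer training portion
  inner <- sample(rep(1:5, length.out = sum(tr)))     # inner folds: tuning only

  # Inner CV: score every grid point using only the outer-training rows
  inner_acc <- apply(grid, 1, function(g) {
    hits <- 0
    for (i in 1:5) {
      fit_rows <- which(tr)[inner != i]
      val_rows <- which(tr)[inner == i]
      pred <- knn(X[fit_rows, g["feature"], drop = FALSE],
                  X[val_rows, g["feature"], drop = FALSE],
                  y[fit_rows], k = g["k"])
      hits <- hits + sum(pred == y[val_rows])
    }
    hits / sum(tr)
  })

  # Refit the selected configuration and score it on the untouched outer fold
  best <- grid[which.max(inner_acc), ]
  pred <- knn(X[tr, best$feature, drop = FALSE],
              X[!tr, best$feature, drop = FALSE],
              y[tr], k = best$k)
  outer_acc[o] <- mean(pred == y[!tr])
}

round(mean(outer_acc), 3)   # honest estimate of the selected models' accuracy
```

The inner loop only does model selection; the outer folds, which the selected configuration never influenced, give the performance estimate, and on random data it hovers around 50%, i.e. chance.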

Thank you all for your advice, discussions and responses.

  • How *exactly* do you create "randomly generated" data? – whuber Feb 01 '21 at 21:00
  • @whuber I used the R function rnorm() to draw 700 samples from a normal dist. with mean=0, sd=0.01. I then took shorter, slightly offset sets from those 700 points to create features. So, to be more clear: target_variable=setof_random_numbers[10:700] (converted to a binary 0/1 depending on if it's positive or negative), feature1=setof_random_numbers[9:699], feature2=setof_random_numbers[8:698] etc.. I don't see this being problematic, since a single sample/draw never appears twice in a row (since the sets are "offset" by 1). – Vladimir Belik Feb 01 '21 at 21:04
  • The features are definitely not independent! Why didn't you just draw independent random values for every entry in your data table? – whuber Feb 01 '21 at 21:47
  • @whuber I just tried what you suggested (drawing 700 random samples for each column/feature) and the result is the exact same. They are independent because every sample is independent, so one sample can't predict the next draw/sample, so it doesn't matter that I'm using a lagged/shifted feature set. There is also no leakage because one sample doesn't ever appear twice in a single row. Maybe I should have initially done it this way (using rnorm() 11 different times) so that it's easier to explain. – Vladimir Belik Feb 01 '21 at 21:55
  • @whuber I added an edit into my question where I drastically increased the amount of data used, and the issue almost went away (again, I really don't think a lack of independence between features and target variable is the issue here). – Vladimir Belik Feb 01 '21 at 22:00
  • I agree, but it was worth examining the issue. – whuber Feb 01 '21 at 22:00
  • +1 for the fact that you sanity-checked your approach on completely random data. That is a great idea! Also, part of the problem stems from [well-known weaknesses of accuracy as an evaluation criterion](https://stats.stackexchange.com/q/312780/1352). – Stephan Kolassa Feb 02 '21 at 07:13
  • @StephanKolassa Thank you! And thank you for the link. I figured that accuracy is fine for a 50/50 balanced binary classification problem, and I still think it's reasonable, but I completely forgot about the idea of using probabilities instead (that's called "entropy" or "log-loss", right?) – Vladimir Belik Feb 02 '21 at 14:30
  • People speak of "probabilistic predictions", which they assess using *proper scoring rules*, one of which is the log-loss. – Stephan Kolassa Feb 02 '21 at 14:31
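
A quick illustration of the log-loss mentioned in these comments (the predicted probabilities below are made up for the example):

```r
# Scoring probabilistic predictions with the log-loss, a proper scoring rule.
y_true <- c(1, 0, 1, 1, 0)                 # observed classes
p_hat  <- c(0.9, 0.2, 0.6, 0.8, 0.4)       # predicted P(y = 1)

log_loss <- -mean(y_true * log(p_hat) + (1 - y_true) * log(1 - p_hat))
accuracy <- mean((p_hat > 0.5) == y_true)  # thresholded accuracy discards the
                                           # confidence that the log-loss keeps
cat("log-loss:", round(log_loss, 3), " accuracy:", accuracy, "\n")
```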

1 Answer


From your description, everything seems to work "as designed".

  • Machine learning algorithms work by finding patterns in the data. The patterns are used to make classifications that achieve the best value of the cost function on the training data. If your data is random*, then the better the performance on the training set, the more your algorithm is picking up spurious patterns and overfitting them.
  • When you use a validation set to choose hyperparameters, you are comparing the performance of different models on the validation set and picking the one with the best value of the validation-set cost function. If in the previous step you were choosing a model that potentially overfits the training data, here you are prone to choosing the model that overfits the validation set (the short simulation after this list illustrates the effect).
  • In the end, you test your model on the hold-out test set. This measures the overfitting, since that data was "not seen" by the model in either of the previous steps. If your model overfits (and on random data it can only perform poorly on the training set or overfit to it), the test-set metrics help you identify that. However, if after looking at the test-set metrics you decide to make improvements to your model, this can easily lead to cherry-picking and overfitting to the test set.
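
To see the second point concretely, here is a short simulation of that selection effect (the fold size and candidate counts below are made-up illustrative values):

```r
# Selection effect on purely random data: every candidate model is a coin
# flip on the validation fold, but the *best* of many coin flips looks
# increasingly good as the number of candidates grows.
set.seed(1)
n_val   <- 140                      # size of one validation fold (~700 / 5)
n_cands <- c(1, 10, 100, 1000)      # number of model/hyperparameter combinations tried

best_val_acc <- sapply(n_cands, function(m) {
  # each candidate's validation accuracy is a Binomial(n_val, 0.5) / n_val draw
  max(rbinom(m, n_val, 0.5)) / n_val
})
print(data.frame(candidates = n_cands, best_validation_accuracy = best_val_acc))
```

Nothing is learned at any point, yet the score of the selected "best" candidate climbs with the number of candidates tried, which mirrors the 60% vs. 55% gap in the question.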

There is no way of "fixing" this. The remedies are the standard machine learning procedures and tricks: cross-validation, regularization, picking reasonable models for the task (e.g. not using neural networks when your data is small), etc. Beyond that, you should always worry about overfitting and check that it is not happening.

* I assume here that by "random data" you mean that the feature values are random and unrelated to the classes; otherwise the answer would be more complicated.

Tim
  • So, are you saying that what is happening is simply that I am overfitting the validation sets? As in, I tried enough different things enough times that some of it happened to work considerably better than chance throughout my validation folds? (That's correct, my features are randomly generated numbers that are not in any way connected to my target variable). – Vladimir Belik Feb 01 '21 at 20:45
  • @VladimirBelik what else would you expect? If the data is "random" there's nothing to learn, you can only overfit. So the more models & hyperparameters you try, the greater the chance of finding some combination that overfits the validation set. – Tim Feb 01 '21 at 20:49
  • As mentioned in my question, my expectation was that cross validation would not find a model that performs considerably better than chance. I guess I'm just confused because this means that in cross validation, I literally can't distinguish a useful feature/feature-set from a completely irrelevant one. – Vladimir Belik Feb 01 '21 at 20:54
  • Maybe I don't understand the purpose of cross-validation - I thought the point was to provide a method to test different features/hyperparameters that was relatively robust to overfitting (since I'm not just using a single validation set). – Vladimir Belik Feb 01 '21 at 20:56
  • @VladimirBelik cross validation here means picking the model that performs best on the validation set, nothing more than this. Other than not training the model on the validation set, there's nothing that prevents overfitting. You could try k-fold cross validation instead. – Tim Feb 02 '21 at 06:12
  • I understand, thank you. I forgot about the concept of separating model selection and evaluation - cross validation selects the best model, but doesn't provide an accurate estimate of its out-of-sample performance. I'm sorry I didn't make it clear in my question, but k-fold cross validation is exactly what I tried. – Vladimir Belik Feb 02 '21 at 14:20
  • I tried nested cross validation instead of k-fold cross validation and it gave me precisely the response I was expecting - even the best k-fold cross validation models perform no better than chance on the out-of-sample sets. – Vladimir Belik Feb 02 '21 at 14:21