101

I have recently been reading a lot on this site (@Aniko, @Dikran Marsupial, @Erik) and elsewhere about the problem of overfitting occurring with cross validation (Smialowski et al. 2010, Bioinformatics; Hastie et al., The Elements of Statistical Learning). The suggestion is that any supervised feature selection (using correlation with class labels) performed outside of the model performance estimation by cross validation (or another model estimation method such as bootstrapping) may result in overfitting.

This seems unintuitive to me - surely if you select a feature set and then evaluate your model on only those features using cross validation, you are getting an unbiased estimate of generalised model performance for those features (assuming the sample under study is representative of the population)?

With this procedure one cannot, of course, claim an optimal feature set, but can one report the performance of the selected feature set on unseen data as valid?

I accept that selecting features based on the entire data set may result in some data leakage between test and train sets. But if the feature set is static after initial selection, and no other tuning is being done, surely it is valid to report the cross-validated performance metrics?

In my case I have 56 features and 259 cases and so #cases > #features. The features are derived from sensor data.

Apologies if my question seems derivative, but this is an important point to clarify.

Edit: On implementing feature selection within cross-validation on the data set detailed above (thanks to the answers below), I can confirm that selecting features prior to cross-validation introduced a significant bias in this data set. This bias/overfitting was greatest for a 3-class formulation compared to a 2-class formulation. I think the fact that I used stepwise regression for feature selection increased this overfitting; for comparison, on a different but related data set I compared a sequential forward feature selection routine performed prior to cross-validation against results I had previously obtained with feature selection within the CV. The results of the two methods did not differ dramatically. This may mean that stepwise regression is more prone to overfitting than sequential FS, or it may be a quirk of this data set.

BGreene
  • I don't think that is (quite) what Hastie, et al. are advocating. The general argument is that *if feature selection uses the response* then it better be included as part of your CV procedure. If you do predictor screening, e.g., by looking at their sample variances and excluding the predictors with small variation, that is ok as a one-shot procedure. – cardinal May 04 '12 at 10:19
  • +1 however even in this case the cross-validation doesn't represent the variance in the feature selection process, which might be an issue if the feature selection is unstable. If you perform the screening first then the variability in the performance in each fold will under-represent the true variability. If you perform the screening in each fold, it will appropriately increase the variability in the performance in each fold. I'd still always perform the screening in each fold if I could afford the computational expense. – Dikran Marsupial May 04 '12 at 11:19
  • I think the statement "ANY feature selection performed prior to model performance estimation using cross validation may result in overfitting." is a misquote or misrepresentation of what Hastie and others would suggest. If you change the word "prior" to "without" it makes more sense. Also the sentence seems to suggest that cross-validation is the only way to legitimately test the appropriateness of the variables selected. The bootstrap for example might be another legitimate approach. – Michael R. Chernick May 04 '12 at 12:26
  • @MichaelChernick - agreed. I have edited above to better reflect my meaning. – BGreene May 04 '12 at 13:11
  • @Bgreene: there is a recent discussion on this issue that can be read at http://goo.gl/C8BUa. – Alekk Jul 10 '12 at 15:50
  • @Alekk: The link is dead, unfortunately. – Zero3 Sep 22 '15 at 12:53

3 Answers

91

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure will also have been used to choose the features, and this is what biases the performance analysis.

Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
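
A minimal sketch of that thought experiment (assuming 10 cases and 20 candidate features as above; since the "model" is simply the selected feature itself, the error rate of the chosen feature on the data used to choose it is exactly what the biased cross-validation reports here):

NF  = 20;     % number of candidate features (random coin-flip sequences)
NC  = 10;     % number of cases (coin flips of the target)
NMC = 1e+4;   % Monte-Carlo replications

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;   % random target
   x = randn(NC,NF) >= 0;   % random features, independent of the target

   err      = mean(repmat(y,1,NF) ~= x);   % error rate of each feature against the target
   erate(i) = min(err);                    % error rate of the selected (best-matching) feature

end

fprintf(1, 'Apparent error rate of the selected feature: %f (the true error rate is 0.5)\n', mean(erate));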

The key idea is that cross-validation is a way of estimating the generalisation performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).

HTH

Here is some MATLAB code that performs a Monte-Carlo simulation of this set-up, with 56 features and 259 cases to match your example. The output it gives is:

Biased estimator: erate = 0.429210 (0.397683 - 0.451737)

Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)

The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.

NF    = 56;
NC    = 259;
NFOLD = 10;
NMC   = 1e+4;

% perform Monte-Carlo simulation of biased estimator

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;
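   % note: the target and the features are all independent coin flips, so no
   % feature has any real predictive value and the true error rate is 0.5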

   % perform feature selection

   err       = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);

   % perform cross-validation

   partition = mod(1:NC, NFOLD)+1;
   y_xval    = zeros(size(y));

   for j=1:NFOLD

      y_xval(partition==j) = x(partition==j,idx(1));

   end

   erate(i) = mean(y_xval ~= y);

   plot(erate);
   drawnow;

end

erate = sort(erate);

fprintf(1, '  Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of unbiased estimator

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;

   % perform cross-validation

   partition = mod(1:NC, NFOLD)+1;
   y_xval    = zeros(size(y));

   for j=1:NFOLD

      % perform feature selection

      err       = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);

      y_xval(partition==j) = x(partition==j,idx(1));

   end

   erate(i) = mean(y_xval ~= y);

   plot(erate);
   drawnow;

end

erate = sort(erate);

fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
Dikran Marsupial
  • Thank you - this is very helpful. If you take the suggested approach how do you then evaluate your final model? As you will have multiple sets of features, how do you choose the final feature set? Historically I have also reported results based on a single cross validation with model parameters and features chosen. – BGreene May 04 '12 at 13:54
  • It is best to view cross-validation as assessing the performance of a procedure for fitting a model, rather than the model itself. The best thing to do is normally to perform cross-validation as above, and then build your final model using the entire dataset using the same procedure used in each fold of the cross-validation procedure. – Dikran Marsupial May 04 '12 at 13:57
  • In this case are we then reporting classification results based on cross-validation (potentially many different feature sets) but yet reporting the model to contain only one of those feature sets, i.e. cross-validated classification results do not necessarily match the feature set? – BGreene May 04 '12 at 14:13
  • Essentially yes, cross-validation only estimates the expected performance of a model building process, not the model itself. If the feature set varies greatly from one fold of the cross-validation to another, it is an indication that the feature selection is unstable and probably not very meaningful. It is often best to use regularisation (e.g. ridge regression) rather than feature selection, especially if the latter is unstable. – Dikran Marsupial May 04 '12 at 14:25
  • Thank you - can this comment be promoted to a topic (I am unsure of procedure)? I think it would be very useful for people in my field (biomedical signal processing) – BGreene May 04 '12 at 15:39
  • This is such an important post. Amazing how many don't apply this. – Chris A. Sep 21 '12 at 19:31
  • Dikran this is a great post. I thought I would point you to an interesting debate on internal vs external validation in this other thread: http://stats.stackexchange.com/questions/64147/internal-vs-external-cross-validation-and-model-selection It looks like there is some disagreement on the pros and cons of external vs internal validation – Amelio Vazquez-Reina Jul 18 '13 at 00:22
  • @BGreene Is there any good reference you can provide that uses the method of CV for assessing model performance, followed by the development of the final model on the whole dataset using the same procedure? – Sapiens Dec 30 '20 at 06:19
  • @Sapiens Hastie, Tibshirani and Friedman's The Elements of Statistical Learning is the best reference I am aware of. – BGreene Jan 07 '21 at 11:52
  • @DikranMarsupial Thanks for the answers! I have an interesting question. Suppose your boss hands you a dataset of say 30 features and says nothing. How do you know that these 30 features were not selected beforehand based on correlation with the entire dataset labels as described by Hastie? You can imagine that your boss started out with say 500 features initially and reduced it to 30 features based on correlation and hands it to you. You have no idea this was done, so you do cross validation on 30 features and unknowingly report a lower than expected error. What do you do in this situation? – woowz Dec 21 '21 at 09:16
  • @woowz, unfortunately I don't think there is much you can do in those circumstances, unless you can get a second sample of data that you can use for testing. – Dikran Marsupial Dec 21 '21 at 09:27
  • @DikranMarsupial Thanks! On another note, in your example it seems like you've found a feature that does well on the training set but not in general. So it seems like you have a lack of data. If you increase NC (The number of samples) and decrease NF (number of features) you'll find that the bias diminishes and cross validation returns the correct result. Hastie's case is even more extreme, with 5000 features and only 50 samples. – woowz Dec 21 '21 at 13:13
  • So in practice, especially in large data sets and if you select your features wisely (e.g. based on domain knowledge), I don't think feature selection before cross validation is a problem. E.g., if we want to predict a person's height, we might select the features gender and age because they tend to correlate with height, and I don't think my cross validation result is biased just because I didn't do the feature selection inside a cv loop. Do you agree? – woowz Dec 21 '21 at 13:13
  • @woowz if you want an unbiased performance estimate, then you have to perform feature selection independently in each trial. Of course adding data and reducing the number of choices to be made will reduce the bias, but how will you know whether the bias in your particular application is negligible without performing the unbiased analysis? – Dikran Marsupial Dec 21 '21 at 13:16
  • "I dont think my cross validation result is biased just because I didnt do the feature selection inside a cv loop. Do you agree?" No, the estimate is biased because it has been directly optimised. Whether that bias is negligible is another matter, but the bias *is* there. – Dikran Marsupial Dec 21 '21 at 13:17
  • Imagine this case: you have 500 features and you do nested cv. In the inner loop, you do feature selection of the top 30 features that correlate with the target and tune the model. In the outer loop you evaluate the performance of this procedure. If the process is stable, then we'll select the same 30 features each time, we'll get the same model each time and our estimate of the error will be accurate. And because your feature selection process is stable, you can first select these 30 features using your feature selection process and then do cv and it will give you the same estimate. – woowz Dec 21 '21 at 14:27
  • So there is no difference between the two methods now. In your example, your feature selection process is unstable. When you increase the data set size you make it more stable. So this is how you can say the bias is diminished or even negligible. Do you agree? – woowz Dec 21 '21 at 14:28
  • @woowz nested and non-nested cross-validation can be expected to be asymptotically unbiased. However if we have an effectively infinite dataset then we have no need of cross-validation in the first place (I am not generally very reassured by asymptotic properties). In practical applications, however, feature selection is rarely stable and we tend to use cross-validation when data are limited (as it makes efficient use of what data we have). So if cross-validation is a sensible option for feature selection, better to use nested CV, especially as parallel (e.g. multi-core) computing is cheap. – Dikran Marsupial Dec 21 '21 at 21:45
  • @DikranMarsupial I'm under the impression that for your outer loop cv results to be accurate you should try to get your inner loop procedure to be as stable as possible. So if feature selection is not stable then maybe it's best to not do it. Just include all your features and do regularization, hopefully that will help with model stability? I do not know, have been thinking for hours and am going to sleep now :)) – woowz Dec 21 '21 at 22:56
  • @woowz no, the whole point of nested cross-validation is for the variability/uncertainty in the feature selection to be properly accounted for in the performance estimate. If you want an accurate estimate, you need to use nested cross-validation. An accurate estimate is one with low error, and that requires you to reduce bias as well as variance. The best reason for performing feature selection is that identifying the relevant features is a specific goal of the analysis - I usually advise against it as a means of improving generalisation because usually it doesn't. – Dikran Marsupial Dec 21 '21 at 23:15
  • @DikranMarsupial It seems that you and cbeleites are in disagreement. In this thread https://stats.stackexchange.com/questions/31190/variance-estimates-in-k-fold-cross-validation?rq=1 cbeleites mentions the assumptions of cross validation, in particular that the k "surrogate" models "have the same true performance (are equivalent, have stable predictions), so you are allowed to pool the results of the k tests". In other words, you need stable results. If your feature selection is unstable, then your model built on these features will be unstable and your estimate is not accurate. – woowz Dec 21 '21 at 23:32
  • @woowz no, I don't see any disagreement. That question is about the variance of cross-validation, not about the bias introduced in performance estimation by model/feature selection. I have repeatedly explained why nested cross-validation gives a more accurate performance estimate in that setting, and you have not engaged with the points raised, so I will leave it there as I have learned that the chance of productive on-line discussion is rather low when that happens. – Dikran Marsupial Dec 21 '21 at 23:48
  • @DikranMarsupial There might be a misunderstanding, or it's probably me not understanding something. I do understand the bias problem, which I mentioned will be diminished with more data, and you followed up by mentioning that it can be expected to be asymptotically unbiased. So I see no issues there. Thanks a lot for the discussion anyway! – woowz Dec 22 '21 at 00:03
15

To add a slightly different and more general description of the problem:

If you do any kind of data-driven pre-processing, e.g.

  1. parameter optimization guided by cross validation / out-of-bootstrap
  2. dimensionality reduction with techniques like PCA or PLS to produce input for the model (e.g. PLS-LDA, PCA-LDA)
  3. ...

and want to use cross validation/out-of-bootstrap(/hold out) validation to estimate the final model's performance, the data-driven pre-processing needs to be done on the surrogate training data, i.e. separately for each surrogate model.
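
As a concrete illustration of "separately for each surrogate model" for pre-processing of type 2 (a toy sketch with assumed sizes, using PCA as the pre-processing and a placeholder where the actual model would be fitted): the centring and rotation are estimated from the training portion of each fold and merely applied to the held-out portion.

N = 200; D = 20; K = 5; ncomp = 3;            % assumed toy sizes
X = randn(N,D); y = randn(N,1) >= 0;          % toy data and labels
fold = mod(0:N-1, K) + 1;

for j=1:K

   Xtr = X(fold~=j,:);
   mu  = mean(Xtr,1);                         % centring estimated from the training portion only
   [~,~,V] = svd(bsxfun(@minus, Xtr, mu), 'econ');
   W   = V(:,1:ncomp);                        % PCA loadings estimated from the training portion only

   Ztr = bsxfun(@minus, Xtr, mu) * W;           % training scores: fit the model on these
   Zte = bsxfun(@minus, X(fold==j,:), mu) * W;  % test scores: project only, do not re-estimate the PCA

   % ... fit the model on (Ztr, y(fold~=j)) and evaluate it on (Zte, y(fold==j)) ...

end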

If the data-driven pre-processing is of type 1, this leads to "double" or "nested" cross validation: the parameter estimation is done in a cross validation using only the training set of the "outer" cross validation. The Elements of Statistical Learning has an illustration (https://web.stanford.edu/~hastie/Papers/ESLII.pdf, page 222 of the 5th printing).
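
A skeleton of such a double/nested cross validation (a sketch on toy data with assumed sizes, using a hand-rolled ridge regression so that it runs in base MATLAB, with the ridge parameter as the setting tuned in the inner loop):

N = 120; D = 10;                        % assumed toy sizes
X = randn(N,D); y = randn(N,1);         % pure-noise data, just to make the sketch runnable
lambdas = 10.^(-3:3);                   % candidate values of the tuned parameter
KOUT = 5; KIN = 5;

outer = mod(0:N-1, KOUT) + 1;           % outer folds: used only for performance estimation
sse   = 0;

for i=1:KOUT

   Xtr = X(outer~=i,:); ytr = y(outer~=i);    % outer training set
   inner = mod(0:size(Xtr,1)-1, KIN) + 1;     % inner folds, defined on the outer training set only

   innersse = zeros(size(lambdas));
   for k=1:numel(lambdas)
      for j=1:KIN
         Xi = Xtr(inner~=j,:); yi = ytr(inner~=j);
         w   = (Xi'*Xi + lambdas(k)*eye(D)) \ (Xi'*yi);    % ridge fit on inner training data
         res = ytr(inner==j) - Xtr(inner==j,:)*w;          % inner validation residuals
         innersse(k) = innersse(k) + sum(res.^2);
      end
   end
   [~, best] = min(innersse);                              % parameter chosen inside the fold

   w   = (Xtr'*Xtr + lambdas(best)*eye(D)) \ (Xtr'*ytr);   % refit on the whole outer training set
   res = y(outer==i) - X(outer==i,:)*w;                    % assessed on the untouched outer test fold
   sse = sse + sum(res.^2);

end

fprintf(1, 'nested-CV estimate of the mean squared error: %f\n', sse/N);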

You may say that the pre-processing is really part of building the model. Only pre-processing that is done

  • independently for each case or
  • independently of the actual data set

can be taken out of the validation loop to save computations.

So the other way round: if your model is completely built by knowledge external to the particular data set (e.g. you decide beforehand, from your expert knowledge, that measurement channels 63 - 79 cannot possibly help to solve the problem), you can of course exclude these channels, build the model and cross-validate it. The same holds if you do a PLS regression and decide from your experience that 3 latent variables are a reasonable choice (but do not play around with whether 2 or 5 latent variables give better results): then you can go ahead with a normal out-of-bootstrap/cross validation.

cbeleites unhappy with SX
  • Unfortunately the link for print 5 of the ElemStatLearn book is not working. I was wondering if the illustration you were referring to is still on the same page. Please mention the caption too. –  May 31 '17 at 10:47
  • So, if I have two sets of data, do feature selection/engineering on one of them, and CV on the other, there would be no problems? – Milos Feb 20 '18 at 00:13
  • @Milos: no, as long as those features become fixed parameters for the models for cross-validation, that should be OK. This would be a proper hypothesis generation (= feature development on data set A) / hypothesis testing (= measuring performance of the now fixed features with data set B) setup. – cbeleites unhappy with SX Feb 20 '18 at 22:01
  • @cbeleites Yes, that is what I intended to do. Determine features on A, then fix those features and do cross-validation for the models on B. Thanks. :) – Milos Feb 21 '18 at 01:49
  • @Milos: keep in mind, though, that your argumentation for the achieved performance is even better if you fully train your model on A and then use B *only* for testing. – cbeleites unhappy with SX Feb 21 '18 at 13:02
5

Let's try to make it a little bit intuitive. Consider this example: you have a binary dependent and two binary predictors, and you want a model with just one predictor. Both predictors have a chance of, say, 95% of agreeing with the dependent and a 5% chance of disagreeing with it.

Now, by chance, on your data one predictor agrees with the dependent on the whole data set 97% of the time and the other one only 93% of the time. You will pick the predictor with 97% and build your models. In each fold of the cross-validation you will have the model dependent = predictor, because it is almost always right. Therefore you will get a cross-validated performance estimate of 97%.

Now, you could say, OK, that's just bad luck. But if the predictors are constructed as above, then you have a 75% chance that at least one of them has an accuracy >95% on the whole data set, and that is the one you will pick. So you have a 75% chance of overestimating the performance.
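
A small simulation of this setup (assuming 100 cases; both predictors truly agree with the dependent 95% of the time, and we record the apparent accuracy of whichever one looks best on the whole sample):

n     = 100;    % assumed number of cases
NMC   = 1e+4;   % Monte-Carlo replications
truep = 0.95;   % each predictor truly agrees with the dependent 95% of the time

apparent = zeros(NMC,1);

for i=1:NMC

   agree1 = rand(n,1) < truep;   % does predictor 1 agree with the dependent on each case?
   agree2 = rand(n,1) < truep;   % does predictor 2 agree?

   apparent(i) = max(mean(agree1), mean(agree2));   % apparent accuracy of the predictor you pick

end

fprintf(1, 'true accuracy %.2f, mean apparent accuracy of the picked predictor %.3f\n', truep, mean(apparent));
fprintf(1, 'fraction of runs where the picked predictor looks better than %.2f: %.2f\n', truep, mean(apparent > truep));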

In practice, it is not at all trivial to estimate the effect. It is entirely possible that your feature selection would select the same features in each fold as it does on the whole data set, and then there will be no bias. The effect also becomes smaller if you have many more samples than features. It might be instructive to use both ways with your data and see how the results differ.

You could also set aside some of the data (say 20%), use both your way and the correct way to get performance estimates by cross-validating on the remaining 80%, and see which performance prediction proves more accurate when you transfer your model to the 20% that was set aside. Note that for this to work, your feature selection before CV will also have to be done on just the 80% of the data; otherwise it won't simulate transferring your model to data outside your sample.
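
A sketch of that check, reusing the pure-noise setup from the accepted answer (the 20/80 split, the fold count and the variable names are arbitrary choices for illustration):

NF = 56; NC = 259; NFOLD = 10;

y = randn(NC,1)  >= 0;           % pure-noise target
x = randn(NC,NF) >= 0;           % pure-noise features, so the true error rate is 0.5

held = (1:NC)' <= round(0.2*NC); % hold out the first 20% of the cases
yh = y(held);  xh = x(held,:);   % held-out data, used only at the very end
yt = y(~held); xt = x(~held,:);  % the 80% used for feature selection and CV

Nt   = numel(yt);
part = mod(1:Nt, NFOLD) + 1;     % fold labels for the 80%
pred = zeros(size(yt));

% "your way": select the feature on all of the 80%, then cross-validate
[~, idx] = min(mean(repmat(yt,1,NF) ~= xt));
for j=1:NFOLD
   pred(part==j) = xt(part==j, idx);
end
biased = mean(pred ~= yt);

% "the correct way": re-select the feature inside each fold
for j=1:NFOLD
   [~, idx_j] = min(mean(repmat(yt(part~=j),1,NF) ~= xt(part~=j,:)));
   pred(part==j) = xt(part==j, idx_j);
end
unbiased = mean(pred ~= yt);

% transfer the model built on the 80% (feature idx) to the held-out 20%
holdout = mean(xh(:,idx) ~= yh);

fprintf(1, 'CV after selection: %f, CV with selection inside: %f, hold-out: %f\n', ...
        biased, unbiased, holdout);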

Erik
  • Could you elaborate more on the correct way of doing feature selection with your intuitive example? Thank you. – uared1776 Aug 27 '19 at 13:54