I've already used my entire dataset in a regression, should I not use that as a prediction model?

Question

At the hospital I work at we were writing a paper on what variables about a patient predict whether they'll return for a follow-up visit. We included variables such as age, gender, distance from their home to the hospital, mechanism of injury and other things like that. We had about 600 patients to examine and so we ran a multiple logistic regression with yes/no return as the outcome, and we did this with everyone in our dataset (everyone with that condition at our hospital).

Well, we wrote the paper, and then someone decided we should try to create an online prediction tool. You could put in variables about a patient, and it would return a guess about whether or not the patient would return, based on our previous regression model. To help me create an online prediction tool I've used this tutorial using R and Shiny, and I noticed the author split his data into training and testing sets.

Problem is: I never did that. Reading comments such as this I think I understand why someone would split their data, but my question now is:

What can/should I do about it?

I've already used all my data. Would it be best to delete everything I've done, go back, split the data and start over? (We didn't publish the paper or anything)
Should I just proceed? Can an argument be made for NOT splitting the dataset?

A model trained on more data is usually better than one trained on less, so training on the full data should give you the "best" model - your prediction model shouldn't need to change. The problem is that without an independent test set, you cannot estimate how good that model actually is in unseen data. – Nuclear Hoagie Oct 26 '21 at 12:53
I cannot stress @NuclearHoagie's point enough (see also [my answer](https://stats.stackexchange.com/a/549833/112493)): an unbiased estimate of the performance (meaning for example a test/train split) is the 101 of data science. I would even argue it's so common that estimating the performance in an unbiased manner (so for example using a test/train split) is not even mentioned explicitly (while *not* doing it would be). – Mayou36 Oct 26 '21 at 15:58
You might also find analysing your model's residuals and diagnostics in more depth helpful, as bootstrapping approaches can provide useful insights about overfitting (or not) – see https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html for an excellent introduction. (I'm unsure whether this deserves to be expanded into an answer!) – Landak Oct 26 '21 at 17:21

score 14 · Accepted Answer · answered Oct 25 '21 at 19:46

With so few cases, train/test splits aren't helpful. You then lose power in training the model and precision in testing it.
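To put a rough number on the lost precision, here is a minimal sketch using the usual binomial standard error of a proportion. The accuracy of 0.75 and the hold-out sizes are illustrative assumptions, not figures from the question.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimated on n_test cases."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Assumed for illustration: the true accuracy is near 0.75.
ASSUMED_ACCURACY = 0.75
for n_test in (100, 200, 600):
    se = accuracy_standard_error(ASSUMED_ACCURACY, n_test)
    low = ASSUMED_ACCURACY - 1.96 * se
    high = ASSUMED_ACCURACY + 1.96 * se
    print(f"n_test={n_test}: SE={se:.3f}, approx. 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions, a 100-patient hold-out pins the accuracy down only to within roughly 8 or 9 percentage points either way, while the model itself has been fitted on 100 fewer patients.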

What you've done so far is fine. You could go on to estimate how well the model is likely to work for prediction by repeating the modeling on multiple bootstrap samples of the data and evaluating performance of those models on the full data set. That's an accepted way to evaluate the performance of your modeling process.
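As a sketch of what that could look like, the code below implements one common variant, the optimism-corrected bootstrap: refit on each bootstrap sample, compare that model's performance on its own sample with its performance on the full data, and subtract the average difference from the apparent performance. Everything specific here is an assumption made for illustration: the synthetic data, the small gradient-descent logistic regression (a stand-in for a proper fitting routine), the 0.5 cut-off, accuracy as the metric, and the low number of resamples (kept small only so that pure Python runs quickly).

```python
import math
import random

def predict(w, x):
    """Predicted probability of returning, for feature vector x."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, steps=100, lr=0.5):
    """Plain gradient-descent logistic regression; returns [intercept, *slopes]."""
    w = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in rows:
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(rows) for wj, gj in zip(w, grad)]
    return w

def accuracy(w, rows):
    """Share of rows classified correctly at a 0.5 probability cut-off."""
    return sum((predict(w, x) >= 0.5) == (y == 1) for x, y in rows) / len(rows)

def bootstrap_validate(rows, n_boot=40, seed=0):
    """Return (apparent, optimism, optimism-corrected) accuracy."""
    rng = random.Random(seed)
    apparent = accuracy(fit_logistic(rows), rows)  # fitted and scored on the same data
    optimism = 0.0
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        w = fit_logistic(sample)  # repeat ALL modelling steps on the resample
        optimism += accuracy(w, sample) - accuracy(w, rows)
    optimism /= n_boot
    return apparent, optimism, apparent - optimism

# Synthetic stand-in data (assumption): 150 patients, 4 standardised
# predictors, only the first two related to the outcome.
data_rng = random.Random(42)
rows = []
for _ in range(150):
    x = [data_rng.gauss(0, 1) for _ in range(4)]
    p = 1.0 / (1.0 + math.exp(-(1.2 * x[0] - 0.8 * x[1])))
    rows.append((x, 1 if data_rng.random() < p else 0))

apparent, optimism, corrected = bootstrap_validate(rows)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={corrected:.3f}")
```

If the real analysis included variable selection or other tuning, those steps would have to be repeated inside the loop where `fit_logistic(sample)` is called, otherwise the optimism is underestimated.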

One caution: "whether they'll return for a follow-up visit" might not be an all-or-none result. If you deliberately restricted consideration to returning during a fixed period of time like 1 year that could be OK, but in general you might also be interested in how soon they return and you might also want to take advantage of information from individuals who haven't yet been followed up for that fixed period of time. For those sorts of things you would need to use a survival model instead of logistic regression.

EdM

Since you're mentioning the low statistics for a train/test split, what about cross-validation? wouldn't that solve the issue? – Mayou36 Oct 26 '21 at 12:25
@Mayou36 cross validation is a way to address the issue, but the lack of precision from the small test groups in each fold means you need to perform many repeated rounds of cross validation to get good estimates. Frank Harrell estimates that you need about 50 repetitions of 10-fold cross validation. Bootstrap validation, if you repeat all model-building steps including predictor selection for each bootstrap sample, is a direct, efficient validation of modeling the entire data sample. See the resampling sections of [Harrell's course notes](http://hbiostat.org/doc/rms.pdf). – EdM Oct 26 '21 at 12:43
Bootstrapping is of course a valid way, but it tends to have a smaller variance and a larger bias. In practice, I am not convinced that 50 rounds of cross validation are needed, especially without knowing the sample size (which affects your maximum possible precision). For real-world scenarios, CV should be good enough. But importantly, use either CV or bootstrapping (not just "no splitting or resampling"). – Mayou36 Oct 26 '21 at 12:58
@Mayou36 bias in bootstrap validation represents an estimate of bias introduced by the modeling procedure, so I see that as an advantage. In terms of precision of test estimates, 50 CV rounds use all the data points 50 times in test sets; 50 bootstrap samples with validation of resulting models against the full data set also use all the data points 50 times in test sets. That's how I think about the need for repeated CV, although with a large enough data set you might need fewer rounds. I agree that some type of resampling is important for validation, as implied in the second paragraph. – EdM Oct 26 '21 at 14:52
I agree, absolutely! But isn't the answer's "What you've done so far is fine." then contradictory to "I agree that some type of resampling is important for validation"? I would suggest rewriting that. – Mayou36 Oct 27 '21 at 11:29
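The accounting behind the "50 repetitions of 10-fold cross validation" comment can be sketched as follows. This only generates the index splits and counts how often each patient lands in a test fold; the function name `repeated_kfold` and the fixed seed are assumptions made for this illustration, and no model is fitted.

```python
import random
from collections import Counter

def repeated_kfold(n_samples, k=10, repeats=50, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repetition
        for fold in range(k):
            test_idx = order[fold::k]  # every k-th shuffled index forms one fold
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            yield train_idx, test_idx

test_counts = Counter()
n_fits = 0
for train_idx, test_idx in repeated_kfold(600, k=10, repeats=50):
    n_fits += 1  # one model fit would happen here
    test_counts.update(test_idx)

print(n_fits)                     # 500
print(set(test_counts.values()))  # {50}
```

Each of the 600 patients is tested 50 times, at the cost of 500 model fits on 540 patients each. Fifty bootstrap refits evaluated against the full data set also test every patient 50 times, but with 50 fits.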

score 7 · Answer 2 · answered Oct 25 '21 at 19:46

In my opinion the best course of action, if possible, is to collect more data and then use that data to check your current model, as well as maybe the top 5 of the previous models you tried.

Continuing with the child learning multiplication example from the comment you referenced: each of your models is a different child. You set up a procedure which ranks children according to how well they perform multiplication on the data they have already seen. This procedure is biased towards those who memorised the table. The child who best learned how to multiply might rank (e.g.) third from the top or even lower. So the only way to select models that will perform well outside of the data you have so far is to put them to the test using a new set of data (suitably named "testing" data).

If getting more data is impossible you can always do cross-validation. But here you will have to re-estimate your models. You can learn more about cross-validation by looking at the relevant answers on this site, but the idea is to simulate training/testing splits while still using all of the available data.

If you cannot even adjust the original analysis then the next best thing might be to select a well-enough performing model that uses the fewest variables. For example, if one model reaches 76% accuracy using 30 variables, and another one reaches 72% while using only 10, it is less likely that the smaller model has "memorized" the data. So we would expect that model's performance to hold up better on new patients.

Karolis Koncevičius

"to collect more data" is probably not helpful: to make an impact, you would need to collect something like a factor of 10 more, and that is not likely to happen. Why not use part of the already collected data and redo the analysis? – Mayou36 Oct 26 '21 at 12:24
@Mayou36 Yea realistically that just won't happen. It may take months/years to get more of these patients. – Joe Crozier Oct 26 '21 at 12:32
@Mayou36 you don't need a factor of 10 more data; you just need enough to test the performance of already established models. 100 samples might do. If you have 90% accuracy on the training set, but only get 60 out of 100 right on a new set, I bet that would be significantly below 90%. – Karolis Koncevičius Oct 26 '21 at 14:56
But then you could also just redo the analysis with a test/train split instead, right? Generating data usually takes orders of magnitude more time than doing a train/test split. – Mayou36 Oct 26 '21 at 15:54
@Mayou36 Redoing the analysis with a test/train split risks decisions being contaminated by having already seen all the data and having built a preferred model. Collecting more data is the same as saying "you have your predictive model which you are going to use on new patients, so let's see how well it does" and is the natural thing to do going forward. – Henry Oct 27 '21 at 09:51
@Henry this is true if you build the model by hand and therefore the model builder (you) has seen the data. But what OP is doing is to optimize a model (blackbox-like) on a given dataset. So using a subsample of the data to build a model again won't bias the model (as the algorithm that builds the model (fits/optimizes) won't see the full data) – Mayou36 Oct 27 '21 at 10:11
@Mayou36 I think you are too quick to proclaim that "using a subsample of the data to build a model again won't bias the model". If that were the case everyone would use their whole dataset to find a model and test it with cross-validation, without splitting into training and testing. – Karolis Koncevičius Oct 27 '21 at 16:18
So I am using the term "CV" where I should perhaps say k-fold; it's used here as an equivalent of a test/train split (or validation/train, or all three), but instead of splitting and applying once, you do that n times. The result is the same if the sample size is large enough. Not everyone uses k-fold because it takes n times as long and is not necessary if the sample size is large enough. – Mayou36 Nov 01 '21 at 11:32
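The "60 out of 100" claim in the comments above can be checked with an exact binomial tail probability. This is a minimal sketch that treats the 90% training accuracy as a fixed hypothesised value (an assumption; in practice that figure is itself an optimistic estimate), and `binom_cdf` is a helper written for this illustration.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Probability of 60 or fewer correct predictions among 100 new patients
# if the true accuracy really were 0.9.
p_value = binom_cdf(60, 100, 0.9)
print(f"P(X <= 60 | n=100, p=0.9) = {p_value:.2e}")
```

The resulting probability is vanishingly small, so 100 new patients would be enough to detect a drop of that size. A smaller drop, say from 90% to 85%, would be much harder to establish with the same 100 patients.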

score 5 · Answer 3 · answered Oct 26 '21 at 08:35

With something like a dozen variables to start with, several tries at finding which model works best, and only 600 data points on a binary outcome, you have a severe risk of overfitting. That is, your model may work very well on the data you have while its predictive power for new patients is not very good.

What splitting the data can do is give you a feel for how much of an issue that is for your specific data. I would not throw away what you have, but if you have programmed this in R it should be relatively easy to split the data and check whether you have overfitting.

So split the data randomly into say 500 patients in training and 100 in testing and then look at the following:

- Is the best model on this set of 500 patients the same as on the whole set of 600?
- How much worse does it perform on the test cases than on the training cases?
- How much better is your complicated model relative to a model that only uses the single variable that is the best predictor?

Repeat this with different random choices of the split into training and testing. The goal is to gain an understanding whether your model is only a good fit on your existing data or whether it is actually a good tool to predict the behavior of future patients.
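The repeated 500/100 split can be sketched as below. To stay dependency-free, a nearest-centroid classifier stands in for the logistic regression, the data are synthetic, and column 0 is simply assumed to be the best single predictor (with real data that choice would have to be made inside each training set). All of these are assumptions made for illustration only.

```python
import math
import random
import statistics

def fit_centroids(rows, cols):
    """Stand-in model: per-class mean of the chosen feature columns."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in rows if y == label]
        centroids[label] = [statistics.fmean([x[c] for x in members]) for c in cols]
    return centroids

def accuracy(centroids, cols, rows):
    """Share of rows whose nearest class centroid matches the true label."""
    def predict(x):
        dist = {label: sum((x[c] - m) ** 2 for c, m in zip(cols, centre))
                for label, centre in centroids.items()}
        return min(dist, key=dist.get)
    return sum(predict(x) == y for x, y in rows) / len(rows)

def repeated_holdout(rows, cols, n_test=100, repeats=100, seed=0):
    """Mean training and test accuracy over repeated random splits."""
    rng = random.Random(seed)
    train_scores, test_scores = [], []
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit_centroids(train, cols)  # fitted on the 500 only
        train_scores.append(accuracy(model, cols, train))
        test_scores.append(accuracy(model, cols, test))
    return statistics.fmean(train_scores), statistics.fmean(test_scores)

# Synthetic stand-in data (assumption): 600 patients, 12 predictors,
# only the first two related to the outcome.
data_rng = random.Random(1)
rows = []
for _ in range(600):
    x = [data_rng.gauss(0, 1) for _ in range(12)]
    p = 1.0 / (1.0 + math.exp(-(1.5 * x[0] + 0.5 * x[1])))
    rows.append((x, 1 if data_rng.random() < p else 0))

full_train, full_test = repeated_holdout(rows, cols=list(range(12)))
single_train, single_test = repeated_holdout(rows, cols=[0])
print(f"all 12 variables: train={full_train:.3f} test={full_test:.3f}")
print(f"single variable:  train={single_train:.3f} test={single_test:.3f}")
```

The gap between training and test accuracy addresses the second question in the list, and the comparison between the two printed lines addresses the third.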

Mayou36 · Answer 4 · 2021-10-26T15:50:28.120

I disagree with the consensus that this is fine. I think it's not, because I can construct a model that scores better on your data than yours does: a hash table that remembers the entries.

Performance estimation

Having a model created from data is fine; it's the first step. But this by itself is worthless: we need to assess the performance of the model (otherwise, a random model may be better).

To assess the performance of the model (and distinguish it from random guessing), we need some data.

Overfitting

Now you could just use the same data as you already used to assess the performance. But this can lead to a bias: your model could have overfitted to the data and, in an extreme case, "remember" (hash table) the data. (remark: this is indeed less of an issue with less powerful models such as logistic regression)
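The hash table extreme can be made concrete with a short sketch. The class `LookupTableModel` and the synthetic patients are invented for this illustration; the point is only that a model which memorises its training rows looks perfect on them and falls back to a majority-class guess on anything new.

```python
import random

class LookupTableModel:
    """A 'model' that memorises every training row in a hash table (dict)."""

    def fit(self, rows):
        self.table = {tuple(x): y for x, y in rows}
        labels = [y for _, y in rows]
        # Majority class, used for any row that was never seen.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return self.table.get(tuple(x), self.default)

def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)

def make_patients(n, rng):
    """Synthetic stand-in patients (assumption): outcome depends weakly on x[0]."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        rows.append((x, 1 if x[0] + rng.gauss(0, 2) > 0 else 0))
    return rows

rng = random.Random(0)
seen = make_patients(600, rng)    # data the model was built on
unseen = make_patients(600, rng)  # new patients from the same process
model = LookupTableModel().fit(seen)

train_acc = accuracy(model, seen)
new_acc = accuracy(model, unseen)
print(train_acc)  # 1.0
print(new_acc)
```

Judged only on the data it was built from, this lookup table beats any regression, which is exactly why performance on that same data says little about performance on new patients.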

Low statistics: cross validation

As mentioned, the way out is to use resampling methods that do not really reduce the sample size used: either bootstrap methods or cross-validation. They each have their own advantages and disadvantages; however, they tend to perform similarly in most real-world cases.

I would suggest you do cross-validation to get an estimate, but any of these techniques is fine.

Why you need this

For a paper, where you claim to have developed a model, it seems crucial to provide an unbiased estimate of the performance of the model (for completeness: or a very strong theoretical motivation).

I understand that this means you would need to redo some things, but that should not actually be a lot. It is also worth contacting someone who understands more about this: if you say you just tried things out a bit, there are many more pitfalls (and possible improvements) that you may encounter. Data scientists are a thing these days ;)
