At the hospital where I work, we were writing a paper on which patient variables predict whether a patient will return for a follow-up visit. We included variables such as age, gender, distance from home to the hospital, mechanism of injury, and so on. We had about 600 patients, so we ran a multiple logistic regression with return (yes/no) as the outcome, fitted on everyone in our dataset (every patient with that condition at our hospital).
We wrote the paper, and then someone decided we should try to create an online prediction tool: you enter variables about a patient, and it returns a guess about whether the patient will come back, based on our existing regression model. To help me build the tool I've been following this tutorial using R and Shiny, and I noticed the author split his data into training and testing sets.
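For concreteness, the kind of split I mean looks roughly like the sketch below. It is written in plain Python rather than the tutorial's R code, and the 70/30 ratio, the seed, and the function name `split_indices` are my own assumptions for illustration, not something taken from the tutorial. The model would be fitted on the training rows only and then checked against the held-out test rows.

```python
import random


def split_indices(n_rows, test_fraction=0.3, seed=42):
    """Randomly partition row indices 0..n_rows-1 into train and test lists.

    test_fraction and seed are illustrative defaults (assumptions),
    not values from the tutorial.
    """
    rng = random.Random(seed)          # local RNG so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)               # random order, so the split is not tied to row order
    n_test = round(n_rows * test_fraction)
    test = sorted(indices[:n_test])    # held-out rows, never used for fitting
    train = sorted(indices[n_test:])   # rows used to fit the model
    return train, test


# Roughly 600 patients, as in our dataset.
train_idx, test_idx = split_indices(600)
print(len(train_idx), len(test_idx))   # prints "420 180"
```

What we actually did corresponds to fitting on all 600 rows, with nothing held out.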
The problem is that I never did that. Reading comments such as this, I think I understand why someone would split their data, but my question now is:
What can/should I do about it?
- I've already used all my data. Would it be best to delete everything I've done, go back, split the data and start over? (We didn't publish the paper or anything)
- Should I just proceed? Can an argument be made for NOT splitting the dataset?