
I've got data from 1000 people on a score, plus a few outcome variables. The score is made up of 10 components; the total score is just the sum of the component scores.

What I'm trying to do is simplify the score so it uses fewer parameters. At the moment, I'm using a brute-force approach: using all 1000 people, I go through all the different combinations of parameters and use a performance measure to choose the best one.
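
To make the brute force concrete, this is roughly what I mean (a Python sketch only, not my actual code: the DataFrame `X` of component columns, the binary outcome `y`, and the use of AUROC as the performance measure are all illustrative assumptions):

```python
# Rough sketch only: assumes the 10 component scores are columns of a
# pandas DataFrame `X` and the binary outcome is a Series `y`;
# AUROC is used purely as an example performance measure.
from itertools import combinations

import pandas as pd
from sklearn.metrics import roc_auc_score


def best_subset(X: pd.DataFrame, y):
    """Try every non-empty combination of components and return the one
    whose unit-weight sum (i.e. just adding the components) has the
    highest AUROC for the outcome."""
    best_auc, best_combo = -1.0, None
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for combo in combinations(cols, k):
            total = X[list(combo)].sum(axis=1)  # equally weighted: plain sum
            auc = roc_auc_score(y, total)
            if auc > best_auc:
                best_auc, best_combo = auc, combo
    return best_combo, best_auc
```

With 10 components there are only 1,023 non-empty combinations, so this exhaustive search is cheap.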

What I'm concerned about is the danger of over-fitting / lack of generalizability with this approach. What I think I should do is add cross-validation, but I'm not sure if it makes sense.

At the moment I'm thinking of doing something like this ten times (a rough sketch in code follows the list):

  1. Split the dataset into test & train.
  2. Work out the best combination of parameters using the training dataset.
  3. Print to screen the performance measure for the best combination of parameters (from step 2) on the test data.
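
In code, the ten repeats would look roughly like this (again only a sketch, reusing the hypothetical `best_subset()` helper above; the 70/30 split and stratification are arbitrary illustrative choices):

```python
# Sketch of the ten repeats of steps 1-3, reusing the hypothetical
# best_subset() helper from the earlier sketch.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

test_aucs = []
for rep in range(10):
    # 1. split the dataset into train & test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=rep
    )
    # 2. work out the best combination on the training data only
    combo, _ = best_subset(X_train, y_train)
    # 3. report the performance of that combination on the test data
    test_auc = roc_auc_score(y_test, X_test[list(combo)].sum(axis=1))
    print(f"repeat {rep}: components={combo}, test AUROC={test_auc:.3f}")
    test_aucs.append(test_auc)
```

This would also show whether the chosen combination is stable across the ten repeats.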

Does that make any sense?

NotLost
  • Welcome to Cross Validated! [You might not need feature selection at all](https://stats.stackexchange.com/questions/555145/ridge-regression-for-multicollinearity-and-outliers/555163#555163), though the ridge and LASSO techniques I mention in my linked answer often get tuned using the cross validation you mention in your title (though the body of your question does not sound like cross validation). – Dave Feb 25 '22 at 16:13
  • @Dave Thank you! I did think about LASSO, but the issue is I want each of the parameters to be equally weighted (i.e. at 1) for ease of use in routine clinical care. I don't think (please correct me if I'm wrong) that's possible with LASSO or ridge? – NotLost Feb 25 '22 at 16:35
  • What do you mean that you want each parameter to have equal weight? The whole point of regression is to figure out the weights, not decide them. – Dave Feb 25 '22 at 16:49
  • You would use cross-validation on step 2 to choose your "best model" within your constraints; properly done, this should reduce over-fitting. Then use all the training data on your best model, and use that single final model in step 3 to report how well it performs on the test data – Henry Feb 25 '22 at 16:54
  • @Dave I'm not doing linear regression; it's a score for use by clinicians on the fly. I'm simply trying to work out the best combination of parameters if they're all equally weighted. – NotLost Feb 26 '22 at 15:58
  • Best in what sense? – Dave Feb 26 '22 at 16:12
  • @Dave ATM just AUROC at predicting long length of hospitalisation, but there are a few other combi metrics we're gonna use. – NotLost Feb 26 '22 at 16:39
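
A minimal sketch of how Henry's suggestion could look for this kind of unit-weight score (everything here, including the 5 folds, the helper names, and the use of AUROC, is an illustrative assumption rather than anything specified in the comments):

```python
# Sketch of Henry's suggestion as read here: choose the combination by
# cross-validation *within* the training set, then report a single AUROC
# for that one final combination on the held-out test set.
# Assumes X is a pandas DataFrame of components and y a pandas Series.
import numpy as np
from itertools import combinations

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def cv_auc(X_tr, y_tr, combo, n_splits=5):
    """Mean AUROC of the unit-weight sum of `combo`, evaluated on the
    validation folds of a stratified K-fold split of the training data."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    aucs = []
    for _, val_idx in skf.split(X_tr, y_tr):
        fold_score = X_tr.iloc[val_idx][list(combo)].sum(axis=1)
        aucs.append(roc_auc_score(y_tr.iloc[val_idx], fold_score))
    return np.mean(aucs)


# choose the combination with the best cross-validated AUROC on the training set ...
cols = list(X_train.columns)
candidates = [c for k in range(1, len(cols) + 1) for c in combinations(cols, k)]
best_combo = max(candidates, key=lambda c: cv_auc(X_train, y_train, c))

# ... then report its performance once on the untouched test set
test_auc = roc_auc_score(y_test, X_test[list(best_combo)].sum(axis=1))
print(f"selected {best_combo}, held-out test AUROC = {test_auc:.3f}")
```

Because the candidate scores have no weights to fit, the inner cross-validation here mainly shows how stable each candidate's AUROC is across folds; the protection against over-fitting comes from keeping the test set out of the selection entirely.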

0 Answers