I've got data from 1000 people on a score, plus a few outcome variables. Each score is made up of 10 components; the total score is the sum of the component scores.
What I'm trying to do is simplify the score so it uses fewer parameters. At the moment I'm using a brute-force approach: using all 1000 people, I go through all the possible combinations of parameters and use a performance measure to pick the best one.
What I'm concerned about is the danger of over-fitting and the lack of generalizability with this approach. What I think I should do is add cross-validation, but I'm not sure whether it makes sense.
At the moment I'm thinking of doing something like this ten times (rough sketch after the list):
- Split the dataset into training and test sets.
- Work out the best combination of parameters using the training set.
- Print to screen the performance measure for that best combination (from step 2) evaluated on the test data.
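Here's a rough sketch of what I have in mind in Python. The data are made up, `performance` is just a stand-in for whatever measure I'd actually optimise, and I'm assuming scikit-learn's `train_test_split` for the splitting:

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import train_test_split


def performance(scores, outcome):
    """Placeholder performance measure: absolute correlation between
    the simplified score and the outcome. Swap in the real measure."""
    return abs(np.corrcoef(scores, outcome)[0, 1])


def best_subset(X, y):
    """Brute force over every non-empty proper subset of components,
    scoring each simplified (summed) score on (X, y)."""
    best, best_perf = None, -np.inf
    n = X.shape[1]
    for k in range(1, n):  # fewer components than the full score
        for subset in combinations(range(n), k):
            perf = performance(X[:, list(subset)].sum(axis=1), y)
            if perf > best_perf:
                best, best_perf = subset, perf
    return best


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                   # fake component scores
y = X[:, :4].sum(axis=1) + rng.normal(size=1000)  # fake outcome

for rep in range(10):  # ten random splits, as in the list above
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=rep)
    subset = best_subset(X_train, y_train)  # step 2: select on train only
    test_perf = performance(X_test[:, list(subset)].sum(axis=1), y_test)  # step 3
    print(f"rep {rep}: components {subset}, test performance {test_perf:.3f}")
```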
Does that make any sense?