2

I want to predict reaction times using several personality scores. I have 9 different personality scores, and my sample consists of 23 participants. Aren't there too many predictors if I put all 9 in one model, especially given the small sample size? If not, how could I choose predictors? The theory does not single out any predictor as more important than the others. The predictors also have low intercorrelations, i.e., multicollinearity is low even with all 9 predictors in the model. One thought of mine would be to check which personality scores have a significant correlation with the reaction times and use only those in a model. What is your opinion?

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
00schneider
  • 1,202
  • 1
  • 14
  • 26
  • 2
    You are considering finding "which personality score has a significant correlation". If you do that 9 times with p=.05, you will be inflating your alpha level considerably. Also, when you put only those variables in your model you will be capitalizing on chance. That will mean that any predictions you make using your model may be inflated. For exploratory purposes, that might be ok. (But then, most anything is fine for exploratory purposes.) – Joel W. Jun 06 '14 at 22:00

2 Answers

3

By the usual rules, 9 independent variables is far too many with N = 23. The problem isn't collinearity, but overfitting. There is a rule of thumb of 10 subjects for each IV; that would tell you 2 IVs, maximum. With what you've got, I'd probably look at each IV, one at a time, and leave it at that.

Overfitting would mean that you would get p < 0.05 much more often than 1 in 20 times, even if all the data were pure random noise.
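For instance, here is a minimal simulation sketch (purely illustrative Python, using made-up noise data in place of the real scores) of how often at least one of 9 coefficients comes out "significant" at p < 0.05 when there is no signal at all:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p, reps = 23, 9, 2000
hits = 0

for _ in range(reps):
    X = rng.normal(size=(n, p))            # 9 pure-noise predictors
    y = rng.normal(size=n)                 # pure-noise outcome
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    # count replications in which at least one predictor is "significant"
    if (fit.pvalues[1:] < 0.05).any():
        hits += 1

print(hits / reps)  # roughly 0.3-0.4, far above the nominal 0.05
```

With 9 coefficient tests per model, something like a third of pure-noise datasets will hand you at least one "significant" predictor.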

I am not sure, though, how solid the results are justifying that rule of thumb. Others here may have pointers to the literature.

Peter Flom
  • 94,055
  • 35
  • 143
  • 276
  • 2
    Another rule of thumb for multiple linear regression is *n* = 50 + 8(*p*), where *n* is sample size and *p* is the number of predictor variables, so I completely agree with @Peter Flom to perform univariate analyses, and leave it at that. – Matt Reichenbach Jun 06 '14 at 14:10
  • Is overfitting an issue because of the F-test, since a model with many predictors will almost always fit better than the intercept-only model, or because there are so many t-tests for the individual predictors? – 00schneider Jun 07 '14 at 16:58
3

There is, in fact, very close to nothing you can do in this situation except get more data. The rule of thumb of 10 observations for every variable is not written on a stone tablet, but it is a reasonable guideline here. You will need at least N = 90.

The idea to check bivariate correlations and then enter only those with significant results into the model is a common one, but it is invalid.

You could run bivariate correlations and stop there, but you should use the Bonferroni correction to control for the type I error inflation associated with running 9 tests. Even if there are some real effects amongst your variables, it is unlikely you will be able to find them. In addition, the non-significant results will not meaningfully imply low to no relationship because your power will be so low. These tests almost certainly will yield no information.
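As a sketch of what that would look like (illustrative Python; the arrays here are placeholders standing in for your actual scores and reaction times), together with an approximate power calculation for r = .30 at N = 23:

```python
import numpy as np
from scipy import stats

# placeholder data: 23 participants, 9 personality scores, 1 RT outcome
rng = np.random.default_rng(1)
scores = rng.normal(size=(23, 9))
rt = rng.normal(size=23)

alpha_bonf = 0.05 / 9                      # Bonferroni-corrected threshold, ~0.0056

for j in range(scores.shape[1]):
    r, pval = stats.pearsonr(scores[:, j], rt)
    print(f"score {j}: r = {r:+.2f}, p = {pval:.3f}, keep = {pval < alpha_bonf}")

# approximate power to detect a true r = .30 with n = 23 (Fisher z approximation)
n, r_true = 23, 0.30
z = np.arctanh(r_true) * np.sqrt(n - 3)
for a in (0.05, alpha_bonf):
    crit = stats.norm.ppf(1 - a / 2)
    power = 1 - stats.norm.cdf(crit - z) + stats.norm.cdf(-crit - z)
    print(f"alpha = {a:.4f}: power ~ {power:.2f}")  # ~0.28 uncorrected, ~0.08 corrected
```

Those power figures are why I say these tests will almost certainly yield no information.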

Another possibility is to use the LASSO, but that may be too advanced. If you cannot get any more data, you may need to work with a statistical consultant.
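If you did want to try the LASSO, a minimal sketch with scikit-learn (again with placeholder arrays; the cross-validated choice of penalty is itself shaky with only 23 cases) would look something like this:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# placeholder data standing in for the 9 personality scores and reaction times
rng = np.random.default_rng(2)
X = rng.normal(size=(23, 9))
y = rng.normal(size=23)

# standardize predictors so the L1 penalty treats them comparably
X_std = StandardScaler().fit_transform(X)

# LassoCV picks the penalty strength by cross-validation;
# with N = 23, use few folds and expect the selection to be unstable
lasso = LassoCV(cv=5).fit(X_std, y)
print("chosen penalty:", lasso.alpha_)
print("coefficients:", lasso.coef_)   # many will be shrunk exactly to 0
```

Predictors whose coefficients are shrunk to exactly 0 are dropped; the rest are kept with shrunken coefficients, which guards against the overfitting described above.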

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
  • Could you explain why the idea in the second paragraph is invalid? If some IVs correlate well with the DV, and if there is no collinearity between these selected IVs, can't they be used for modelling? – Andre Silva Jun 07 '14 at 03:00
  • 1
    @AndreSilva: if the goal is to predict the 24th observation, sure, that's about as good as one can do...but if the goal is to infer general principles about the population, it's basically begging for a false positive result to arise from sampling error alone. – Nick Stauner Jun 07 '14 at 03:05
  • 1
    @AndreSilva, you can get the idea from my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). In brief, you are selecting variables based on a random variable. The estimated betas of the variables you select will be too large, & those of the variables you don't use (which in effect have their betas forced to 0) will be too small. Also, as NickStauner points out, the p-values will be invalid. – gung - Reinstate Monica Jun 07 '14 at 03:19
  • 1
    Tks @gung, I've upvoted that one (and this too). I was thinking more that one could have a coherent previous hypothesis of a relationship between the response and multiple IVs (to justify data collection). Then, using two or four of them without multicollinearity could be OK. – Andre Silva Jun 07 '14 at 03:34
  • 1
    @AndreSilva, if you have a coherent (or even not) *previous* hypothesis then you could use those two (four would be stretching it, but possible) w/o too much of a problem. The OP seems to suggest that isn't the case here, though. – gung - Reinstate Monica Jun 07 '14 at 03:40
  • Thanks for your thoughts. The personality variables were selected based on theoretical considerations. Would it be more appropriate to select the three most promising and put them in a model? Besides that, if I look in a table of critical correlations, the critical value is around .30 for 23 participants. I am looking for medium effects, so the power should be all right, shouldn't it? – 00schneider Jun 07 '14 at 16:51
  • If you selected 3 prior to looking at the data, you would probably be OK, but I still wouldn't expect much. – gung - Reinstate Monica Jun 08 '14 at 02:26
  • Sorry, but I still need more elaboration. The critical sample size for r = .30 is around twenty-something participants; therefore, I should be able to detect an effect if it exists, right? If I have three variables, I have a little less than the recommended sample size for three predictors (which would be 3*10 = 30). Why would you not expect much? If I do not choose the predictors by significant correlation (because of alpha inflation) but because I think they are the most promising, and if I am looking for at least medium effects (.30), it should be all right, in my humble opinion? – 00schneider Jun 08 '14 at 09:07