
I'm playing around with a dataset with rather few observations (~50) and several observed features. I assessed each of these features for a significant correlation with the target variable and then used the significantly correlating features to build an OLS model. However, the adj. R² of the OLS is now lower than some of the correlations between the target variable and the individual predictors. I checked the residuals and they look fine.
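For concreteness, a minimal sketch of what I'm doing (the file and column names are illustrative, not my real data; I use scipy for the correlations and statsmodels for the OLS):

```python
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.api as sm

# Illustrative setup: ~50 rows, target column "y", several feature columns.
df = pd.read_csv("data.csv")  # placeholder file name
features = [c for c in df.columns if c != "y"]

# Step 1: univariate screening -- keep features with a "significant" Pearson correlation.
selected = [c for c in features if pearsonr(df[c], df["y"])[1] < 0.001]

# Step 2: OLS on the screened features; compare adj. R^2 with the individual r values.
ols = sm.OLS(df["y"], sm.add_constant(df[selected])).fit()
print(ols.rsquared_adj)
```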

Assuming the following situation:

  • Adj. R² of the OLS is around 0.5
  • R of some of the individual predictors is up to 0.1 above that
  • All p-vals of the correlations are smaller than 0.001

My gut feeling tells me to discard the OLS and just use the single feature that correlates best with the target variable. Is there any way in which the OLS could be superior to that?

Edit, 2020-07-06: Changed R² to R for the individual predictors, as JohnnieThick pointed out.

ttreis
  • What is your goal? What does superior mean? Some people define superior based on how good of a fit it is, in which case you may already have your answer. If you are trying to understand a relationship between a covariate and outcome, then fit is not a concern, and correctly modeling what you are interested in is more important, which may include adding controls. – doubled Jul 07 '20 at 18:19
  • Your variable selection strategy is called uni-/bivariate screening and is amongst the most biased approaches for variable selection. – Michael M Jul 07 '20 at 20:12
  • @doubled: I'm trying to understand the influence of the covariates. Thanks, I'll have a look at how my controls influence the model. – ttreis Jul 08 '20 at 05:25
  • @MichaelM: I used the correlations to select variables with a linear relationship, as this is one of the assumptions for an OLS model. Or did I understand this incorrectly? – ttreis Jul 08 '20 at 05:32
  • @ttreis I tried to answer your post, not sure if this is what you were asking – doubled Jul 08 '20 at 06:05

2 Answers


Whatever you do, there will be pros and cons, and it really depends on what you want to do. However, some thoughts first on your direct question, and then some more general ideas:

Regarding your approach, the logic only works if you are restricted to selecting exactly one regressor variable. If you can select more, as in a multiple OLS regression, it falls apart. The key reason is that a linear combination of regressors that each correlate only weakly with the outcome may jointly have a larger correlation with the outcome than a combination of regressors that individually correlate strongly with it. Multiple regression is about linear combinations of regressors, not just individual effects.
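As a toy illustration (simulated numbers, nothing to do with your data): a "suppressor" variable with essentially zero marginal correlation can still push the joint fit far above what the best single predictor achieves.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
signal = rng.normal(size=n)
noise = rng.normal(size=n)

y = signal
x1 = signal + 2 * noise   # individually only a modest correlation with y
x2 = noise                # individually ~zero correlation with y

print(np.corrcoef(y, x1)[0, 1])  # roughly 0.45
print(np.corrcoef(y, x2)[0, 1])  # roughly 0

# Jointly, x1 - 2*x2 reconstructs the signal, so the fit is near perfect.
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.rsquared_adj)          # close to 1
```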

Additionally, below are some thoughts perhaps worth exploring:

  1. I strongly disagree with the other answer's suggestion of keeping variables based on p-values. For a start on why you shouldn't do that, check out this CV post.

  2. More generally, you seem to want to do model selection, and your current approach is a poor way of going about it. As noted in the comments by Michael M, you're basically doing univariate screening, and it's just not good. Additionally, looking at $R^2$ may not be the best idea if you care about model selection, because $R^2$ only measures how much variance a model explains, and you may also care about parsimony and other factors.

  3. Specific advice on model selection is difficult because it depends on what you want to do and what you can do. Check out this CV post on some approaches. In particular, I'd like to highlight a comment in that post by gung, who writes (in reference to other comments/answers in that post):

Cross validation (as Nick Sabbe discusses), penalized methods (Dikran Marsupial), or choosing variables based on prior theory (Michelle) are all options. But note that variable selection is intrinsically a very difficult task. To understand why it is so potentially fraught, it may help to read my answer here: algorithms-for-automatic-model-selection [this link no longer exists]. Lastly, it's worth recognizing the problem is w/ the logical structure of this activity, not whether the computer does it for you automatically, or you do it manually for yourself.

  4. I would probably recommend you check out lasso regularization, and this post has some great info about it (and here's a follow-up to that post); see the sketch after this list.

  5. Finally, you may also want to explore (pun intended) the concept of exploratory data analysis.
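Regarding point 4, here is a minimal sketch of what a lasso-based alternative to univariate screening could look like, using scikit-learn's `LassoCV` on simulated placeholder data (your own pipeline may of course differ):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                       # placeholder for your feature matrix
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=50)  # placeholder target

# Standardize, then let cross-validation pick the penalty strength.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)

# Coefficients shrunk exactly to zero are effectively dropped -- no univariate screening needed.
print(model.named_steps["lassocv"].coef_)
```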

doubled
  • Thank you for your very thorough answer. Especially point 4 helped me out, since user jokel seemed to have a similar problem. I'll just add his follow-up question here in which he further details his approach using Lasso: https://stats.stackexchange.com/questions/34859/how-to-present-results-of-a-lasso-using-glmnet – ttreis Jul 08 '20 at 07:09
  • Oh neat, I edited my post to include that link. Glad it helped-- best of luck! – doubled Jul 08 '20 at 07:17
  • @doubled even though I agree with most of your post, I would ask you to be a bit less dogmatic, since model-building is effectively an art. The correct approach always depends on the problem and the data available. Re p-values, the story is very different if we have orthogonal variables and n.i.d. errors vs having highly correlated "independent" variables, heteroscedasticity etc...I agree that a backward-elimination approach based on p-values will almost certainly overfit, but it is hard to argue against significance statistics being meaningful in understanding the influence of predictors. – JohnnieThick Jul 08 '20 at 10:33

Originally you refer to correlations and then to the $R^2$ of a regression. If you are comparing the correlation coefficient $\rho$ of each variable with the target against the $R^2$ you get in a regression, these are different things: if you regress the dependent variable on that single variable alone, $R^2 = \rho^2$, and since $\rho \in [-1, 1]$, $|\rho|$ will be at least as large as that $R^2$.
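A quick simulated check of this relationship (for a single regressor with an intercept, the regression $R^2$ equals $\rho^2$):

```python
import numpy as np
from scipy.stats import pearsonr
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(size=50)

rho = pearsonr(x, y)[0]
r2 = sm.OLS(y, sm.add_constant(x)).fit().rsquared

print(rho**2, r2)  # identical up to floating-point error
```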

In any case, to decide which variables to keep in your model, look at the p-values of the regression when all variables are included, not at those from the correlation matrix. Of course, this is only one of the very many things to consider in variable selection; others include the increase in goodness of fit attributable to each variable, the correlations between variables, interactions, the functional form and consistency of the various relationships, the risk of over-fitting, etc.
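For instance, with statsmodels (placeholder data; the point is only that each p-value is conditional on the other regressors being in the model):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(50, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 * X["x1"] + rng.normal(size=50)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.pvalues)    # per-coefficient p-values, each conditional on the other regressors
print(fit.summary())  # also reports adjusted R^2, AIC, BIC, etc.
```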

The answer above provides several interesting pointers on variable selection, and it is surely useful to read about different approaches. However, in my opinion, model-building is an art and there is no single recipe that works for all problems. Hence, one needs to understand the problem they are dealing with very deeply, understand any theory behind the data, understand the limitations of the given data, and then consider how best to quantify the problem and weigh the different trade-offs. My point is that there may be a best-suited approach for several problems, but it is your job to see how those problems relate to the one you want to solve. To do that successfully, you need a deep theoretical understanding of your problem and a lot of experience in playing with data.

  • Ahhhhhhh, totally overlooked that since the linregress function of scipy.stats also calls it "rvalue" in its documentation. I checked and this is indeed the Pearson coefficient. Thanks for pointing that out! Since OLS performs best when its assumptions are met, I wanted to satisfy the "should have a linear correlation with outcome variable" constraint by checking for a significant linear correlation beforehand. Is there a better method for this? – ttreis Jul 08 '20 at 05:38
  • Please note that when you have many variables, what matters is how variable $k$ affects the dependent variable on top of the other $k-1$, not how it affects it in isolation. Hence, it could be that a variable is useless after including another variable that is highly correlated with it and explains the outcome better (omitted variable bias), or that a variable that does not matter on its own does matter once another variable is considered. In general, particularly if you have many variables and a small sample, I would advise you to think a lot about what makes sense in theory. – JohnnieThick Jul 08 '20 at 10:59