
I have some data: the outcome is satisfaction, and there are four predictors, three continuous (age, weight, height) and one factor (graduated high school or not).

In R, I have loaded the data set and set $X1$ for age, $X2$ for weight, $X3$ for the factor, and $X4$ for height.

I want to know if there is evidence that graduating high school has an effect on satisfaction.

I know that I cannot simply look at `lm(y~x3)`, because I need to consider all the other possibilities. So how do I take all of these into account? How many models must I check? What is the general approach to this?

Also, would I need to consider any and all possible interactions?

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.506  -5.096   1.306   4.738  28.722 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 140.1689     8.3191  16.849 2.77e-13
x1           -1.1428     0.1904  -6.002 7.22e-06
x2           -0.4699     0.1866  -2.518   0.0204
x3yes         2.2259     4.1402   0.538   0.5968
x4            1.2673     1.4922   0.849   0.4058

Residual standard error: 9.921 on 20 degrees of freedom
Multiple R-squared:  0.8183,    Adjusted R-squared:  0.7819 
F-statistic: 22.51 on 4 and 20 DF,  p-value: 3.611e-07
Quality
  • Please fit a complete model - `lm(y~x1+x2+x3+x4)`, or just `lm(y~.)` - and add the output to your question. – Yuval Spiegler Nov 27 '16 at 19:22
  • @YuvalSp Can you please explain what you would like me to do? I don't understand the comment. – Quality Nov 27 '16 at 19:37
  • Fit the regression model with all the covariates and show the model summary: `summary(lm(y~x1+x2+x3+x4, data=dataframe))`, where dataframe is your data frame name. Then copy the resulting output and insert it into your question. Then we can help explain the output and try to help you. It's much easier when we can see the actual data. – Yuval Spiegler Nov 27 '16 at 19:52
  • Instead of `x1`, `x2`, `x3`, and `x4`, why not call the variables `age`, `weight`, `height`, and `graduated`? – Matthew Drury Nov 27 '16 at 22:39

2 Answers


The most important thing is to check that the model makes sense. You have fit a linear model with three continuous predictors, so first make sure a linear fit is appropriate. Look at scatterplots of age, height, and weight against y, and adjust how these predictors enter the model if needed.
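A quick way to run these checks, sketched on simulated stand-in data (the original data set is not available, so the variable names `age`, `weight`, `height`, and `satisfaction` are assumptions):

```r
# Simulated stand-in for the real data (assumed variable names).
set.seed(1)
n <- 25
age          <- rnorm(n, 40, 10)
weight       <- rnorm(n, 70, 10)
height       <- rnorm(n, 170, 8)
satisfaction <- 140 - 1.1 * age - 0.5 * weight + rnorm(n, 0, 10)

# One scatterplot per continuous predictor; look for curvature
# or other departures from a straight-line relationship.
par(mfrow = c(1, 3))
plot(age, satisfaction)
plot(weight, satisfaction)
plot(height, satisfaction)
```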

Assuming fitting these predictors linearly is reasonable, fitting the full model with all four predictors is a sensible thing to do.

You have only 25 data points. If you go on a long search through the space of all models (adding and removing variables), you run an extremely high risk of false positives. So I don't think there is much need to backwards-select variables; if you wish to do so anyway, use cross-validation to confirm that it improves the fit of the model to unseen data.
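If you do want to compare a reduced model against the full one, leave-one-out cross-validation is cheap with only 25 points. A sketch on simulated stand-in data (the helper `loocv_mse` is illustrative, not from the original post):

```r
# Simulated data mimicking the structure described in the question.
set.seed(2)
n <- 25
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x4 = rnorm(n),
                x3 = factor(sample(c("no", "yes"), n, replace = TRUE)))
d$y <- 140 - 1.1 * d$x1 - 0.5 * d$x2 + rnorm(n, 0, 10)

# Leave-one-out cross-validated mean squared error for a given formula.
loocv_mse <- function(formula, data) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$y[i] - predict(fit, newdata = data[i, ])
  })
  mean(errs^2)
}

loocv_mse(y ~ x1 + x2 + x3 + x4, d)  # full model
loocv_mse(y ~ x1 + x2 + x4, d)       # without the factor
```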

The same applies to a search for interactions: you have little data, and you are running a large risk of false positives.

If you wish to make inferences using the estimated confidence intervals, you should additionally check a plot of the residuals versus the fitted values of the model and make sure you do not see any patterns. You're looking to see whether the residuals could have been drawn from a normal distribution with constant variance. If this looks reasonably consistent with your data, then you can make inferences about the graduation parameter using the linear model:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 140.1689     8.3191  16.849 2.77e-13
x1           -1.1428     0.1904  -6.002 7.22e-06
x2           -0.4699     0.1866  -2.518   0.0204
x3yes         2.2259     4.1402   0.538   0.5968
x4            1.2673     1.4922   0.849   0.4058

The x3 variable measures graduation, and its parameter lies well within the error of its estimation. So, given that everything above checks out, the data you used to train the model is not inconsistent with the effect of graduation being indistinguishable from noise.
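The residual checks described above can be sketched as follows (again on simulated stand-in data, since the original data is not available):

```r
# Fit the full model on simulated data with the question's structure.
set.seed(3)
n <- 25
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x4 = rnorm(n),
                x3 = factor(sample(c("no", "yes"), n, replace = TRUE)))
d$y <- 140 - 1.1 * d$x1 - 0.5 * d$x2 + rnorm(n, 0, 10)
fit <- lm(y ~ x1 + x2 + x3 + x4, data = d)

# Residuals vs. fitted values: look for funnels or curves.
plot(fitted(fit), resid(fit))
abline(h = 0, lty = 2)

# Rough normality check on the residuals.
qqnorm(resid(fit))
qqline(resid(fit))
```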

> Thanks, so are we really able to judge this just from fitting the full model?

As long as all the caveats are met, I do think the best way to go about this is to fit the full model and make your inference from that. Like I said, any inference you draw from a model chosen by variable selection is likely to be an artifact of chance.

Another way to think about this: if you run a variable selection algorithm, the standard errors reported in the final model are no longer correct; they are actually much larger than what is reported. To estimate the true standard errors of the parameter estimates under a selection-plus-fitting procedure, you would need either nested cross-validation or a bootstrap combined with cross-validation. This would spread your data very, very thin and incur a lot of variance (you are making lots of decisions, each of which has a chance to be wrong). Your standard errors would be enormous.
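To get a feel for this, one can bootstrap the whole selection-plus-fit procedure. The sketch below uses simulated data, with `step()` standing in for whatever selection algorithm one might use, and records how the `x3yes` coefficient bounces around once selection is part of the pipeline:

```r
# Simulated data with the question's structure.
set.seed(4)
n <- 25
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x4 = rnorm(n),
                x3 = factor(sample(c("no", "yes"), n, replace = TRUE)))
d$y <- 140 - 1.1 * d$x1 - 0.5 * d$x2 + rnorm(n, 0, 10)

# Re-run stepwise selection on each bootstrap resample and record the
# x3 coefficient (0 when the variable is selected out entirely).
boot_coef <- replicate(200, {
  db  <- d[sample(n, replace = TRUE), ]
  fit <- step(lm(y ~ x1 + x2 + x3 + x4, data = db), trace = 0)
  if ("x3yes" %in% names(coef(fit))) coef(fit)[["x3yes"]] else 0
})
sd(boot_coef)  # spread of the x3 effect *under selection*
```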

Matthew Drury
  • Sorry, I am a bit confused by the last few lines. Are you saying that graduation is not needed? I can post more code too, which should I include? – Quality Nov 27 '16 at 23:12
  • I believe you said in the question that "I want to know if there is evidence that graduating high school has an effect on satisfaction." That's what I'm getting at there. – Matthew Drury Nov 27 '16 at 23:15
  • Thanks, so are we really able to judge this just from fitting the full model? – Quality Nov 28 '16 at 00:08
  • Also, x3yes is within the error of estimation, but isn't the intercept x3 no, for which we can't reject the null? – Quality Nov 29 '16 at 01:20
  • I edited in an answer to your first question. I'm not sure what your second comment means; have I misinterpreted the meaning of `x3$yes` (I think it is an indicator meaning "has graduated")? – Matthew Drury Nov 29 '16 at 06:04
  • Yes it is, but isn't what is shown for the intercept x3$no? – Quality Nov 29 '16 at 06:26
  • Not quite. What is shown for the parameter `x3$yes` is the difference in `satisfaction` between an individual who did graduate and an individual who didn't, holding all their other attributes constant (the value of the intercept is generally not interpretable). Therefore, testing whether the parameter estimate for `x3$yes` is non-zero is the same as testing whether the model detects a difference in satisfaction between individuals who did and did not graduate that is unlikely to have occurred by chance. – Matthew Drury Nov 29 '16 at 06:32
  • Thanks, very helpful in making me understand how to interpret the data. – Quality Nov 29 '16 at 06:33
  • You're very welcome. – Matthew Drury Nov 29 '16 at 06:34

Because there can be dependencies between the predictor variables, it is possible that, say, X1 looks significant when X2 is left out, but, because X1 and X2 are highly dependent, X1 may appear non-significant when X2 is included in the model. With four predictor variables there are 2^4 - 1 = 15 possible non-empty models. As this is only 15, it is not difficult to look at all subsets. If the number of variables were much larger, a step-wise approach should be adequate. If possible, pick a model where all the coefficients are significant, and if you have two highly correlated variables, make sure that one is excluded.
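Enumerating all 15 subsets is straightforward in R. A sketch on simulated stand-in data, comparing the subsets by AIC (whether this is good modeling practice at all is contested; see the comments on this answer):

```r
# Simulated data with the question's structure.
set.seed(5)
n <- 25
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x4 = rnorm(n),
                x3 = factor(sample(c("no", "yes"), n, replace = TRUE)))
d$y <- 140 - 1.1 * d$x1 - 0.5 * d$x2 + rnorm(n, 0, 10)

vars <- c("x1", "x2", "x3", "x4")
# All 2^4 - 1 = 15 non-empty subsets of the predictors.
subsets <- unlist(lapply(1:4, function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
aics <- sapply(subsets, function(v)
  AIC(lm(reformulate(v, response = "y"), data = d)))

length(subsets)             # 15 models
subsets[[which.min(aics)]]  # predictors in the lowest-AIC model
```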

Michael R. Chernick
  • Thanks, I am still having trouble understanding the output though. For example, say I just do `lm(y~x3)`; the factor levels are no and yes (no is the baseline, by alphabetical order). When I do this I get output: the estimate for B0 is, say, 70 with a large t value, and then the next row is x3yes with an estimate of -5 and a p-value small enough to reject, so we reject that the B1 contrast is 0, but I'm not sure what that is really saying. – Quality Nov 26 '16 at 17:36
  • Also, I know there are 15 possible models, but do we not need to consider any sort of interaction? – Quality Nov 26 '16 at 17:49
  • How is @Quality to choose one of these 15 possible models? – Matthew Drury Nov 27 '16 at 22:40
  • Matthew, I have edited my answer to suggest how the model should be chosen. – Michael R. Chernick Nov 28 '16 at 17:07
  • 'Try them all, & pick the one where the coefficients are significant' seems like poor modeling advice, IMO. – gung - Reinstate Monica Nov 28 '16 at 17:09
  • @gung: Why do you say that? Since there are only 15 models, it is fairly easy to look at them all. This is better than having to use forward, backward, or stepwise selection, which might not consider all models. Generally, variables with non-significant coefficients do not help prediction and hence should be left out, barring some subject-matter reason. – Michael R. Chernick Nov 28 '16 at 17:47
  • See my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). Whether it is 'easy' or 'hard' to look at all possible models is irrelevant. NB, at the link from which I copied the list of problems, it says, '"All possible subsets" regression solves none of these problems'. – gung - Reinstate Monica Nov 28 '16 at 18:33
  • @gung The number of possible models is not irrelevant. When the number of models is very large, it is not easy to look at them all. When you apply forward selection, backward selection, or stepwise selection, it is possible to miss the best model. I am not saying to look at all models and pick the one with the highest R-squared; that would probably be unwise. Adjusted R-squared or AIC might be better to look at. Why do you say that looking at all subsets is not helpful? – Michael R. Chernick Nov 28 '16 at 19:35
  • I explain that in my linked answer. NB, the statement about all subsets comes from the link (at the link), & is from Frank Harrell. You will want to read the conversation with probabilityislogic in the comments below the answer. He was right & I was wrong. – gung - Reinstate Monica Nov 28 '16 at 19:41
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/49254/discussion-between-michael-chernick-and-gung). – Michael R. Chernick Nov 28 '16 at 19:47
  • I'm not really interested in having a lengthy discussion here. I've made the points I wanted to make, & I've pointed to where they are explained at length. – gung - Reinstate Monica Nov 28 '16 at 19:59