learning R-understanding calculations for specific variable

Question

Can someone clarify why, in the example below (taken from a related question by @MsSnowy), the prediction for a female uses the new calculation rather than the original lm? The original model was:

fitted.model <- lm(spending ~ sex + status + income + verbal, data=spending)

my results were as follows:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  22.55565   17.19680   1.312   0.1968
sex         -22.11833    8.21111  -2.694   0.0101 *
status        0.05223    0.28111   0.186   0.8535
income        4.96198    1.02539   4.839 1.79e-05 ***
verbal       -2.95949    2.17215  -1.362   0.1803

Now, the new model uses sex as the only predictor (the other predictors are described as held constant):

mydata<-lm(spending ~ sex, data=spending)

and the result was:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   29.775      5.498   5.415 2.28e-06 ***
sex          -25.909      8.648  -2.996  0.00444 **

Question: Why isn't the prediction that females spend 22.118 less than males (from the first lm), rather than 25.909 less (the coefficient from the new lm)?

Could someone please clarify? I would have thought the first lm should be used for prediction.

This question has nothing to do with R and nothing to do with calculations - you would get the same result in any other program. The question is really about understanding regression. — Peter Flom, Sep 17 '12 at 16:58
Michelle, your question is grammatically incorrect and therefore difficult (or impossible) to understand. Could you please clarify it? — whuber, Sep 18 '12 at 16:46

score 2 · Answer 1 · answered Sep 17 '12 at 16:40

Without the data it is hard to tell exactly, but it is most likely that there is some relationship between sex and at least one of the other 2 variables.

For example, if females in the dataset have a lower income on average than males, then we would expect to see the above. In the first model you are looking at the effect of sex above and beyond the effects of income and verbal. In the second model you are looking only at sex, so any information that would have come from income gets merged into the sex effect. The first model suggests that, for a male and a female with the same income and verbal, the female will spend on average 22 less. In the second model you don't include the information on income and verbal, so we cannot compare a male and a female who are the same on income and verbal, just the averages across all males and all females. So the amount by which females spend less than males includes the 22 from above, but also includes the lower spending due to lower average income (or differences in verbal, but it is less clear what that measure is and what the likely differences are).
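As a rough illustration of the paragraph above, here is a small simulation sketch. The data are synthetic, and the names (`sim`, `female`) and the effect sizes are assumptions chosen for illustration; this is not the asker's dataset.

```r
# Synthetic data only: every number below is made up for illustration.
set.seed(1)
n <- 2000
female <- rbinom(n, 1, 0.5)               # 1 = female, 0 = male
income <- 10 - 2 * female + rnorm(n)      # assumed: females earn less on average
spending <- 20 - 22 * female + 5 * income + rnorm(n, sd = 5)
sim <- data.frame(spending, female, income)

full  <- lm(spending ~ female + income, data = sim)  # sex effect at equal income
small <- lm(spending ~ female, data = sim)           # sex effect with income omitted

coef(full)["female"]   # roughly -22: the effect of sex at a fixed income
coef(small)["female"]  # roughly -32: the -22 plus 5 * (-2) carried through income

# With a single binary predictor, the slope is just the difference in group means
diff(tapply(sim$spending, sim$female, mean))
```

The smaller model is not wrong; it answers a different question (the raw average difference between the sexes) from the larger one (the difference at equal income).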

The only time that two models with different sets of predictors will give the same slope estimates is when the predictors are perfectly orthogonal.
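The orthogonal case can be sketched the same way. This assumes a deliberately balanced synthetic design in which both sexes get exactly the same set of incomes, so the sample correlation between the two predictors is zero; the names and numbers are again illustrative assumptions.

```r
# Synthetic, balanced design (an assumption for illustration): the same
# incomes appear in both groups, so female and income are uncorrelated.
set.seed(2)
female <- rep(0:1, each = 50)
income <- rep(seq(1, 10, length.out = 50), times = 2)
spending <- 20 - 22 * female + 5 * income + rnorm(100, sd = 5)

cor(female, income)   # zero, up to floating-point rounding

b_full  <- coef(lm(spending ~ female + income))["female"]
b_small <- coef(lm(spending ~ female))["female"]
all.equal(b_full, b_small)   # the two slope estimates should agree
```

Here dropping income leaves the slope for sex unchanged, although its standard error will generally be larger because income no longer absorbs part of the residual variance.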

Greg Snow

+1 for a plausible analysis and agreement with my answer to the previous question. Would you agree with Michelle that the first model is better than the second? I think a model with just income and sex would be better than either of the two above (verbal and status don't appear to add anything to explain y given that sex and income are included). – Michael R. Chernick Sep 17 '12 at 16:49
@MichaelChernick, oops I missed status. If verbal and status are correlated then it could be that a model including one of them would be better than without either, or the model with just sex and income may be best. Certainly the 1st model above is better than the 2nd for prediction if you have all the variables. I would lean towards model averaging or a penalized model rather than choosing a set of predictors, fitting the model and calling it "Best". – Greg Snow Sep 17 '12 at 16:56
Reading the question she posted, it appears, from what I can see, that she held the other variables (status, income, verbal) constant and was predicting spending for a female and a male. So because the others are constant (assume 0), we only have sex, which gives the second lm's -25.909. – jerry Sep 17 '12 at 17:05
@GregSnow All good ideas. I am not sure exactly where I would stand regarding a final model. I would want to look at the other cases. A penalty for parameters like with AIC or BIC is always a good thing to look at. I know ensemble averaging (boosting) works well with classification (particularly tree classifiers), but I am not sure about regression, especially when the list of covariates in this example is not very large. – Michael R. Chernick Sep 17 '12 at 17:17
@Whuber What I wanted to know is why the coefficient of the second linear model, instead of the original linear model, is used to determine that females spend less than males. Is it because the 2nd lm only uses sex and the others were held constant? – jerry Sep 19 '12 at 13:12