10

Say I build a linear regression model to identify linear dependencies between variables in my data. Some of these variables are categorical variables.

  1. If I want to evaluate the contribution of a given predictor, how do I do it? Can I compare the coefficients directly? I read in the answers that the `|t|` value gives a sense of the strength of a predictor; how does that work exactly?

  2. I understand that for a categorical variable with K levels, only K-1 dummy variables are created, and that this is standard practice to avoid obvious multicollinearity. But how can I still identify the contribution associated with the level (predictor) that was dropped?

Here is the model:

import statsmodels.formula.api as smf
mod = smf.ols('dependent ~ first_category + second_category + budget + object_price + hour', data=df).fit()

And the output:

mod.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:              dependent   R-squared:                       0.227
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     261.7
Date:                Thu, 04 Sep 2014   Prob (F-statistic):               0.00
Time:                        14:59:24   Log-Likelihood:                -86099.
No. Observations:               17866   AIC:                         1.722e+05
Df Residuals:                   17845   BIC:                         1.724e+05
Df Model:                          20                                         
===========================================================================================
                              coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------
Intercept                    27.6888      1.017     27.235      0.000        25.696    29.682
first_category[T.o]          -1.3250      0.848     -1.562      0.118        -2.987     0.337
first_category[T.v]         -10.4557      1.125     -9.294      0.000       -12.661    -8.251
second_category[T.SL0004]    21.9987      0.808     27.213      0.000        20.414    23.583
second_category[T.SL0005]    -2.3710      2.458     -0.965      0.335        -7.188     2.446
second_category[T.SL0006]     7.2716      3.609      2.015      0.044         0.197    14.346
second_category[T.SL0007]    20.1545      1.495     13.482      0.000        17.224    23.085
second_category[T.SL0008]    13.3333      0.794     16.788      0.000        11.777    14.890
second_category[T.SL0009]    18.5605      2.189      8.478      0.000        14.270    22.851
second_category[T.SL0010]     6.7351      1.158      5.817      0.000         4.465     9.005
second_category[T.SL0011]     2.6791      0.689      3.888      0.000         1.329     4.030
second_category[T.SL0012]    -0.8159      3.811     -0.214      0.830        -8.285     6.654
second_category[T.SL0014]     8.2550     11.359      0.727      0.467       -14.010    30.520
second_category[T.SL0016]     1.6220      1.229      1.320      0.187        -0.787     4.031
second_category[T.SL0017]   -14.3253      2.642     -5.422      0.000       -19.504    -9.147
second_category[T.SL0018]     1.4823      3.193      0.464      0.643        -4.777     7.741
second_category[T.SL0019]    20.0228      2.850      7.024      0.000        14.436    25.610
second_category[T.SL0020]   -11.7478      8.691     -1.352      0.176       -28.782     5.287
budget                       -0.5682      0.014    -40.828      0.000        -0.595    -0.541
object_price                  0.0037      0.000     33.192      0.000         0.003     0.004
hour                         -0.9244      0.040    -23.244      0.000        -1.002    -0.846
==============================================================================
Omnibus:                     2997.054   Durbin-Watson:                   1.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4758.803
Skew:                           1.183   Prob(JB):                         0.00
Kurtosis:                       3.892   Cond. No.                     1.59e+05
==============================================================================

Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Amelio Vazquez-Reina

1 Answer

15

Some quick answers...

  1. It means basically nothing to compare the values of the regression coefficients unless the predictors are standardized and the model is specified correctly, especially when the predictors are inter-correlated (which is definitely the case here: look at the warning at the bottom of the output). Just see what happens to the coefficients if you drop one of the predictors from the model. Chances are, one or more of them changes radically, possibly even changing sign. Generally speaking, the coefficients tell you about the additional contribution of each variable, given the others in the model.

    You can assess the strength of each variable's contribution by the absolute value of its $t$ statistic. The one with the greatest $\lvert t\rvert$ makes the greatest contribution. It can be shown that the $t$ statistic, squared, is equal to the $F$ statistic based on the model-reduction principle, whereby you remove one predictor from the model and measure how much the $SSE$ increases. If it increases a lot, then that predictor must be pretty important, because including it accounts for a lot of otherwise unexplained variation. The $F$ statistics are all proportional to those $SSE$ changes, so the one with the biggest $|t|=\sqrt{t^2} = \sqrt{F}$ is the one that makes the most difference (a short code sketch of this check appears after this list).

  2. You haven't dropped anything; you have just chosen a parameterization. You will obtain exactly the same predictions regardless of which indicator is dropped. The interpretation of each regression coefficient is that it is the amount by which the prediction changes from the prediction obtained for the category whose indicator was dropped.

    To get a better idea of relative weights, I suggest using, instead of the $k-1$ indicators, the variables $x_1=I_1-I_k,\ x_2=I_2-I_k,\ \ldots,\ x_{k-1}=I_{k-1}-I_k$, where the $I_i$ are the indicators. The coefficient $b_i$ of $x_i$ is then an estimate of the effect of the $i$th category minus the average of all $k$ of them; and you can obtain the analogous effect for the $k$th level from the fact that $b_1+b_2+\cdots+b_k=0$, thus $b_k=-(b_1+b_2+\cdots+b_{k-1})$. The variables $x_i$ are called sum-to-zero contrasts (in R, you get them using "contr.sum", but it doesn't look like that's what you're using; see the sketch after this list for the patsy equivalent).
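
To make point 1 concrete, here is a minimal statsmodels sketch, assuming a DataFrame `df` whose column names match the summary output in the question. It checks the $t^2 = F$ relationship by refitting the model without one predictor and F-testing the reduction against the full fit:

    import statsmodels.formula.api as smf

    # `df` and the column names are assumed to match the summary output above
    full = smf.ols('dependent ~ first_category + second_category'
                   ' + budget + object_price + hour', data=df).fit()

    # same model with `hour` removed; F-test the reduction (how much SSE grows)
    reduced = smf.ols('dependent ~ first_category + second_category'
                      ' + budget + object_price', data=df).fit()
    f_value, p_value, df_diff = full.compare_f_test(reduced)

    print(f_value)                    # F for dropping `hour`
    print(full.tvalues['hour'] ** 2)  # same number: t^2 = F for a 1-df reduction

And for point 2, patsy (which statsmodels formulas use) provides sum-to-zero coding directly via `C(variable, Sum)`, so a sketch of the suggested reparameterization might look like the following. Each coefficient is then a category effect relative to the mean of all category effects, and the omitted level's effect is minus the sum of the others:

    # patsy's Sum coding gives the sum-to-zero contrasts described above
    mod_sum = smf.ols('dependent ~ C(first_category, Sum) + C(second_category, Sum)'
                      ' + budget + object_price + hour', data=df).fit()

    # category effects relative to the mean of all category effects
    sum_coefs = mod_sum.params.filter(like='C(second_category, Sum)')
    print(sum_coefs)

    # the omitted level's effect is minus the sum of the shown ones
    print(-sum_coefs.sum())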

Russ Lenth
  • Thanks. Assuming that I change my model/variables and the multi-collinearity problem goes away, is there any way to rank predictors (see which one correlates best with the dependent variable) from looking at the results? In logistic regression the coefficient is indicative of this correlation. Is this not the case in linear regression? – Amelio Vazquez-Reina Sep 04 '14 at 21:34
  • 3
    FWIW, I believe the OP is using Python w/ the `statsmodels` package. – gung - Reinstate Monica Sep 04 '14 at 23:02
  • 2
    I agree with the answer and upvoted. However, I might not say that the coefficients mean nothing. Many packages automatically produce standardized coefficients in the belief that they mean something. Certainly the interpretation depends on a correctly specified model. – charles Sep 05 '14 at 00:05
  • 2
    @user023472, the coefficients tell you about the additional contribution of each variable, given the others in the model. They work together. That's true in logistic regression too, so maybe you need to check back on some examples where you thought you could interpret the coefficients as correlations. You can assess the strength of each variable's contribution by the absolute value of each $t$ statistic. The one with the greatest $t$ makes the greatest contribution. – Russ Lenth Sep 05 '14 at 01:20
  • 2
    Thanks @RussLenth -- This is great. When you say the one with the greatest `|t|` makes the greatest contribution, I almost understand it. I know that `t` measures the departure of the coefficient from its null-hypothesis value (would this null-hypothesis value be 0?). I am also familiar with t tests in general (tests for location parameters), but why is it that a greater `|t|` is indicative of contribution? How do you make that last leap? – Amelio Vazquez-Reina Sep 05 '14 at 01:30
  • 1
    It works because it can be shown that the $t$ statistic, squared, is equal to the $F$ statistic based on the model-reduction principle, whereby you remove one predictor from the model and measure how much the $SSE$ increases. If it increases a lot, then that predictor must be pretty important because including it accounts for a lot of unexplained variation. The $F$ statistics are all proportional to those $SSE$ changes, so the one with the biggest $|t|=\sqrt{t^2}=\sqrt{F}$ is the one that makes the most difference (a one-line ranking sketch follows this thread). – Russ Lenth Sep 05 '14 at 01:41
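
Following up on the comments above, a minimal sketch of ranking predictors by `|t|` from the fitted result (using the `mod` object from the question; note that this ranks individual coefficients, so each dummy of a categorical variable is ranked on its own):

    # largest |t| first; drop the intercept since it is not a predictor
    print(mod.tvalues.drop('Intercept').abs().sort_values(ascending=False))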