
If in a linear regression I have a categorical variable ... how do I know the statistical significance of the categorical variable?

Let's say the factor $X_1$ has 10 levels ... there will be 10 different resultant t-values, under the umbrella of one factor variable $X_1$ ...

It seems to me that the statistical significance is tested for each level of the factor variable? No?

@Macro: Following your suggestion, I've built the following example:

It seems that x3 is useful and must be included in the model, judging from the model comparison below.

But actually that's wrong ...

    n  <- 100
    x1 <- 1:n
    x2 <- (1:n)^2
    x3 <- rnorm(n)
    ee <- rnorm(n)
    y  <- 3*x1 - 2*x2 + x3 + 3 + ee   # x3 is part of the true model

    lm1 <- lm(y ~ x1 + x2 + x3)
    summary(lm1)

    lm2 <- lm(y ~ x1 + x2)
    summary(lm2)

    anova(lm1, lm2)

    > anova(lm1, lm2)
    Analysis of Variance Table
    
    Model 1: y ~ x1 + x2 + x3
    Model 2: y ~ x1 + x2
      Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
    1     96  82.782                                  
    2     97 146.773 -1    -63.99 74.207 1.401e-13 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
kjetil b halvorsen
Luna
    @Luna, why is that wrong? It appears you used `x3` to generate the `y`s, so it should be included in the model and the $p$-value agrees with that conclusion. – Macro Jul 05 '12 at 18:46
  • @Seth - you are right. I was just giving a toy example of using anova generally in model comparison. So it's not linked to my original question. – Luna Jul 05 '12 at 19:01
  • @Macro - you are right. Now I see the point. Thank you! – Luna Jul 05 '12 at 19:02
  • The 'Anova' function from the R 'car' package ([pdf](https://cran.r-project.org/web/packages/car/car.pdf)) lets you test the overall significance of a categorical variable. It works with a lot of different packages and types of regression. – SK4ndal May 31 '18 at 13:21

1 Answer


You are correct that those $p$-values only tell you whether each level's mean is significantly different from the reference level's mean, so they only describe pairwise differences between the levels. Testing whether the categorical predictor, as a whole, is significant is equivalent to testing whether there is any heterogeneity in the means of the levels of the predictor. When there are no other predictors in the model, this is a classical ANOVA problem.
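For instance, a minimal sketch of that single-factor case in R (the data and the 4-level grouping here are simulated purely for illustration):

```r
# Simulated one-way ANOVA example: one categorical predictor, no covariates.
set.seed(1)
grp <- factor(rep(letters[1:4], each = 25))   # hypothetical 4-level factor
y   <- rnorm(100, mean = as.numeric(grp))     # group means genuinely differ

fit <- lm(y ~ grp)
anova(fit)     # single overall F-test for heterogeneity of the group means
summary(fit)   # by contrast: per-level t-tests vs. the reference level only
```

The `anova()` table gives one $F$-statistic for the factor as a whole, while `summary()` prints a separate $t$-test for each non-reference level, which is exactly the distinction raised in the question.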

When there are other predictors in the model, you have two options to test for the significance of a categorical predictor:

(1) The likelihood ratio test: Suppose you have an outcome $Y_i$, quantitative predictors $X_{i1}, ..., X_{ip}$ and the categorical predictor $C_i$ with $k$ levels. The model without the categorical predictor is

$$ Y_i = \beta_0 + \beta_1 X_{i1} + ... + \beta_p X_{ip} + \varepsilon_i $$

In R you can fit this model with the lm() command and extract the log likelihood with the logLik command. Call this log-likelihood $L_0$. Next, you can fit the model with the categorical predictor:

$$ Y_i = \beta_0 + \beta_1 X_{i1} + ... + \beta_p X_{ip} + \sum_{j=1}^{k-1} \alpha_j B_j + \varepsilon_i $$

where $B_j$ is a dummy variable which is $1$ if $C_i = j$ and $0$ otherwise. The $k$'th level is the reference level, which is why there are only $k-1$ terms in the sum. R will automatically do this dummy coding for you if you pass the categorical variable to lm() as a factor. You can fit this model similarly and extract the log-likelihood as above. Call this log-likelihood $L_1$. Then, under the null hypothesis that $C_i$ has no effect,

$$ \lambda = 2 \left( L_1 - L_0 \right ) $$

has a $\chi^2$ distribution with $k-1$ degrees of freedom. So, you can calculate the $p$-value using 1-pchisq(2*(L1-L0),df=k-1) in R to test for significance.
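Putting those steps together, a sketch of the whole LRT in R (the data, effect sizes, and variable names here are simulated and illustrative, not taken from the question):

```r
# Likelihood ratio test for a categorical predictor C alongside a
# quantitative predictor x1. All data simulated for illustration.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
C  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # k = 3 levels
y  <- 1 + 2 * x1 + 0.5 * (C == "b") + rnorm(n)

fit0 <- lm(y ~ x1)        # model without the categorical predictor
fit1 <- lm(y ~ x1 + C)    # R dummy-codes the factor C automatically

L0 <- as.numeric(logLik(fit0))
L1 <- as.numeric(logLik(fit1))
k  <- nlevels(C)

lambda  <- 2 * (L1 - L0)                 # LRT statistic
p.value <- 1 - pchisq(lambda, df = k - 1)
p.value
```

Because the models are nested, $L_1 \ge L_0$ always holds, so `lambda` is non-negative by construction.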

(2) $F$-test: Without going into the details (which are similar to the LRT, except that sums of squares are used rather than log-likelihoods), I'll explain how to do this in R. If you fit the "full" model (i.e. the model with all of the predictors, including the categorical predictor) using the lm() command (call this g1) and the model without the categorical predictor (call this g0), then anova(g1, g0) will test this hypothesis for you as well.
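A self-contained sketch of this comparison (again with simulated, purely illustrative data and names):

```r
# F-test for a categorical predictor C via nested model comparison.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
C  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y  <- 1 + 2 * x1 + 0.5 * (C == "b") + rnorm(n)

g1 <- lm(y ~ x1 + C)   # "full" model, including the categorical predictor
g0 <- lm(y ~ x1)       # reduced model without it
anova(g0, g1)          # F-test on the k - 1 dummy coefficients jointly
```

The `Pr(>F)` column in the resulting table is the $p$-value for the categorical predictor as a whole.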

Note: both of the approaches I've mentioned here require normality of the errors. Also, the likelihood ratio test is a very general tool used for nested comparisons, which is why I mention it here (and why it occurs to me first), although the $F$-test is more familiar in comparing linear regression models.

Macro
  • Thanks a lot Macro. I found that my data is highly non-normal. The QQ plot is as follows: the curve is all below the straight 45 degree line. The curve is tangent to that straight line. And the curve looks like the curve of f(x)=-x^2 ( shape-wise). What kind of problem am I facing? And how shall I fix this? Thank you! – Luna Jul 05 '12 at 18:21
  • 1
    @Luna, Your data is highly non-normal or the residuals are highly non-normal? Also, I don't think it's possible for the entire set of points to lie under the 45 degree line. – Macro Jul 05 '12 at 18:23
  • oh actually you are right... I just took one more look at the QQ plot. It's not the entire set of points that's under the 45 degree line. It's the curve with the shape of f(x)=-x^2 is "tangent" to the 45 degree line. By "tangent" I should have meant that those points around the "tangent" point are actually above the 45 degree line, very slightly though. Therefore, visually speaking, most of the data (~98%) are below the 45 degree line... what shall I do first to fix this problem before doing model comparison? Thank you! – Luna Jul 05 '12 at 18:32
  • @Luna, are these the residuals or the raw data points? – Macro Jul 05 '12 at 18:34
  • These are the residual QQ plot coming out from "plot(lmModel)"... What shall I do to fix the problem? Thank you! – Luna Jul 05 '12 at 19:00
  • @Luna, I don't think that `plot(lmModel)`, where `lmModel` is a fitted `lm()` model will make a residual QQ plot. You should type `qqnorm(lmModel$res); qqline(lmModel$res)` to do that. The residual you described are left skewed. You could consider transforming the response (log maybe) to take care of that, but that's really an entirely different question from the one posed in this thread. – Macro Jul 05 '12 at 19:08
  • Are there methods for model comparison that don't depend on normality? – Luna Jul 05 '12 at 19:08
  • 2
    If your sample size is pretty large, the $p$-values should still be reasonable (by the central limit theorem) as long as your error distribution isn't long-tailed. If you just want to test the categorical variable in isolation, you can use a non-parametric ANOVA: http://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance but, as I said, this is really becoming an entirely different question from the one posed and may be more appropriately posed as a new question or answered by searching the site for a related question. – Macro Jul 05 '12 at 19:11
  • If I want to test the joint significance of two categorical variables, do I exclude both in the $g_0$ model and test it against the model $g_1$ which includes both categorical variables? – Druss2k Dec 19 '12 at 04:05
  • 1
    @Druss2k, yes that's correct. – Macro Dec 19 '12 at 21:09
  • @Macro Nice answer. My question is that my model only has one predictor and it's categorical with 9 levels. How exactly do I perform the ANOVA on this? Would it be as simple as `response ~ 1` (i.e., just including some intercept?) – masfenix Aug 19 '15 at 17:19