
I have the following data:

                     Smokers
                   Yes   No  All
Groups  ]10,20]     35    6   41
        ]20,30]     20   13   33
        ]30,40]     10   15   25
        ]40,50]     15    9   24
        All         80   43  123

I understand that this is a 4×2 contingency table and that logistic regression can be applied here. My objective is to test whether smokers and non-smokers have the same distribution across the 4 groups. Which test is more suitable here: a homogeneity test or Fisher's exact test? And do I have to fully specify the logistic model in order to perform these tests?


1 Answer


A chi-squared test will be simplest and most appropriate. Fisher's exact test assesses differences conditional on both margins being fixed, which is almost certainly inappropriate here. Logistic regression would be fine, but the chi-squared test is simpler; moreover, logistic regression really assesses smoking as a function of your groups, which does not quite match your actual question conceptually.

# enter the summary counts as a data frame (the All column is dropped below)
d = read.table(text="Groups     Yes   No  All
                    ']10,20]'    35    6   41
                    ']20,30]'    20   13   33
                    ']30,40]'    10   15   25
                    ']40,50]'    15    9   24", header=TRUE)

# build a labeled 4x2 table of the Yes/No counts (dropping Groups & All)
tab = as.table(as.matrix(d[,-c(1,4)]))
names(dimnames(tab)) = c("Groups", "Smoker")
rownames(tab)        = d[,1]
colnames(tab)        = names(d)[2:3]

chisq.test(tab)
#   Pearson's Chi-squared test
# 
# data:  tab
# X-squared = 14.697, df = 3, p-value = 0.002095
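
If you did want to see what Fisher's exact test gives here (bearing in mind the caveat above that conditioning on both margins arguably doesn't fit this design), it runs directly on the same table:

fisher.test(tab)  # exact test on the 4x2 table, shown for comparison only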

Let me add a couple of additional notes: the chi-squared test only gives you a p-value for the null hypothesis that the distributions are the same; you may also want to characterize how they differ. Two ways to do this are to make a table of column-wise proportions and to make a mosaic plot:

round(prop.table(tab, 2), 3)
#          Smoker
# Groups      Yes    No
#   ]10,20] 0.438 0.140
#   ]20,30] 0.250 0.302
#   ]30,40] 0.125 0.349
#   ]40,50] 0.188 0.209
dev.new()  # opens a new plotting device (the original windows() call is Windows-only)
mosaicplot(t(tab), shade=TRUE)

[Mosaic plot of the Groups × Smoker table, with cells shaded by Pearson residuals]

From these, you can see that there are 'too few' non-smokers in the ]10,20] group, and 'too many' in the ]30,40] group.
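
A more formal version of that cell-by-cell assessment comes from the standardized Pearson residuals stored in the chi-squared test object; values beyond roughly ±2 flag cells with more or fewer observations than independence predicts:

# standardized residuals: large positive = more cases than expected under the null
chisq.test(tab)$stdres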


If you wanted to use a logistic regression to test these data, it is very simple:

mod = glm(cbind(Yes, No)~Groups, d, family=binomial)
summary(mod)
# Call:
# glm(formula = cbind(Yes, No) ~ Groups, family = binomial, data = d)
# 
# Deviance Residuals: 
# [1]  0  0  0  0
# 
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)     1.7636     0.4419   3.991 6.57e-05 ***
# Groups]20,30]  -1.3328     0.5676  -2.348 0.018866 *  
# Groups]30,40]  -2.1691     0.6016  -3.606 0.000311 ***
# Groups]40,50]  -1.2528     0.6108  -2.051 0.040249 *  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1.5415e+01  on 3  degrees of freedom
# Residual deviance: 2.4425e-15  on 0  degrees of freedom
# AIC: 22.657
# 
# Number of Fisher Scoring iterations: 3

The test of Groups is not the same as any of the individual tests displayed in the summary output. With only one variable in the model, the test of the variable as a whole is the same as the test of the model as a whole. Unfortunately, R does not give you that by default here as it does for a linear model. You can use the null and residual deviances and degrees of freedom to get a likelihood ratio test, though:

# likelihood ratio test: the drop in deviance is chi-squared distributed under the null
1 - pchisq(mod$null.deviance - deviance(mod), df=mod$df.null - mod$df.residual)
# [1] 0.001494036

A convenient way to get that test is to use anova.glm(): setting test="LRT" reproduces the manual method above, and test="Rao" gives the score test, which matches the chi-squared test at the top:

anova(mod, test="LRT")
# Analysis of Deviance Table
# Model: binomial, link: logit
# Response: cbind(Yes, No)
# Terms added sequentially (first to last)
# 
#        Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
# NULL                       3     15.415            
# Groups  3   15.415         0      0.000 0.001494 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(mod, test="Rao")
# Analysis of Deviance Table
# Model: binomial, link: logit
# Response: cbind(Yes, No)
# Terms added sequentially (first to last)
# 
#        Df Deviance Resid. Df Resid. Dev    Rao Pr(>Chi)   
# NULL                       3     15.415                   
# Groups  3   15.415         0      0.000 14.697 0.002095 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Lastly, the naming of your rows (i.e., the Groups) is suspicious: is this the result of categorizing an originally continuous variable? If so, that is very much not recommended. Categorization can arbitrarily create the appearance of different distributions where none actually exists. To get a sense of this, it may help to read this excellent answer (although the context differs): Assessing approximate distribution of data based on a histogram. You would do much better to use the original continuous values, decide a priori what kind of difference between the distributions you care to detect (mean shift, difference in spread, skew, heavy-tailedness, other tail behavior, multi-modality, etc.), and test for that explicitly.
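
To make that concrete, here is a minimal sketch assuming you still had the raw data; age and smoker are hypothetical vectors (numeric ages and "Yes"/"No" status for each person), not derived from the table above:

# a t-test targets a mean shift in age between smokers and non-smokers
t.test(age ~ smoker)
# a two-sample Kolmogorov-Smirnov test is sensitive to any distributional difference
ks.test(age[smoker == "Yes"], age[smoker == "No"])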

  • Doesn't the chi-squared test perform better when the sample is bigger? I think he wants to have an exact test. Can you please explain how to use the logistic regression? – Toney Shields Oct 08 '16 at 17:40
  • @gung my data isn't the result of categorizing an originally continuous variable; I simply did what's in this chapter (Table 3.4, page 18): data.princeton.edu/wws509/notes/c3.pdf – Enthusiastic Oct 08 '16 at 18:00
  • @Enthusiastic, your data, the age ranges, & the response differ from what is in Table 3.4 on p. 18 there. But that's fine; I was just pointing out general information about situations like this. – gung - Reinstate Monica Oct 09 '16 at 01:24
  • @ToneyShields, there are more than enough data for the chi-squared test to perform perfectly fine. The rule of thumb is that the expected values should all be >5 (although this is now known to be too conservative), and the OP's data exceed that in any case. Note further that a standard logistic regression is not an exact test. – gung - Reinstate Monica Oct 09 '16 at 01:27
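
For reference, the expected counts that rule of thumb refers to can be inspected directly from the table built in the answer above:

# expected cell counts under independence; the rule of thumb asks that all be > 5
chisq.test(tab)$expected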