$\chi^2$ test of significance vs. goodness of fit

Question

I have 644 companies, of which 154 went bankrupt. Below I investigate whether bankruptcy is related to sector type:

Sector  | Bankrupt   Nothing | Total
--------+--------------------+------
BioTech |    15        110   |  125
Airline |    20        120   |  140
AutoCos |    50        100   |  150
Telecom |    60         40   |  100
Oil&Gas |     9        120   |  129

Hit rate = (15+20+50+...) / (125+140+150+...) = 154/644

I need to find if the bankruptcy and sectors are inter-dependent. Once that is proven, I have to investigate each sector to check the relationship. I want to check if my following approach is correct:

Test of Independence: Use (Obs-Exp)^2/Exp on the 'Bankrupt' & 'Nothing' counts for each sector and sum the results. The expected count of events is HitRate*Total (e.g., for BioTech it is (154/644)x125). The corresponding expected count for 'Nothing' is 125 - (154/644)x125. Compare the resulting chi-squared statistic to the critical value at, say, the 95% level using df = 4.

Goodness of Fit: If the test of independence rejects the null hypothesis, I investigate whether each sector has some power. I do exactly what I did above but with df = 1 for each bin.
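A minimal R sketch of the first step (the overall test of independence), assuming the counts are entered as a 5 x 2 matrix. The object names `counts`, `hit.rate`, `expected.events` and `res` are illustrative and not part of the original post. It checks that HitRate x Total matches the usual expected count, row total x column total / N:

```r
# Counts from the table above: rows are sectors, columns are outcomes
counts = matrix(c(15, 110,
                  20, 120,
                  50, 100,
                  60,  40,
                   9, 120),
                ncol=2, byrow=TRUE,
                dimnames=list(c("BioTech", "Airline", "AutoCos", "Telecom", "Oil&Gas"),
                              c("Bankrupt", "Nothing")))

hit.rate = sum(counts[, "Bankrupt"]) / sum(counts)   # 154/644
expected.events = hit.rate * rowSums(counts)         # HitRate * Total for each sector

# Pearson chi-squared test of independence on the whole table
res = chisq.test(counts)
res$expected    # expected counts: row total * column total / N
res$parameter   # degrees of freedom: (5 - 1) * (2 - 1) = 4
```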

Q1. Is my second step correct? (i.e. is my goodness of fit test actually a test of independence?)

Q2. Is my overall approach statistically correct?

It isn't clear where `HitRate` comes from. Is it just the percentage of observations that is in the `Event` column? If so, that is incorrect. Your GoF analysis doesn't make any sense to me. Btw, where do these bins come from? Often 'bin' is used to denote continuous data that have been categorized. If that's the case here, this is a bad thing to do & other analyses would be better for your task. — gung - Reinstate Monica, Mar 03 '14 at 22:30
Yes. If you refer to: http://omega.albany.edu:8008/mat108dir/chi2independence/chi2in-m2h.html, the approach actually uses Event#/Total# to get the expected value for each data-group. I have re-defined the bins so that it becomes clearer. Is my GoF solution wrong? — Maddy, Mar 03 '14 at 22:56
I don't see where on that page they suggest Event#/Total#; the correct formula is sum(row_i)*sum(col_j)/N, & that formula is listed on that page. Thanks for editing the bin names; I'm glad they really are nominal categories. Are you thinking of those categories as predictor / explanatory variables & Event / Nothing as a response variable? — gung - Reinstate Monica, Mar 04 '14 at 00:51
If you look closely, it is (sum(col_j)/N)*sum(row_i) i.e. (4228/10000)*6383 which is nothing but HitRate * UniverseSize for that group. — Maddy, Mar 04 '14 at 01:28
I'm still not sure that I'm following you; there should be a different expected count for every cell. But set that aside, Are you thinking of the categories (ie, BioTech, Airline, etc) as predictor variables & Event/Nothing as an outcome? — gung - Reinstate Monica, Mar 04 '14 at 02:07
You'd have to work through the example on that pdf to see how it is working. I have replicated that solution. Regarding those categories: imagine you have 644 companies out of which 154 went bankrupt. You want to see if a company's sector somehow explains bankruptcy. So you create that table and investigate. Is the scenario clearer now? Please let me know. Thanks. — Maddy, Mar 04 '14 at 15:32
let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/13370/discussion-between-maddy-and-gung) — Maddy, Mar 04 '14 at 15:35
I think we won't need to continue this discussion. It seems that you think of sector as a predictor variable & bankruptcy as an outcome. You want to know if bankruptcy rates differ by sector. That means you don't want to use the chi-squared test. — gung - Reinstate Monica, Mar 04 '14 at 15:38
@gung thanks for the LR solution. I'm familiar with it. LR makes sense if its a plot between Cancer and Age. Chi Square investigates whether frequencies of certain categorical variables (such as Pass/Fail) is independent of the groups. Eg, financial viability may be related to sector (esp airlines during oil-shocks). Please don't end the discussion until we agree on whether Chi Sq is the better solution here. — Maddy, Mar 04 '14 at 18:46

score 2 · Accepted Answer · edited Apr 13 '17 at 12:44

You don't want to use the chi-squared test to analyze this situation; it does not correspond to the question you want to answer. Instead, you should use logistic regression (LR). To learn more about the distinction between a test that treats one variable as a predictor (e.g., logistic regression) and a test that does not treat any of the variables as predictors (e.g., the chi-squared test), it may help you to read my answer here.

If you are unfamiliar with LR, you may want to read through some of the threads on CV categorized under the logistic tag. For some basics, my answer here explains the ideas behind probabilities, odds, odds ratios, and log odds; my answer here is written in a different context, but ends up providing an overview of what LR is all about in order to answer the OP's question.

It has been a long time since I've used MATLAB, and I don't think I ever fit an LR model with it, but I gather the function to use is glmfit(); a walk-through of a simple example can be found in this blog post.

If you were to analyze these data in R, it would be:

my.data = read.table(text="Sector       Bankrupt    Nothing      total   
                           BioTech      15          110          125
                           Airline      20          120          140
                           AutoCos      50          100          150
                           Telecom      60          40           100
                           Oil&Gas      9           120          129", header=TRUE)

# Expand the grouped counts into one row per company
sector = rep(as.character(my.data$Sector), times=my.data$total)
# 1 = bankrupt, 0 = not; within each sector the 1s come first, then the 0s
bankrupt = unlist(Map(function(b, n) c(rep(1, b), rep(0, n)),
                      my.data$Bankrupt, my.data$Nothing))

lr.model = glm(formula=bankrupt~sector, family=binomial(link="logit"))
anova(lr.model, test="LRT")
# Analysis of Deviance Table
# 
# Model: binomial, link: logit
# 
# Response: bankrupt
# 
# Terms added sequentially (first to last)
# 
#        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
# NULL                     643      708.5              
# sector  4   111.09       639      597.4 < 2.2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
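The same likelihood-ratio test can probably be obtained without expanding the table to one row per company, since glm() accepts a two-column matrix of successes and failures as a binomial response. A minimal sketch, assuming the counts are held in a data frame like `my.data` (rebuilt here so the block runs on its own; `grouped.model` and `lrt` are illustrative names):

```r
# Grouped counts copied from the question's table
my.data = data.frame(
  Sector   = c("BioTech", "Airline", "AutoCos", "Telecom", "Oil&Gas"),
  Bankrupt = c(15, 20, 50, 60, 9),
  Nothing  = c(110, 120, 100, 40, 120)
)

# Binomial response given as cbind(successes, failures), one row per sector
grouped.model = glm(cbind(Bankrupt, Nothing) ~ Sector,
                    family=binomial(link="logit"), data=my.data)

# Likelihood-ratio test for the sector effect
lrt = anova(grouped.model, test="LRT")
lrt
```

The residual deviances will differ from the ungrouped fit, because with one row per sector the model with a sector term is saturated. The deviance attributed to sector and its degrees of freedom should match the table above.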

Following a significant result from the LR model, you could conduct pairwise tests for equality of proportion bankrupt. I gather the MATLAB function is ztest(). In R it would be:

props = with(my.data, Bankrupt/total)
names(props) = my.data$Sector
props
#    BioTech    Airline    AutoCos    Telecom    Oil&Gas 
# 0.12000000 0.14285714 0.33333333 0.60000000 0.06976744 

my.table = as.table(cbind(my.data$Bankrupt, my.data$Nothing))
rownames(my.table) = my.data$Sector
colnames(my.table) = c("Bankrupt", "Nothing")
my.table
#         Bankrupt Nothing
# BioTech       15     110
# Airline       20     120
# AutoCos       50     100
# Telecom       60      40
# Oil&Gas        9     120
prop.test(my.table[4:5,])
# 
# 2-sample test for equality of proportions with continuity correction
# 
# data:  my.table[4:5, ]
# X-squared = 72.7321, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
#   0.4157529 0.6447122
# sample estimates:
#   prop 1     prop 2 
# 0.60000000 0.06976744

If you didn't know, a priori, which comparisons you were interested in testing, but simply tested whichever were suggested by the observed proportions, you may want to adjust the critical alpha to control the familywise error rate. With all pairwise comparisons, there are $5 \times 4/2 = 10$ possible comparisons (which are not orthogonal), so you could use the Bonferroni correction by dividing alpha by 10 to determine the threshold you want to use for significance (i.e., $.05/10=.005$).
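As a rough sketch of how all ten comparisons could be run at once, base R's pairwise.prop.test() applies prop.test() to every pair and adjusts the p-values. The object names `bankrupt.counts`, `sector.totals` and `pw` are illustrative. Note that the Bonferroni adjustment here multiplies each p-value by 10 (capped at 1), so the adjusted values are compared to .05 rather than to .005:

```r
# Bankruptcies and sector sizes from the question's table
bankrupt.counts = c(BioTech=15, Airline=20, AutoCos=50, Telecom=60, "Oil&Gas"=9)
sector.totals   = c(125, 140, 150, 100, 129)

# All pairwise two-sample tests of equal proportions, Bonferroni-adjusted
pw = pairwise.prop.test(bankrupt.counts, sector.totals,
                        p.adjust.method="bonferroni")

# Lower-triangular matrix of adjusted p-values (10 comparisons)
pw$p.value

# Pairs whose adjusted p-value falls below .05
which(pw$p.value < 0.05, arr.ind=TRUE)
```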

You can learn more about these sorts of issues by reading the threads on CV categorized under the multiple-comparisons tag.

I'm accepting this answer as it pointed me to the right direction (though it was in R!). It might help someone implement the same in R. — Maddy, Mar 07 '14 at 19:45
