
I have a data set of cell phone customer information with two columns. The first column contains the category that an account falls into (either A, B, or C) and the second column is binary-valued for whether that account has cancelled, e.g.

A | cancelled
C | active
B | active
A | cancelled

What I want to do is come up with some sort of hypothesis test to test whether the ratio of accounts of type A, B, and C is different for active accounts vs. cancelled accounts, the null hypothesis being that they are the same. So it's like a hypothesis test for proportions, except I do not know how to do this for three values.

  • You could use a $\chi^2$ test to test for equality of proportions amongst the three groups. –  Sep 24 '13 at 17:17
  • I'm also thinking I could do three hypothesis tests A vs B, B vs C, and A vs C, to see if they are different – user1893354 Sep 24 '13 at 17:31
  • You could, but be aware that you would then have to correct for problems of multiple comparisons. –  Sep 24 '13 at 17:35
  • Thank you for your answer. I'm just curious what you mean by problems of multiple comparisons? Or, more specifically, why the three hypothesis test method is disadvantageous. Thanks! – user1893354 Sep 24 '13 at 18:13
  • There are two problems with using three hypothesis tests. First, they are interdependent because each pair reuses some of the data. Second, *if* they were actually independent, then the chance that at least one of them would be significant even when the null is true--that is, the chance of a false positive error--would be almost three times greater than the desired false positive rate. The second problem indicates the test needs to be adjusted, but the first one shows that finding the appropriate adjustment may be problematic. The $\chi^2$ approach avoids these problems. – whuber Sep 24 '13 at 20:29
  • "if they were actually independent, then the chance that at least one of them would be significant even when the null is true--that is, the chance of a false positive error--would be almost three times greater than the desired false positive rate" It doesn't seem like that should be a big deal. It is obvious that the more tests you do, the greater the likelihood that one would be a false positive. I suppose it would make a significant difference in some cases though, thanks! – user1893354 Sep 24 '13 at 20:53
  • It's possible to partition the chi-square into specific contrasts, so if you have - *a priori* - two orthogonal linear components (such as A vs B and C vs A&B) you want to test then you can do that within the chi-square framework. If all three pairwise comparisons matter then you're left with the multiple comparisons issue (not everyone is especially bothered about familywise error rates though). – Glen_b Sep 25 '13 at 01:15
  • OT, but are you in the US MVNO business? I've seen your problem questions and they are remarkably like mine. =) – Rizwan Kassim Dec 11 '13 at 06:58

1 Answer


I am going to frame my answer in general terms and insert comments about how your problem fits into the testing framework. In general, we can test for equality of proportions using a $\chi^2$ test, where the typical null hypothesis, $H_0$, is the following:

$$H_0:p_1=p_2=...=p_k$$

i.e., all of the proportions are equal to each other. Now, in your case, your null hypothesis is the following:

$$H_0:p_1=p_2=p_3$$ and the alternative hypothesis is $$H_A:\text{ at least one }p_i\text{ is different for }i=1,2,3$$

Now, in order to carry out the $\chi^2$ test, we need to calculate the following test statistic:

$$\chi^2=\sum_{i=1}^n\frac{(O_i-E_i)^2}{E_i}$$

where

  • $\chi^2$ = Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution
  • $O_i$ = the observed frequency
  • $E_i$ = an expected (theoretical) frequency, asserted by the null hypothesis
  • $n$ = the number of cells in the table

In your case $n=6$ since we can think of this problem as being the following $2\times 3$ table of observed counts (one cell per status/type combination):

                 A      B      C
    Active      $O_1$  $O_2$  $O_3$
    Cancelled   $O_4$  $O_5$  $O_6$
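
As a minimal sketch in Python of the computation above (the raw rows below are made up purely for illustration; in practice they would be the many type/status pairs from your two-column data set):

```python
import numpy as np

# Made-up (type, status) rows standing in for the two-column data set
rows = [("A", "cancelled"), ("C", "active"), ("B", "active"), ("A", "cancelled"),
        ("B", "cancelled"), ("C", "active"), ("A", "active"), ("B", "active")]

types = ["A", "B", "C"]
statuses = ["active", "cancelled"]

# Tally the raw rows into the 2 x 3 table of observed counts
observed = np.zeros((len(statuses), len(types)))
for t, s in rows:
    observed[statuses.index(s), types.index(t)] += 1

# Expected counts under H0: (row total * column total) / grand total for each cell
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson's chi-square statistic: sum over all n = 6 cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)
```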

Now, once we have the test statistic, we have two options for how to complete our hypothesis test.

Option 1) We can compare our test statistic $\chi^2$ to the appropriate critical value under the null hypothesis. That is to say, if $H_0$ is true, then a $\chi^2$ statistic from a contingency table with $R$ rows and $C$ columns should have a $\chi^2$ distribution with $(R-1)\times(C-1)$ degrees of freedom. After calculating our critical value $\chi^*$, if we have that $\chi^2>\chi^*$ then we reject the null hypothesis. Obviously, if $\chi^2\leq\chi^*$ then we fail to reject the null hypothesis.

Graphically (all of the numbers are made up), this looks like the following: [image: a $\chi^2$ density curve with the critical region beyond $\chi^*$ shaded, one (blue) test statistic falling below $\chi^*$ and another (green) falling above it]

From the graph, if our test statistic $\chi^2$ corresponds to the blue test statistic, then we would fail to reject the null hypothesis, since this test statistic does not fall inside the critical region (i.e., $\chi^2<\chi^*$). Alternatively, the green test statistic does fall inside the critical region, so we would reject the null hypothesis had we calculated the green test statistic.

In your example, your degrees of freedom are equal to $$df = (R-1)\times(C-1)=(2-1)\times(3-1)=1\times2=2 $$
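
As a sketch of Option 1 (continuing the snippet above, assuming `scipy` is available and taking $\alpha = 0.05$ as an assumed significance level):

```python
from scipy.stats import chi2

alpha = 0.05                                             # assumed significance level
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (2 - 1) * (3 - 1) = 2
critical_value = chi2.ppf(1 - alpha, df)                 # chi* under H0

# Reject H0 exactly when the statistic falls in the critical region
if chi2_stat > critical_value:
    print("Reject H0: the type proportions differ between active and cancelled")
else:
    print("Fail to reject H0")
```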

Option 2) We can calculate the p-value associated with the test statistic under the null hypothesis, and if this p-value is less than some specified $\alpha$-level, then we can reject the null hypothesis. If the p-value is greater than the $\alpha$-level, then we fail to reject the null hypothesis. Note that the p-value is the probability that a $\chi^2_{(R-1)\times(C-1)}$ distribution is greater than the test statistic.

Graphically, we have the following: [image: a $\chi^2$ density curve with the area to the right of the test statistic shaded in blue]

where the p-value is calculated as the area that is greater than our test statistic (the blue shaded area in the example).

So, if $\text{p-value}\leq\alpha$ then reject the null hypothesis $H_0$; else,

if $\text{p-value}>\alpha$ then fail to reject the null hypothesis $H_0$.
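
And a sketch of Option 2, continuing with the same made-up data: the p-value is the upper-tail area of the $\chi^2_{df}$ distribution beyond the statistic, and `scipy.stats.chi2_contingency` can also carry out the whole test from the observed table if you prefer (again, $\alpha = 0.05$ is just an assumed choice):

```python
from scipy.stats import chi2, chi2_contingency

# p-value: area of the chi-square(df) density to the right of our statistic
p_value = chi2.sf(chi2_stat, df)

alpha = 0.05                       # assumed significance level
if p_value <= alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")

# Equivalently, let scipy build the expected table and run the entire test
# (correction=False turns off Yates' continuity correction)
stat, p, dof, exp = chi2_contingency(observed, correction=False)
```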