Proper Statistical Test for Binary Data

Question

I'm looking for the best statistical test to apply in a particular situation, and I hope to find the answer here.

First of all some details:

I'm studying 33 different mutants of a particular protein, and I've partitioned them into 4 small groups on the basis of their severity:

Group A has 11 mutants
Group B has 8 mutants
Group C has 6 mutants
Group D has 8 mutants

I can test these mutants for the presence/absence of a series of particular internal interactions, and I want to know whether there is a statistical difference among the 4 groups. These internal interactions are essentially independent binary variables: 0 means the mutant does not have a particular interaction, 1 means it does.

Basically, I want to check whether there is a statistically significant difference between the groups in the percentage of mutants that have a given interaction.

My final goal is to correlate the presence/absence of some of these interactions with the severity of the mutations and find out which of these interactions are specific to a given group.

This is an example with real data:

Interaction #1

27.3% of the mutants in Group A have this interaction
12.5% of the mutants in Group B have this interaction
83.3% of the mutants in Group C have this interaction
50.0% of the mutants in Group D have this interaction

My question is: Which statistical test should I use to check if the differences in these percentages are statistically significant?

Thank you

[edit]

As suggested by @AndrewM, here are some more details about what I'm trying to do.

I have ~150 interactions, and only a few of them are missing solely in mutants of GroupD (highest severity), while the vast majority are variably missing in mutants of all clusters.

What I need is an unbiased way to highlight those interactions that, even if also missing in a small number of mutants in other clusters, could be described as typically missing in GroupD.

My final aim is to test if I can explain, at least partially, the severity of these mutants looking at their missing interactions and then correlate mutant severity with presence/absence of the interactions.

Thanks

I'm assuming that you have a categorical variable Group, which is the severity of your mutant, and a second categorical variable with two states, "has a particular interaction" or "doesn't have it". This isn't binary data; you're doing statistics over proportions. I think what you are looking for is a test of whether group and interaction are independent of each other, in other words whether the severity of the group affects having or not having a particular interaction. If that's the case, you are looking for a simple chi-square independence test. — Ramalho, Oct 08 '14 at 10:56
Hi @Ramalho. `...you have a categorical variable Group`. The grouping derives from a continuous "severity score" which correlates well with experimental data. `...if the severity of group affects having or not having a particular interaction`. Actually the opposite: whether having or not having a given interaction is "specific" to a given group. In the example in my question, only 12.5% of group B has that particular interaction. My question is: on the basis of those differences in proportions, can I say that not having that particular interaction is statistically linked to severe mutants? Thanks — clusterman, Oct 08 '14 at 12:12
OK, I understand your problem a little better, but what is your ultimate goal: to properly classify a mutant given the presence or absence of specific interactions? Or do you just want to study the role of a specific interaction in the severity of the group? — Ramalho, Oct 08 '14 at 13:52
Hi @Ramalho, thank you again for your answer. My ultimate goal is to find a "signature" of missing interactions strongly associated with severe mutations. In other words: I want to know which interactions, if missing, are associated with a severe condition. — clusterman, Oct 08 '14 at 14:26
Perhaps read up on z-tests of proportions, eg http://stats.stackexchange.com/questions/11537/tests-on-binomial-distribution?rq=1 or http://www.socscistatistics.com/tests/ztest/ — Andrew M, Oct 08 '14 at 20:11
Why group, rather than use the variable you used to obtain the groups? It seems like you're dividing a potentially continuous scale into 5 groups, which is usually not the best idea. — Glen_b, Oct 08 '14 at 21:51
Hi @AndrewM, thank you for your answer. Correct me if I'm wrong, but the z-test for proportions tells you whether 2 proportions are different. I have ~150 interactions, and only a few of them are missing only in mutants of GroupD (highest severity), while the vast majority are variably missing in mutants of all clusters. What I need is an unbiased way to highlight those interactions that, even if also missing in a small number of other mutants, are typically missing in GroupD. My final aim is to test whether I can explain, at least partially, the severity of these mutants by looking at their missing interactions. — clusterman, Oct 10 '14 at 10:01
@Clusterman, maybe it would help to edit your question to clarify this, because this comment makes it seem like you want to correlate mutant severity with presence/absence of the interaction, rather than just "check if the differences in these percentages are statistically significant." — Andrew M, Oct 10 '14 at 18:33
Hi @AndrewM, thank you for your suggestion. I just added some more details to the original post. — clusterman, Oct 11 '14 at 09:17

score 1 · Answer 1 · answered Oct 08 '14 at 19:32

Have you looked at $\chi^2$ statistics of independence?

Sounds like a classic use case to me: test whether the binary indicators you have and the severity group of the mutant are independent.

For small sample sizes, you may need to use Yates's correction for continuity. Depending on the side of the test, you may want to do a similar adjustment the other way, to make sure you err on the safe side (i.e. assume independence if in doubt).
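As a rough illustration of the test suggested above, here is a minimal sketch using only the Python standard library. It assumes the percentages for interaction #1 correspond to counts of 3/11, 1/8, 5/6 and 4/8 (reconstructed from the question, not given there explicitly). The function names are made up for this example, and the p-value helper uses a closed form that is valid only for 3 degrees of freedom.

```python
import math

# Assumed counts, reconstructed from the percentages quoted in the question
# for interaction #1: 3/11, 1/8, 5/6 and 4/8 mutants have the interaction.
group_sizes = {"A": 11, "B": 8, "C": 6, "D": 8}
present = {"A": 3, "B": 1, "C": 5, "D": 4}


def chi_square_independence(present, group_sizes):
    """Return the Pearson chi-square statistic and the expected cell counts
    for a (groups x 2) table of present/absent counts."""
    total = sum(group_sizes.values())
    total_present = sum(present.values())
    stat = 0.0
    expected_counts = []
    for group, n in group_sizes.items():
        cells = (
            (present[group], total_present),              # "present" column
            (n - present[group], total - total_present),  # "absent" column
        )
        for observed, column_total in cells:
            expected = n * column_total / total
            expected_counts.append(expected)
            stat += (observed - expected) ** 2 / expected
    return stat, expected_counts


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees
    of freedom (closed form; not valid for other degrees of freedom)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


chi2, expected_counts = chi_square_independence(present, group_sizes)
p_value = chi2_sf_df3(chi2)  # df = (4 groups - 1) * (2 outcomes - 1) = 3
n_small = sum(e < 5 for e in expected_counts)

print(f"chi2 = {chi2:.2f}, df = 3, p = {p_value:.3f}")
print(f"{n_small} of {len(expected_counts)} expected counts are below 5")
```

The second printed line matters as much as the first: with groups this small, most expected counts fall below the usual rule-of-thumb threshold of 5, so the chi-square approximation should be read with caution. Yates's correction is defined for 2x2 tables, so for a 4x2 table like this one an exact or permutation-based test may be a safer choice.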

answered Oct 08 '14 at 19:32

Has QUIT--Anony-Mousse


The groups are ordered. This would ignore the ordering, throwing out a lot of power. – Glen_b Oct 08 '14 at 21:50
Are they? In his example, Interaction#1, there doesn't seem to be an order in effect. – Has QUIT--Anony-Mousse Oct 09 '14 at 14:05
To quote the OP: "*I've partitioned these mutants in 4 small groups on the basis of their severity*". Unless I misunderstand something, A-D represent ordered categories of severity – Glen_b Oct 09 '14 at 20:29
Hi Anony-Mousse and @Glen_b, thank you for your answers. Yes, mutants are grouped on the basis of their severity, which is highest for those in GroupD and lowest for those in A, with B and C in the middle. I have ~150 interactions to screen, and my goal is to find those missing in GroupD. Only a few interactions are missing only in the mutants of Group D, while the vast majority of them are variably missing in mutants of all clusters. What I need is an unbiased way to highlight those interactions that, even if also missing in a small number of other mutants, are typically missing in group D – clusterman Oct 10 '14 at 09:30
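
One common way to use the ordering discussed in these comments is a Cochran-Armitage test for trend, which asks whether the proportion changes steadily across ordered groups. This technique is not named in the thread; the sketch below is only an illustration. It assumes equally spaced severity scores 0 to 3 for groups A to D and the same reconstructed counts for interaction #1 (3/11, 1/8, 5/6, 4/8), and it relies on a normal approximation that is rough at this sample size.

```python
import math

# Assumed counts, reconstructed from the question's percentages (interaction #1).
sizes = [11, 8, 6, 8]    # groups A..D, ordered by increasing severity
present = [3, 1, 5, 4]   # mutants in each group that have the interaction
scores = [0, 1, 2, 3]    # assumption: equally spaced severity scores


def cochran_armitage(present, sizes, scores):
    """Cochran-Armitage trend test: returns the z statistic and the
    two-sided p-value from the normal approximation."""
    total = sum(sizes)
    p_bar = sum(present) / total  # pooled proportion with the interaction
    # Score-weighted sum of deviations from the pooled expectation.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, present, sizes))
    sum_ns = sum(n * s for n, s in zip(sizes, scores))
    sum_ns2 = sum(n * s * s for n, s in zip(sizes, scores))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / total)
    z = t_stat / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


z, p_trend = cochran_armitage(present, sizes, scores)
print(f"z = {z:.2f}, two-sided p = {p_trend:.3f}")
```

A trend test only gains power when the proportions actually move monotonically with severity. For interaction #1 they do not (low in B, high in C), which is the point raised in the reply above.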

score 0 · Answer 2 · edited Jun 11 '20 at 14:32

I'm studying 33 different mutants of a particular protein and I've partitioned these mutants in 4 small groups on the basis of their severity:

Group A has 11 mutants
Group B has 8 mutants
Group C has 6 mutants
Group D has 8 mutants

This can be modelled as:

$P(Group_A) = 11/33 \approx 0.333$

$P(Group_B) = 8/33 \approx 0.242$

$P(Group_C) = 6/33 \approx 0.182$

$P(Group_D) = 8/33 \approx 0.242$

You also give us this example:

Interaction #1

27.3% of the mutants in Group A have this interaction

12.5% of the mutants in Group B have this interaction

83.3% of the mutants in Group C have this interaction

50.0% of the mutants in Group D have this interaction

Which can be thought of as:

$P(Interaction_1 | Group_A) = 0.273$

$P(Interaction_1 | Group_B) = 0.125$

$P(Interaction_1 | Group_C) = 0.833$

$P(Interaction_1 | Group_D) = 0.5$

Thus you also know that:

$P(\lnot Interaction_1 | Group_A) = 1 - P(Interaction_1 | Group_A) = 0.727$

$P(\lnot Interaction_1 | Group_B) = 1 - P(Interaction_1 | Group_B) = 0.875$

$P(\lnot Interaction_1 | Group_C) = 1 - P(Interaction_1 | Group_C) = 0.167$

$P(\lnot Interaction_1 | Group_D) = 1 - P(Interaction_1 | Group_D) = 0.5$

My ultimate goal is to find a "signature" of missing interactions strongly associated with severe mutations. In other words: I want to know which interactions, if missing, are associated with a severe condition

As I interpret it, you want to know which interactions, if absent, make a case $i$ highly likely to belong to $Group_A$, which I assume here is the group of highest severity. I don't know what you mean by "strongly associated"; are you bringing in correlation?

You can check the probability of a mutant belonging to $Group_A$ (having high severity) given that it tested negative for interaction 1, i.e. $P(Group_A | \lnot Interaction_1)$.

This can be calculated as:

$P(Group_A | \lnot Interaction_1) = \frac{P(Group_A, \lnot Interaction_1)}{P(\lnot Interaction_1)} = \frac{P(\lnot Interaction_1 | Group_A)*P(Group_A)}{P(\lnot Interaction_1)}$
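
To make the formula concrete, here is a small sketch that evaluates it for all four groups with exact fractions. It assumes the percentages for interaction #1 correspond to 8/11, 7/8, 1/6 and 4/8 mutants lacking the interaction (reconstructed from the question). The variable names are chosen only for this example.

```python
from fractions import Fraction

# Assumed counts, reconstructed from the question's percentages.
group_sizes = {"A": 11, "B": 8, "C": 6, "D": 8}
absent = {"A": 8, "B": 7, "C": 1, "D": 4}  # mutants lacking interaction #1

total = sum(group_sizes.values())

# P(Group_g): share of all mutants that fall in each group.
prior = {g: Fraction(n, total) for g, n in group_sizes.items()}

# P(not Interaction_1 | Group_g): share of each group lacking the interaction.
likelihood = {g: Fraction(absent[g], group_sizes[g]) for g in group_sizes}

# P(not Interaction_1), by the law of total probability.
evidence = sum(likelihood[g] * prior[g] for g in group_sizes)

# Bayes' rule: P(Group_g | not Interaction_1).
posterior = {g: likelihood[g] * prior[g] / evidence for g in group_sizes}

for g, p in posterior.items():
    print(f"P(Group_{g} | not Interaction_1) = {p} = {float(p):.2f}")
```

Because every term is an empirical frequency, each posterior reduces to the share of interaction-lacking mutants that sit in that group (for example 8 of 20 for Group A). It describes these 33 mutants and says nothing by itself about statistical significance.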

Using the same logic you can calculate $P(Group_A | \lnot Interaction_1, \lnot Interaction_2, ...)$ or any combination of present/absent interactions.

Still, I think this isn't quite what you are asking for. If that's the case, give us more details and examples of what you want to accomplish.

You seem to want to classify the severity of a mutant by studying a single variable, while I can only think of an estimate of its severity as a function of two or more interactions.

Hi @Ramalho. There are ~150 interactions, and only a few of them are missing only in mutants of GroupD (the group with the highest severity), while the vast majority are variably missing in mutants of all clusters. What I need is an unbiased way to highlight those interactions that, even if also missing in a small number of other mutants, can be "labelled" as typically missing in GroupD. My final aim is to test whether I can explain, at least partially, the severity of the mutants in GroupD by looking at their missing interactions. — clusterman, Oct 10 '14 at 09:46
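
One possible reading of "typically missing in GroupD" is a comparison of Group D against all other mutants pooled, one interaction at a time. The sketch below does this with a one-sided Fisher exact test, which is an assumption of this example rather than something proposed in the thread. It uses counts reconstructed from the question for interaction #1 (missing in 4 of 8 Group D mutants and in 20 of 33 mutants overall), and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(missing_in_d, size_d, missing_total, total):
    """One-sided Fisher exact p-value: the probability of seeing at least
    `missing_in_d` mutants lacking the interaction in Group D if the
    `missing_total` lacking mutants were spread at random over all `total`
    mutants (hypergeometric upper tail)."""
    upper = min(size_d, missing_total)
    favourable = sum(
        comb(missing_total, k) * comb(total - missing_total, size_d - k)
        for k in range(missing_in_d, upper + 1)
    )
    return favourable / comb(total, size_d)


# Interaction #1: assumed counts reconstructed from the question's percentages.
p_enriched = fisher_one_sided(missing_in_d=4, size_d=8, missing_total=20, total=33)
print(f"P(at least 4 of 8 Group D mutants lack interaction #1) = {p_enriched:.3f}")
```

For interaction #1 this p-value is large, so that interaction would not be flagged as typically missing in Group D. Repeating the test over ~150 interactions would presumably call for a multiple-testing correction before any of them is labelled as a signature.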
