Comparing two binary distributions

Question

I have a dataset with responses coded as 0 and 1. I am trying to specify 3 models of responses and compare them to the observed results. I would therefore like to make single comparisons between two binary distributions. Is there a way to do this?

Basically what I want is to compare the observed results for 16 questions per participant with a model that has a prediction for each of the 16 questions. I would like to see how closely these predictions match the actual observed responses.

compare what exactly about the two populations? The means? Covariate effects? — Macro, Aug 16 '11 at 15:46
@Macro I'm interested in differences in response patterns, so whether for each question the two values match or not — upabove, Aug 16 '11 at 15:55
Does the model predict the answers for each questioner (based perhaps on their individual attributes) or does it predict the aggregate set of answers, such as the proportion of yeses in each question? — whuber, Aug 16 '11 at 21:38
@whuber it predicts individual responses but it's not based on individual attributes, so it predicts the same for each participant but for each question — upabove, Aug 17 '11 at 07:38
@whuber it's red/blue. participants responded to 16 questions with red/blue for different situations. I would like to compare these predictions with a model M1 that predicts specific answers for each question, M2 that predicts another set of answers, M3, etc. I would like to make single comparisons M0 to M1, M0 to M2 — upabove, Aug 17 '11 at 12:58
It is unusual for models to make such specific predictions about aggregate answers: that is so restrictive that such models are often of little value. Typically, a model predicts a frequency (or probability, if you prefer) for each answer, such as "60% of the answers to question 13 will be 'blue'." Even a model that ultimately makes a binary prediction, like yours, internally computes a frequency and then replaces that with the option it favors. If yours does that, you would be better off extracting that frequency and using it as-is. — whuber, Aug 17 '11 at 13:03
@whuber basically I'm trying to specify behavioral types, M1 would predict that all players send honest answers, M2 would predict that they send the opposite, M3 that they use a more sophisticated technique (ex. randomize) so basically I'm just trying to see if people behave as-if they are M1 or M2. — upabove, Aug 17 '11 at 13:13

score 5 · Accepted Answer · answered Aug 17 '11 at 14:42

This unusual situation reverses the usual role of model and sample. Typically, we think of this situation as if the reply to any given question $q$ is a Bernoulli variable $X_q$ (like flipping an unbalanced coin) with parameter $p$, the model estimates $p$, and we have multiple independent realizations of $X_q$ to compare to $p$. Here, instead, the model does not estimate $p$, but produces exactly one of the two possible outcomes of $X_q$ (for each $q$).

To evaluate such a model, we need to quantify how good each prediction is compared to any other model. Because there are only two possible predictions per question, the problem becomes exceptionally simple: for each question there is a "good" prediction--presumably, the one that agrees with the majority of the responses--or a "bad" prediction. Making the bad prediction incurs a cost. In full generality, the cost can vary from question to question (because some questions might be considered more important or useful than others). However, let's suppose the goodness of a model depends separately on how good it is for each question. Whence, independently of the models, the investigator must specify the costs $(c_q)$, one for each question. To compare the models, sum the costs for the "bad" predictions in each model. The better model has the lower sum of costs.

If all questions are of equal probative value for model selection, then all costs will be the same (and can be taken to be $1$ without any loss of generality). In this case, the cost of a model is the number of "bad" predictions it makes. The better model makes fewer bad predictions.
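As a minimal sketch of the equal-cost case, the following counts "bad" predictions against the sample majorities. The function names and the toy data are hypothetical, and skipping tied questions is an assumption of this sketch rather than something the answer prescribes.

```python
from typing import List, Optional, Sequence


def majority_answers(responses: Sequence[Sequence[int]]) -> List[Optional[int]]:
    """responses[i][q] is participant i's 0/1 answer to question q.

    Returns the majority answer per question, or None when the sample is tied.
    """
    n = len(responses)
    majorities: List[Optional[int]] = []
    for q in range(len(responses[0])):
        ones = sum(row[q] for row in responses)
        if 2 * ones > n:
            majorities.append(1)
        elif 2 * ones < n:
            majorities.append(0)
        else:
            majorities.append(None)  # tie: no majority in the sample
    return majorities


def bad_prediction_count(model: Sequence[int],
                         majorities: Sequence[Optional[int]]) -> int:
    """Number of questions where the model disagrees with the majority.

    Tied questions are skipped (an assumption of this sketch).
    """
    return sum(1 for pred, maj in zip(model, majorities)
               if maj is not None and pred != maj)


# Toy data: 3 participants, 4 questions (made up for illustration).
responses = [[0, 1, 1, 0],
             [0, 1, 0, 0],
             [1, 1, 0, 0]]
majorities = majority_answers(responses)
m1 = [0, 1, 1, 0]
m2 = [1, 0, 0, 0]
print(bad_prediction_count(m1, majorities), bad_prediction_count(m2, majorities))
```

With all costs equal to 1, the model with the smaller count is the better one in the sample.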

This leads to another question: if these participants form a representative sample of a population, as they usually do, and the purpose is to make inferences about this population, then the cost of each model is random. How certain can we be of our comparison?

To answer this, note that the aggregate response $k_q$ to question $q$, which is answered (say) by $n_q$ of the participants, has a Binomial$(n_q, p_q)$ distribution for an unknown parameter $p_q$. If the prediction made by model $m$ is $0$ (using some 0/1 indicator coding for red/blue), then the cost is

$$\text{Cost}_q(m,0) = c_q \text{ if } 2 k_q \gt n_q, \quad 0 \text{ otherwise.}$$

If the prediction made by $m$ is $1$, the cost is

$$\text{Cost}_q(m,1) = c_q \text{ if } 2 k_q \lt n_q, \quad 0 \text{ otherwise.}$$

(For simplicity I'm overlooking the possibility of ties, which can occur whenever $n_q$ is even: costs have to be stipulated in these cases, too. But that doesn't change the nature of the analysis; it's just a complication in the details.)
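The two cost rules, together with a stipulated cost for ties, might be sketched as follows. The function name and the `tie_fraction` parameter are hypothetical; charging half the cost on a tie is only one possible stipulation.

```python
def question_cost(prediction: int, k: int, n: int,
                  c: float = 1.0, tie_fraction: float = 0.5) -> float:
    """Cost of a 0/1 prediction for one question.

    k of the n respondents answered 1. A prediction is "bad" when it
    disagrees with the sample majority. On a tie (2k == n) this sketch
    charges tie_fraction * c, which is an assumption, not a rule.
    """
    if prediction not in (0, 1):
        raise ValueError("prediction must be 0 or 1")
    if 2 * k == n:
        return tie_fraction * c
    majority_is_one = 2 * k > n
    bad = (prediction == 0) == majority_is_one
    return c if bad else 0.0


print(question_cost(0, k=7, n=10))   # predicts 0, majority answered 1
print(question_cost(1, k=7, n=10))   # predicts 1, majority answered 1
print(question_cost(0, k=5, n=10))   # tie
```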

Consider, now, the comparison of two models. Because their costs can differ only for questions where the model predictions differ, we can forget about all the other questions. Without any loss of generality, then, we may assume the two models differ on all the questions. By recoding the Bernoulli variables if necessary (and changing $p_q$ to $1-p_q$ in such cases), we can arrange it so that one model, call it $m_0$, always predicts $0$ and the other, $m_1$, always predicts $1$. The observed difference between the cost of $m_1$ and the cost of $m_0$ is

$$Y = \sum_{q} I(2 k_q \lt n_q) c_q$$

where $I(\text{true})=1$ and $I(\text{false})=-1$. The true difference equals

$$\eta = \sum_{q} I(p_q \lt 1/2) c_q.$$

We seek either a confidence interval for $\eta$ based on the observations (the questionnaire results) or a test of the hypothesis $H_0: \eta \ge 0$.
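For concreteness, the observed difference $Y$ can be computed directly from the counts. This is a small sketch with made-up numbers; the function name is hypothetical, and letting ties contribute $0$ is an assumption.

```python
from typing import Sequence


def observed_cost_difference(k: Sequence[int], n: Sequence[int],
                             c: Sequence[float]) -> float:
    """Y = cost(m1) - cost(m0), where m0 always predicts 0 and m1 always
    predicts 1 (after recoding), over the questions where they differ.

    Each question adds +c_q when 2*k_q < n_q and -c_q when 2*k_q > n_q.
    Ties add 0 here, which is an assumption of this sketch.
    """
    y = 0.0
    for k_q, n_q, c_q in zip(k, n, c):
        if 2 * k_q < n_q:
            y += c_q
        elif 2 * k_q > n_q:
            y -= c_q
    return y


# Made-up counts: 4 questions, 40 respondents each, unit costs.
k = [12, 30, 8, 15]
n = [40, 40, 40, 40]
c = [1.0, 1.0, 1.0, 1.0]
print(observed_cost_difference(k, n, c))  # positive values favor m0
```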

At this point many solutions are possible, including a Bayesian one (upon adopting priors for the parameters $p_q$), an exact Frequentist one (using the Binomial distributions of the $k_q$), and an approximate Frequentist one. I'll sketch the last as an illustration.
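To give a flavor of the Bayesian option, here is a rough Monte Carlo sketch. It assumes independent uniform Beta(1, 1) priors on each $p_q$, so that each posterior is Beta$(k_q + 1, n_q - k_q + 1)$; the prior, the function name, and the counts are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence


def posterior_eta_samples(k: Sequence[int], n: Sequence[int],
                          c: Sequence[float], draws: int = 10000,
                          seed: int = 0) -> List[float]:
    """Draw from the posterior of eta = sum_q I(p_q < 1/2) c_q, I in {+1, -1}.

    Assumes independent Beta(1, 1) priors on the p_q (an assumption).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        eta = 0.0
        for k_q, n_q, c_q in zip(k, n, c):
            p_q = rng.betavariate(k_q + 1, n_q - k_q + 1)
            eta += c_q if p_q < 0.5 else -c_q
        samples.append(eta)
    return samples


samples = posterior_eta_samples([12, 30, 8, 15], [40] * 4, [1.0] * 4)
# Posterior probability that m1 costs more than m0 (eta > 0).
print(sum(s > 0 for s in samples) / len(samples))
```

The fraction of draws with $\eta \gt 0$ estimates the posterior probability that $m_0$ is the better model.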

Estimate $p_q$ as $\hat{p}_q = k_q/n_q$. Use this to compute (via the theory of the Binomial distribution) $\hat{\pi}_q$, the estimated value of $\Pr(2 k_q \lt n_q)$. With the $\pm 1$ coding of $I$, the random variable $I(2 k_q \lt n_q) c_q$ equals $c_q$ with probability $\hat{\pi}_q$ and $-c_q$ otherwise, so its estimated mean is $(2\hat{\pi}_q - 1) c_q$ and its estimated variance is $4\hat{\pi}_q(1 - \hat{\pi}_q) c_q^2$. Consequently, treating the questions as independent, we are supposing $Y$ has mean

$$\hat{\eta} = \sum_q (2\hat{\pi}_q - 1) c_q$$

and variance

$$\hat{\sigma}^2 = \sum_q 4\hat{\pi}_q(1 - \hat{\pi}_q) c_q^2.$$

Construct the approximate $1-\alpha$ confidence interval for the cost difference

$$[\hat{\eta} - Z_{\alpha/2}\hat{\sigma}, \hat{\eta} + Z_{\alpha/2}\hat{\sigma}]$$

by taking $Z_{\alpha/2}$ to be the $1-\alpha/2$ quantile of the standard normal distribution. When that interval does not cover $0$ we will conclude (with $1-\alpha$ confidence) that one of the models $m_0$ or $m_1$ is superior (the one with the lower estimated cost, obviously).
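The approximate interval can be sketched in a few lines. This sketch uses the per-question moments implied by the $\pm 1$ coding of $I$, namely mean $(2\hat{\pi}_q - 1)c_q$ and variance $4\hat{\pi}_q(1-\hat{\pi}_q)c_q^2$, and treats the questions as independent; both are assumptions of the sketch. The function names and the counts are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist
from typing import Sequence, Tuple


def prob_minority(n: int, p: float) -> float:
    """Pr(2K < n) for K ~ Binomial(n, p), summed from the exact pmf."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(n + 1) if 2 * j < n)


def approx_cost_difference_ci(k: Sequence[int], n: Sequence[int],
                              c: Sequence[float],
                              alpha: float = 0.05) -> Tuple[float, float, float]:
    """Return (lower, estimate, upper) for the cost difference.

    pi_q is Pr(2K < n_q) evaluated at p_q = k_q / n_q. Each question
    contributes mean (2*pi_q - 1)*c_q and variance 4*pi_q*(1 - pi_q)*c_q**2.
    """
    eta_hat = 0.0
    var_hat = 0.0
    for k_q, n_q, c_q in zip(k, n, c):
        pi_q = prob_minority(n_q, k_q / n_q)
        eta_hat += (2 * pi_q - 1) * c_q
        var_hat += 4 * pi_q * (1 - pi_q) * c_q**2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(var_hat)
    return eta_hat - half_width, eta_hat, eta_hat + half_width


# Made-up counts: 6 questions on which the two models differ.
k = [12, 30, 8, 15, 17, 22]
n = [40] * 6
c = [1.0] * 6
print(approx_cost_difference_ci(k, n, c))
```

If the printed interval excludes $0$, the sign of the estimate indicates which model has the lower estimated cost.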

These approximations have some obvious problems, but they can serve us well when (i) the $c_q$ don't vary much; (ii) none of the $p_q$ is very close to $0$ or $1$; and (iii) the model predictions differ on a "substantial" number of questions (perhaps 5 or more, but it depends on the sizes of the $p_q$ and $c_q$).

The three-model case involves three dependent comparisons. This will require some protection against the multiple comparison problem, perhaps with a Bonferroni adjustment of $\alpha$.
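A Bonferroni adjustment simply divides $\alpha$ by the number of comparisons before looking up the normal quantile. A minimal sketch, with a hypothetical function name:

```python
from statistics import NormalDist


def bonferroni_z(alpha: float, n_comparisons: int) -> float:
    """Two-sided normal critical value after a Bonferroni adjustment."""
    adjusted_alpha = alpha / n_comparisons
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)


# Three pairwise comparisons among three models at overall alpha = 0.05.
print(bonferroni_z(0.05, 1))  # unadjusted, about 1.96
print(bonferroni_z(0.05, 3))  # adjusted, about 2.39
```

Each of the three intervals is then built with the larger critical value, which makes them wider.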

I'm finding the explanation very helpful and easy to follow. Also it addresses exactly what I'm trying to do, however, I'm not familiar with the notation you use. Can you maybe give it another 5 minutes and explain how to calculate the cost and the confidence interval with some other notation that I could understand better? I'm specifically interested in how you arrive at $\eta$ and the confidence interval. Maybe if you just explain what `$`, `\`, `hat`, it, etc mean? thanks again! — upabove, Aug 17 '11 at 15:16
@Daniel Hats are conventional notations to distinguish properties of a sample (estimates made from the observations) from theoretical properties such as frequencies in the true population. You can't calculate the cost: that's a decision you have to make based on the questions themselves. This method of developing a confidence interval is routine, as exemplified by a [Wikipedia example](http://en.wikipedia.org/wiki/Confidence_interval#Practical_example). The probability calculations are based on the Bernoulli and [Binomial distributions](http://en.wikipedia.org/wiki/Binomial_distribution). — whuber, Aug 17 '11 at 15:25
but if each question is of equal weight and I specify the cost as 1 for each deviation then what we're looking for is basically the overall deviation between the model and the observations right? — upabove, Aug 17 '11 at 15:29
@Daniel Partly. We're comparing models to models. It's a matter of finding questions where two models differ and then deciding which one is more likely to be right. When all costs are equal, you compare two models simply by counting the questions where each seems to be correct. I have taken it for granted that when a model predicts "blue" and the majority of answers *in the population* are "blue," then that model is better than one predicting "red." Within the *sample* of 16 participants, though, it can happen by chance that the majority of answers are "red." That's what the CI is for. — whuber, Aug 17 '11 at 15:36
the sample is larger than 16. 16 is the number of answers each participant answers / number of predictions the model makes. So basically the fit of the model is just how many correct predictions it makes which will be "eta"?. Then I just estimate the mean and variance and calculate the confidence interval from this right? — upabove, Aug 17 '11 at 15:40
Yes. $\eta$ is the true differential cost of the two models, using the costs you would compute if you could question everyone in the population. (Sorry about my confusion between numbers of participants and numbers of questions.) — whuber, Aug 17 '11 at 15:45
but since 1 or 0 is not correct or incorrect, I don't see why the question of how we specify the cost should be interesting. Basically I just want to specify 1 as the cost so that I can look at the difference between the model and the observations. For example I observe 0,1,1, and the model predicts 0,0,1 then the fit is: 2 (it makes 2 correct predictions out of 3) then what exactly am I trying to find a confidence interval for? — upabove, Aug 17 '11 at 16:08
@Daniel let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/1096/discussion-between-whuber-and-daniel) — whuber, Aug 17 '11 at 16:09
Following up on your suggestion: In the case when I only have 1 model and I'm comparing it to the observed values. With a cost of 1 if the model receives a score 16 out of 16, so it correctly predicts the observations 100% how can I proceed in statistically showing this? — upabove, Aug 24 '11 at 11:26
@Daniel I don't understand your followup question. What precisely do you want to "statistically show?" — whuber, Aug 24 '11 at 13:35
Well I only want to compare one model to the observed values. Then I find that the model predicted 100%. How do I summarize this in a more formal manner. Do I just say that it predicted 100%? — upabove, Aug 24 '11 at 14:29
maybe I can put 100% for the model predictions and for the observed values put in the proportion choosing it and then just run a t.test? — upabove, Aug 24 '11 at 14:34
@Daniel Are you saying that *every* participant consistently answered every question as predicted by the model? That's what 100% means to me. — whuber, Aug 24 '11 at 14:47
No. I mean that I looked at aggregate responses for each question and determined the majority. Then compared it to the majority predicted by the model. — upabove, Aug 24 '11 at 14:51
@Daniel The approximate method in my reply doesn't apply, but an analogous approach based on the binomial distributions of the responses can obtain an accurate confidence interval for the model prediction, *assuming* answers to each question are mutually independent. — whuber, Aug 24 '11 at 15:04
you mean calculating the mean and variance for the observed questions? — upabove, Aug 24 '11 at 15:10

score 2 · Answer 2 · answered Aug 16 '11 at 16:03

If you are comparing your predictions to the actual variable, you could make a confusion matrix to assess the performance of each model.
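A confusion matrix for 0/1 data is just a 2x2 table of counts. Here is a small sketch in plain Python, using the three-question example from the comments (observed 0, 1, 1 against predicted 0, 0, 1); the function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(observed: Sequence[int],
                     predicted: Sequence[int]) -> List[List[int]]:
    """2x2 table: rows are observed 0/1, columns are predicted 0/1."""
    counts = Counter(zip(observed, predicted))
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]


observed = [0, 1, 1]
predicted = [0, 0, 1]
table = confusion_matrix(observed, predicted)
accuracy = (table[0][0] + table[1][1]) / len(observed)
print(table, accuracy)
```

The diagonal holds the matches, so the proportion of matches is the diagonal sum divided by the number of questions.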

answered Aug 16 '11 at 16:03 by Zach

something along these lines might be fine, if there's nothing else in the significance testing domain I can use. But if I go with this confusion matrix it might even be easier to just look at the deviations, mark it as 1 where it matches and 0 where it doesn't and then just calculate the % of matches. – upabove Aug 16 '11 at 16:14
I've edited my question to add more information – upabove Aug 16 '11 at 16:16

score 1 · Answer 3 · answered Aug 16 '11 at 16:58

Perhaps you want to arrive at a confidence interval for the probability $\theta$ that the predicted variable is the same as the observed one. That will give you an estimate (the midpoint of the interval) of how good the predictions are, and tell you how confident (confidence level) you can be that the accuracy lies in a certain range (the interval width).
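One way to do this for a small number of matches is a Wilson score interval. The sketch below treats the 16 match/mismatch outcomes as independent Bernoulli trials with a common $\theta$, which is a strong assumption; the choice of the Wilson interval, the function name, and the count of 12 matches are all illustrative.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(matches: int, total: int,
                    alpha: float = 0.05) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = matches / total
    denom = 1 + z * z / total
    centre = (p_hat + z * z / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total
                          + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: the model matches the observed answer on 12 of 16 questions.
print(wilson_interval(12, 16))
```

Unlike the simple normal approximation, this interval stays inside $[0, 1]$ even when all 16 predictions match.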

answered Aug 16 '11 at 16:58 by highBandWidth

yes that could be a solution. But how exactly can I do that for 16 0s and 1s? – upabove Aug 17 '11 at 09:49
