
I have two variables:

a: 12,13,15,10,9,8
b: 15,15,15,15,15,15

Variable 'a' gives the aggregate number of participants who chose a particular answer (0 or 1) for each case (6 cases in this example).

So, for example, in the 1st case 12 out of 15 participants chose 1, in the 2nd case 13 out of 15, in the 3rd 15 out of 15, etc.

Variable 'b' gives the predicted behavioral response for each case; here it predicts that in every case the majority of participants should select option 1.

To compare 'a' to 'b' I calculated the difference between the aggregate responses for each case:

diff: 3,2,0,5,6,7 which is a total of 23
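
For concreteness, a minimal R sketch of that calculation (the variable names are mine):

a <- c(12, 13, 15, 10, 9, 8)    # observed counts choosing option 1 per case
b <- c(15, 15, 15, 15, 15, 15)  # predicted counts (everyone chooses option 1)

d <- b - a   # per-case differences: 3 2 0 5 6 7
sum(d)       # total score: 23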

I would like to compute a confidence interval for this score.

upabove
  • @whuber I'll ask it here separately to have a more focused discussion. I made it as explicit as possible here. – upabove Aug 24 '11 at 15:52
  • '12 participants chose 1' out of how many? Do all the 6 numbers refer to the same participants? Do you truly want the sum of differences, even if some are negative and cancel out the positive differences? – Aniko Aug 24 '11 at 17:06
  • @Aniko This question does not correctly represent what's really going on. For background, consult the [previous question](http://stats.stackexchange.com/questions/14355/comparing-two-binary-distributions) – whuber Aug 24 '11 at 17:13
  • Daniel, this question misleads your readers. It will therefore produce misleading answers, through no fault of theirs. Some things they need to know are that (a) you have not disclosed the total number of participants, which is a crucial statistic (maybe it's 15?), and (b) the model does not predict the number of participants, it merely predicts which answer should be selected by the *majority* of participants. By recoding the answers as necessary, we can assume that answer "1" is the model prediction in every case. *Thus 'b' is entirely irrelevant to the question.* – whuber Aug 24 '11 at 17:17
  • @whuber I've edited the question – upabove Aug 24 '11 at 17:26
  • Thanks, but it's still deceptively phrased: 'b' is not a variable at all. Succinctly: you observe the outcome $(a_i)$ of a vector $(X_i)$ of independent Binomial$(n,p_i)$ distributions ($n$=15) and you seek a confidence interval for $n \sum_i (1-p_i)$. – whuber Aug 24 '11 at 17:31

1 Answer


Note: Thinking about it again, if you use the normal approximation (as I do below), deriving the mass function as done below is a bit of overkill - all you need are the first two moments of $Q = \sum_{i} Z_{i}$, which, by properties of means and variances, are simply $nE(X_{i}-Y_{i})$ and $n{\rm var}(X_{i}-Y_{i})$. The mass function would still be needed, though, if you wanted to do an exact test, which would require the CDF of $Q$.
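
For independent binomials (as set up below, with $Y_{i} \sim {\rm Binomial}(n_{1}, p_{1})$ and $X_{i} \sim {\rm Binomial}(n_{2}, p_{2})$), those two moments also have closed forms, so a sketch like the following avoids the mass function entirely (the helper names EXY and VXY are mine):

# closed-form moments of X - Y for independent binomials
EXY <- function(n1, n2, p1, p2) n2*p2 - n1*p1
VXY <- function(n1, n2, p1, p2) n2*p2*(1 - p2) + n1*p1*(1 - p1)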

So, you're basically asking about the distribution of $Q = \sum_{i} Z_{i}$, where $Z_{i} = X_{i} - Y_{i}$, so that you can form a confidence interval for the expected value of $Q$, $E(Q)$, where $Y_{i} \sim {\rm Binomial}(n_{1}, p_{1})$ and $X_{i} \sim {\rm Binomial}(n_{2}, p_{2})$ for each $i$, right? It wasn't clear to me whether $n_{1} = n_{2}$, so I will not assume this is true.

From the law of total probability, we have, for each $i$,

$$ P(X_{i} - Y_{i} = k) = E_{Y} \Big( P(X_{i}-Y_{i} = k \mid Y_{i}) \Big) = \sum_{y=0}^{\infty} P(Y_{i}=y)P(X_{i} = k+y) $$

so

$$ P(X_{i} - Y_{i} = k) = \sum_{y=0}^{n_{1}} {n_{1} \choose y} p_{1}^{y} (1-p_{1})^{n_{1} - y} {n_{2} \choose k+y} p_{2}^{k+y} (1-p_{2})^{n_{2} - k - y} $$

where ${n_{2} \choose k+y}$ is defined to be 0 when $k+y > n_{2}$ or $k+y < 0$. Using this formula, one can easily calculate the mean and variance of $X_{i} - Y_{i}$ numerically, using, for example, this R code:

# mass function of X - Y at k, where Y ~ Binomial(n1, p1) and
# X ~ Binomial(n2, p2) are independent
massfcn <- function(k,n1,n2,p1,p2)
{
   s <- 0
   for(y in 0:n1)
   {
      # P(Y = y)
      term1 <- choose(n1,y)*(p1^y)*((1-p1)^(n1-y))
      # P(X = k + y); choose() returns 0 outside 0, ..., n2
      term2 <- choose(n2,k+y)*(p2^(k+y))*((1-p2)^(n2-y-k))
      s <- s + term1*term2
   }
   return(s)
}
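
As a quick sanity check (with arbitrary illustrative values $n_{1} = n_{2} = 15$, $p_{1} = 0.8$, $p_{2} = 0.9$), the mass function should sum to 1 over the support $-n_{1}, \dots, n_{2}$ of $X - Y$:

# total probability over the support; should be (numerically) 1
sum(sapply(-15:15, massfcn, n1 = 15, n2 = 15, p1 = 0.8, p2 = 0.9))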

# expected value of X-Y for given n1,n2,p1,p2
E <- function(n1,n2,p1,p2)
{
   # range of possible values for X-Y
   s <- seq(-n1,n2,by=1)

   mn <- 0
   for(i in seq_along(s)) mn <- mn + massfcn(s[i],n1,n2,p1,p2)*s[i]

   return(mn)
}

# variance of X-Y for given n1,n2,p1,p2
VAR <- function(n1,n2,p1,p2)
{
   # range of possible values for X-Y
   s <- seq(-n1,n2,by=1)

   # E((X-Y)^2) - E(X-Y)^2
   mn <- 0
   for(i in seq_along(s)) mn <- mn + massfcn(s[i],n1,n2,p1,p2)*(s[i]^2)

   return(mn - E(n1,n2,p1,p2)^2)
}
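
To check these numerical moments, you can compare them against the closed forms for independent binomials mentioned in the note above (again with illustrative parameter values):

E(15, 15, 0.8, 0.9)        # numerical mean
15*0.9 - 15*0.8            # closed form: 1.5
VAR(15, 15, 0.8, 0.9)      # numerical variance
15*0.9*0.1 + 15*0.8*0.2    # closed form: 3.75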

You can use the functions above to obtain $\mu = E(X_{i} - Y_{i})$ and $\sigma^{2} = {\rm var}(X_{i} - Y_{i})$. Then, if the $Z_{i}$ are independent across $i$, it follows that $E(Q) = n \mu$ and ${\rm var}(Q) = n \sigma^{2}$. What you're looking to test is equivalent to testing whether $\mu = 0$ - scaling by $n$ produces the sample mean, $ \overline{Z} = \frac{1}{n} Q$, which has mean $\mu$ and variance $\sigma^{2}/n$. Therefore, rejecting $\mu = 0$ when the interval

$$ (\overline{Z} - 1.96\sqrt{\sigma^{2}/n}, \overline{Z} + 1.96\sqrt{\sigma^{2}/n}) $$

does not contain 0 is an approximate $5\%$-level test of that null hypothesis (approximate in the sense that the central limit theorem is used to form this interval). Technically, since $p_{1}$ and $p_{2}$ are not known and are estimated from the data, the $\sigma^{2}$ in the above equation should be replaced by $\hat{\sigma}^{2}$, but the procedure and the large-sample distribution remain the same.

# example
# some (n = 50) fake Binomial(12, .3) and Binomial(12, .4) data
x <- rbinom(50, 12, .3)
y <- rbinom(50, 12, .4)

# numbers of trials
n1 <- 12
n2 <- 12

# sample success proportions
p1 <- mean(y)/n1
p2 <- mean(x)/n2

# estimate E(X-Y)
MEAN <- E(n1,n2,p1,p2)

# estimated variance of the sample mean difference: var(X-Y)/n with n = 50
VARIANCE <- VAR(n1,n2,p1,p2)/50

# approximate 95% confidence interval
CI <- c(MEAN - 1.96*sqrt(VARIANCE), MEAN + 1.96*sqrt(VARIANCE))
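
As a rough cross-check on this sketch, the same data give a direct large-sample interval from the raw per-observation differences, which should be close to CI (mean(z) coincides with MEAN up to numerical error, and sd(z)^2/50 is the empirical counterpart of VARIANCE):

# direct CLT interval from z = x - y
z <- x - y
c(mean(z) - 1.96*sd(z)/sqrt(50), mean(z) + 1.96*sd(z)/sqrt(50))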
Macro
  • I'm sorry but I do not understand all this notation, can you please explain it in simple words? – upabove Aug 24 '11 at 17:25
  • Your question seemed to reduce to testing whether the binomial outcomes had different means in each group - the final equation just gives a confidence interval for that difference - $\overline{Z}$ is the mean difference between the groups, $\sigma^{2}$ (which can be calculated using the R function I provided, or using properties of variances) is the variance of the difference, and $n$ is the sample size. – Macro Aug 24 '11 at 17:29
  • @Macro and in the functions what is n1, n2, p1, p2,y, k ? – upabove Aug 24 '11 at 17:32
  • The normal approximation will be poor, although it gives some guidance. – whuber Aug 24 '11 at 17:32
  • Daniel, you've said that you're envisioning each group as arising from binomial random variables, which are parameterized by the number of trials $n$ and the success probabilities $p$. $n_{1}, n_{2}$ are the numbers of trials in each group, $p_{1},p_{2}$ are the success probabilities. The numbers of trials should be known a priori (by design) but the success probabilities, of course, are estimated from the data. $k$ is the point at which you're evaluating the mass function. So, massfcn(5,10,10,.5,.3) is the probability that a binomial($10,.3$) is 5 larger than a binomial($10,.5$) – Macro Aug 24 '11 at 17:34
  • @Macro the number of trials is the same for both models. There are 8 questions for which 15 people answered 0 or 1. The model predicts 0 or 1 for each of the 8 questions. Why are there two $n$'s and two $p$'s then? Thank you for your help – upabove Aug 24 '11 at 17:36
  • It wasn't clear to me that there were the same number of trials in each group. That's simply the special case where $n_{1} = n_{2}$. This is actually preferable since a difference in the means of the two groups must, then, be due to a difference in the success probabilities, $p_{1}, p_{2}$ rather than the numbers of trials. – Macro Aug 24 '11 at 17:38
  • @Macro by success probability you mean the proportion choosing it? – upabove Aug 24 '11 at 17:38
  • I mean the proportion of successes. If 15 questions were asked and the average person said yes to 9 of them, the sample proportion is $9/15$. I added a simulated example to the bottom of the post. Hopefully that helps a bit. – Macro Aug 24 '11 at 17:48
  • @Macro this is great help! Just one thing: it's actually not success/failure but a response that is red/blue, coded as 1/0. So the observed values are just the proportion of people choosing, say, red out of 12. And the model predicts what the majority should choose. So my only problem is how I can determine the $p_2$ for the model - is it just 1 then? – upabove Aug 24 '11 at 18:01
  • @Macro let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/1178/discussion-between-daniel-and-macro) – upabove Aug 24 '11 at 18:14
  • @Macro when you return and have time can you please answer the last question I asked in the chat? Thank you very much for all your help – upabove Aug 24 '11 at 18:57
  • @Macro basically the question is: I have 8 questions and 19 respondents each. Obviously $p_1$ changes for each question; how can I use this method if I have 8 different $p$'s? – upabove Aug 24 '11 at 21:45