1

I recently posted https://stats.stackexchange.com/questions/153335/multinomial-mixture-model but got no answer. Hence I take the liberty to present what I came up with on my own and a follow up question:

I have two processes which generate a series of A,C,G,T's. Let's say I know the probabilities for each process to generate a particular base.

prob1 = c( 1/100,1/100,97/100,1/100 )
prob2 = c( 1/4,  1/4,  1/4,   1/4 )

I can test a random variable observed with a chi-squared if it can be explained by one of the processes (e.g., process1) by

N=length(observed)
chi2_statistic <- sum( (observed-(N*prob1))^2/(N*prob1) )
pchisq(chi2_statistic,df=4-1, lower.tail=FALSE)

Here, the number of degrees of freedom is df=4-1 as I learnt from several sources.

But to test if my variable observed can be better explained by the mixture of process1 and process2, I came up with the following procedure: I test all values of f=seq(1/100,99/100,1/100) if observed can be explained by a mixture of f process1 and 1-f process2

f <- 1/3
prob_mix <- prob1 * (1-f) + prob2 * (f)

Again I can use this merged or mixed probabilities in a chi-squared test if this explains my observed variable better.

chi2_statistic <- sum( (observed-(N*prob_mix))^2/(N*prob_mix) )
pchisq(chi2_statistic,df=4-2, lower.tail=FALSE)

Two questions:

  1. Do you think such an procedure is statistically sound? (I tested it with simulated data and do like the results)

  2. For the pchisq: what is the appropriate number of degrees of freedom since I have still 4 categories but somehow merged probabilities. My intuition tells me that the number of degrees of freedom is larger, but I'd like to have favoured to explain my data without this mixture process; therefore I should include a smaller df (therefore I used 2 so far). Can somebody deduce the precise df?

fabian
  • 21
  • 2
  • Your procedure has several problematic elements. These are addressed generally in my answer at http://stats.stackexchange.com/questions/16921/how-to-understand-degrees-of-freedom/17148#17148: (1) You must use maximum likelihood to estimate `prob_mix` (or, equivalently, $f$); (2) when you do so, the degrees of freedom will be $1$ (because it takes exactly one real parameter to specify the fit); (3) "liking" the results is not a legitimate way to evaluate any statistical procedure; and (4) this doesn't determine whether your model is better than the original. – whuber May 26 '15 at 17:00

0 Answers0