I recently posted https://stats.stackexchange.com/questions/153335/multinomial-mixture-model but got no answer. Hence I take the liberty to present what I came up with on my own and a follow up question:
I have two processes which generate a series of A,C,G,T's. Let's say I know the probabilities for each process to generate a particular base.
prob1 = c( 1/100,1/100,97/100,1/100 )
prob2 = c( 1/4, 1/4, 1/4, 1/4 )
I can test a random variable observed
with a chi-squared if it can be explained by one of the processes (e.g., process1) by
N=length(observed)
chi2_statistic <- sum( (observed-(N*prob1))^2/(N*prob1) )
pchisq(chi2_statistic,df=4-1, lower.tail=FALSE)
Here, the number of degrees of freedom is df=4-1
as I learnt from several sources.
But to test if my variable observed
can be better explained by the mixture of process1 and process2, I came up with the following procedure:
I test all values of f=seq(1/100,99/100,1/100)
if observed
can be explained by a mixture of f
process1 and 1-f
process2
f <- 1/3
prob_mix <- prob1 * (1-f) + prob2 * (f)
Again I can use this merged or mixed probabilities in a chi-squared test if this explains my observed
variable better.
chi2_statistic <- sum( (observed-(N*prob_mix))^2/(N*prob_mix) )
pchisq(chi2_statistic,df=4-2, lower.tail=FALSE)
Two questions:
Do you think such an procedure is statistically sound? (I tested it with simulated data and do like the results)
For the
pchisq
: what is the appropriate number of degrees of freedom since I have still 4 categories but somehow merged probabilities. My intuition tells me that the number of degrees of freedom is larger, but I'd like to have favoured to explain my data without this mixture process; therefore I should include a smaller df (therefore I used 2 so far). Can somebody deduce the precise df?