I have just estimated the parameters of a mixture of two Gaussians with different means and different sigmas, and I would like to test whether the data fit the explicit form of the mixture well. Do I necessarily need to simulate data from the mixture, or how else can I test the goodness of fit? I am using the mixtools package.
-
Question flagged for migration, so you don't have to add a new question to that site. – Joris Meys May 21 '12 at 10:04
-
A nonparametric option is the [Kolmogorov-Smirnov test](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), but there are probably some ad hoc tools for the case of Gaussian mixtures. – May 21 '12 at 15:57
1 Answer
You can write a function that calculates the relevant statistic for a given test under the null hypothesis, based on the outputs from, say, normalmixEM, then do the test using that. For example, for the Kolmogorov-Smirnov test, we need the CDF of the fitted mixture given a set of parameters:
# CDF of mixture of two normals
pmnorm <- function(x, mu, sigma, pmix) {
  pmix[1]*pnorm(x, mu[1], sigma[1]) + (1-pmix[1])*pnorm(x, mu[2], sigma[2])
}
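As a quick sanity check that pmnorm behaves like a proper CDF, we can evaluate it at a few points (the parameter values here are just illustrative):

# Sanity check: pmnorm should run from 0 to 1, and equal 0.5 at the
# point of symmetry of an equal-weight N(0,1)/N(2,1) mixture
pmnorm(-Inf, mu=c(0,2), sigma=c(1,1), pmix=c(0.5,0.5))  # 0
pmnorm( Inf, mu=c(0,2), sigma=c(1,1), pmix=c(0.5,0.5))  # 1
pmnorm(   1, mu=c(0,2), sigma=c(1,1), pmix=c(0.5,0.5))  # 0.5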
We then run the K-S test in the usual way:
# Sample run
library(mixtools)                 # provides normalmixEM
x <- c(rnorm(50), rnorm(50, 2))   # equal mixture of N(0,1) and N(2,1)
foo <- normalmixEM(x)
test <- ks.test(x, pmnorm, mu=foo$mu, sigma=foo$sigma, pmix=foo$lambda)
test
One-sample Kolmogorov-Smirnov test
data: x
D = 0.0559, p-value = 0.914
alternative hypothesis: two-sided
Keep in mind that we estimated the parameters from the same data we are using to do the test, which biases the test towards failure to reject H0: the fitted CDF is artificially close to the empirical CDF, so the nominal p-value is too large.
We can overcome that bias to some extent via a parametric bootstrap: generate many samples from a mixture of normals parameterized by the estimates from normalmixEM, then estimate the parameters of each sample and calculate the test statistic for each sample using its own estimated parameters. Under this construction, the null hypothesis is always true. In the code below, I'm helping the EM algorithm out by starting it at the true parameters for each bootstrap sample (the estimates from the original data). That is also cheating a little, as it makes the EM algorithm more likely to find values near the true values than it was with the original sample, but it greatly reduces the number of error messages.
# Bootstrap estimation of the distribution of the K-S statistic under H0
N <- length(x)
ks.boot <- rep(0, 1000)
for (i in 1:1000) {
  # Draw component labels, then a sample from the fitted mixture
  z <- rbinom(N, 1, foo$lambda[1])
  x.b <- z*rnorm(N, foo$mu[1], foo$sigma[1]) + (1-z)*rnorm(N, foo$mu[2], foo$sigma[2])
  # Re-fit the mixture to the bootstrap sample, starting at its true parameters
  foo.b <- normalmixEM(x.b, maxit=10000, lambda=foo$lambda, mu=foo$mu, sigma=foo$sigma)
  # K-S statistic calculated with the re-estimated parameters
  ks.boot[i] <- ks.test(x.b, pmnorm, mu=foo.b$mu, sigma=foo.b$sigma, pmix=foo.b$lambda)$statistic
}
# Bootstrap p-value: fraction of null statistics at least as large as the observed one
mean(test$statistic <= ks.boot)
[1] 0.323
So instead of a p-value of 0.914, we get a p-value of 0.323. Interesting, but not particularly important in this case.
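If you want to see where the observed statistic falls in the bootstrap null distribution, a quick plot does it (a minimal sketch, using nothing beyond base R graphics and the objects created above):

# Observed K-S statistic against the bootstrap null distribution
hist(ks.boot, breaks=30, main="Bootstrap K-S statistics", xlab="D")
abline(v=test$statistic, col="red", lwd=2)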

-
Does anything change in the following case? I have 100 simulated observations from a mixture of two normals, so I know the means and variances of the "true" mixture. I generate 20000 samples from the predictive distribution. How can I test whether those samples are from the specific mixture? I believe I have to skip the foo variable and substitute the known means and variances in the ks.test, right? Or am I missing something bigger here? – Christos Nov 20 '13 at 14:48
-
@Christos - Your intuition is correct: better to substitute the known means and variances in the ks.test, and, if you do so, you don't need to bootstrap the K-S test statistic either. – jbowman Nov 20 '13 at 14:53
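For concreteness, a minimal sketch of that suggestion; mu.true, sigma.true, lambda.true, and y are hypothetical placeholders for the known mixture parameters and the 20000 predictive samples:

# K-S test against the known mixture: no EM fit and no bootstrap needed
ks.test(y, pmnorm, mu=mu.true, sigma=sigma.true, pmix=lambda.true)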
-
Could you please explain how $p=0.914$ implies that the data come from the mixture distribution? I thought failure to reject H0 did not mean that. – Jacob May 05 '14 at 11:58
-
@Jacob - where did I say that a p-value of 0.914 implies that the data is from a mixture distribution? – jbowman May 05 '14 at 17:30
-
@jbowman: I'm sorry, you obviously don't. However, is there a statistical test to establish equivalence? I.e. given a CDF and a small sample of data, is there a statistical test which can prove that the data follows the CDF? – Jacob May 05 '14 at 20:33
-
@Jacob - unfortunately, no. Consider for example data that is drawn from a Normal(0,1) distribution. Now consider an alternative where the data is drawn from a Normal(0, 1 + 1.0e-9999999999999999999999999) distribution. You would need a large, not small, sample size to distinguish between the two. And no matter how large your sample, one can always construct an even smaller difference between the two CDFs that your sample won't be large enough to detect. – jbowman May 06 '14 at 17:13
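To make that concrete, here is a small sketch (the value of eps is arbitrary) computing the largest gap between the two CDFs, which is exactly the quantity the K-S test would need to detect:

# Maximum gap between the CDFs of N(0,1) and N(0,(1+eps)^2)
eps <- 1e-8
x.grid <- seq(-10, 10, length.out=1e5)
max(abs(pnorm(x.grid, 0, 1) - pnorm(x.grid, 0, 1 + eps)))  # roughly 2.4e-9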
-
@jbowman: Thank you for explaining this to me. Would things be different if I wanted to test the equality of a statistic, e.g., a nonparametric equivalence-of-means test? I'm trying to find what I need in Stefan Wellek's "Testing Statistical Hypotheses of Equivalence"; do you think that's the right direction? – Jacob May 07 '14 at 18:17