I have a vector of numeric values. My hypothesis is that this vector is a mixture drawn from two Gaussian distributions (i.e. k = 2). However, it is possible that only one Gaussian underlies my data (k = 1). I would like to answer this question in a data-driven manner, but I do not know the best method.

My thought was to compare the two models by calculating the BIC or AIC for each, and then performing a likelihood-ratio test.

  1. Should I include k as one of the parameters being estimated when I calculate BIC (i.e. {mu1, sd1, mu2, sd2, k} vs {mu1, sd1, k} for the 2-component and 1-component models respectively)?

  2. I'm using the mixtools package in R, and the normalmixEM() function does not seem to allow fitting a 1-component Gaussian (i.e. if I use k = 1 I get the error "arbmean and arbvar cannot both be FALSE").

  3. If using a likelihood-ratio test with AIC/BIC is not appropriate, is there a more appropriate solution to this problem?
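
On points 1 and 2, a minimal base-R sketch may help frame the question. It assumes the usual convention that k is fixed within each model and therefore not counted as an estimated parameter, while the mixing weight of the 2-component model is counted. The data vector `x` and the helper `bic()` are hypothetical names introduced for illustration; a single Gaussian has a closed-form maximum-likelihood fit, so no EM routine is needed for k = 1.

```r
# Hypothetical example data; replace with your own numeric vector.
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)
n <- length(x)

# Closed-form maximum-likelihood fit of a single Gaussian (no EM required).
mu_hat <- mean(x)
sd_hat <- sqrt(mean((x - mu_hat)^2))  # the MLE divides by n, not n - 1
loglik1 <- sum(dnorm(x, mean = mu_hat, sd = sd_hat, log = TRUE))

# BIC = -2 * logLik + (number of free parameters) * log(n).
bic <- function(loglik, n_params, n) -2 * loglik + n_params * log(n)

# k = 1: {mu, sd}                            -> 2 free parameters.
# k = 2: {mu1, sd1, mu2, sd2, mixing weight} -> 5 free parameters.
bic1 <- bic(loglik1, n_params = 2, n = n)
bic1
```

Under these assumptions the value should agree with `BIC(lm(x ~ 1))`, which fits the same intercept-only Gaussian model.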

Edit: I found a somewhat illuminating example here. This approach uses the mclust package to fit 1-component and 2-component Gaussian mixtures and then uses the model log-likelihoods to perform a likelihood-ratio test.
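
The 1 vs 2 component comparison described in the edit can be sketched roughly as follows. This is a base-R illustration with a hand-written EM loop, not the mclust or mixtools implementation; `fit_mix2()`, `mix_loglik()`, the simulated data, and the median-split starting values are all assumptions made for the example.

```r
# Log-likelihood of a two-component Gaussian mixture.
mix_loglik <- function(x, lambda, mu, sigma) {
  sum(log(lambda * dnorm(x, mu[1], sigma[1]) +
          (1 - lambda) * dnorm(x, mu[2], sigma[2])))
}

# Minimal EM for k = 2 (hypothetical helper, not a library function).
fit_mix2 <- function(x, max_iter = 500, tol = 1e-8) {
  # Crude starting values: split the data at the median.
  lower <- x[x <= median(x)]
  upper <- x[x > median(x)]
  lambda <- 0.5
  mu <- c(mean(lower), mean(upper))
  sigma <- c(sd(lower), sd(upper))
  ll_old <- mix_loglik(x, lambda, mu, sigma)
  ll_new <- ll_old
  for (i in seq_len(max_iter)) {
    # E-step: posterior probability that each point came from component 1.
    d1 <- lambda * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - lambda) * dnorm(x, mu[2], sigma[2])
    r <- d1 / (d1 + d2)
    # M-step: responsibility-weighted updates of weight, means, and sds.
    lambda <- mean(r)
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
               sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
    ll_new <- mix_loglik(x, lambda, mu, sigma)
    if (abs(ll_new - ll_old) < tol) break
    ll_old <- ll_new
  }
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = ll_new)
}

# Simulated two-component data (an assumption for illustration only).
set.seed(42)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 4, sd = 1))
n <- length(x)

# k = 1: closed-form Gaussian MLE.
loglik1 <- sum(dnorm(x, mean(x), sqrt(mean((x - mean(x))^2)), log = TRUE))

# k = 2: EM fit.
fit2 <- fit_mix2(x)

# Likelihood-ratio statistic and BIC (2 vs 5 free parameters).
lrt_stat <- 2 * (fit2$loglik - loglik1)
bic1 <- -2 * loglik1 + 2 * log(n)
bic2 <- -2 * fit2$loglik + 5 * log(n)
c(lrt = lrt_stat, bic1 = bic1, bic2 = bic2)
```

One caveat: the usual chi-square reference distribution for the likelihood-ratio statistic is not guaranteed to hold when testing the number of mixture components, so the statistic is often calibrated with a parametric bootstrap (simulate from the fitted k = 1 model, refit both models, and compare the observed statistic to the simulated ones).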

Maxim
Brandon
  • [This paper](https://arxiv.org/abs/0908.3428) and [this paper](http://www.tandfonline.com/doi/abs/10.1080/01621459.2012.695668) should be helpful; the authors also have an R package: [`MixtureInf`](https://cran.r-project.org/web/packages/MixtureInf/index.html) – Francis Dec 20 '17 at 13:26
  • Those are indeed helpful, many thanks! I will take a look at their implementation in the package. – Brandon Dec 20 '17 at 17:02

1 Answer

An alternative strategy is to test for normality. If your data come from a single Gaussian, the test should fail to reject the null hypothesis. Conversely, a statistically significant p-value is evidence against a single Gaussian, which is consistent with k > 1 (although any departure from normality, such as skewness or heavy tails, can also lead to rejection). This strategy can be extended to the multivariate case by performing PCA and testing each principal component separately.

Since you're working with R, I recommend you take a look at the nortest package.
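
As a rough sketch of this idea, the example below applies base R's `shapiro.test()` to simulated data and, if the nortest package happens to be installed, also runs its `ad.test()` and `lillie.test()`. The simulated vectors `single` and `mixture` are assumptions for illustration; substitute your own data.

```r
# Simulated data (illustrative only): one Gaussian vs a well-separated mixture.
set.seed(123)
single  <- rnorm(200, mean = 0, sd = 1)
mixture <- c(rnorm(100, mean = -3, sd = 1), rnorm(100, mean = 3, sd = 1))

# Shapiro-Wilk test from base R (package stats).
p_single  <- shapiro.test(single)$p.value
p_mixture <- shapiro.test(mixture)$p.value
c(single = p_single, mixture = p_mixture)

# Optional: Anderson-Darling and Lilliefors tests from nortest, if available.
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::ad.test(mixture))
  print(nortest::lillie.test(mixture))
}
```

A small p-value for `mixture` indicates a departure from normality; it does not by itself identify how many components there are.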