Mixture of Gaussians on Log of Data

Question

I am practicing Mixture of Gaussians and found the dataset below, snoq, which contains the precipitation amounts recorded in a US region, with NA values and days without precipitation removed.

snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)
snoqualmie.vector <- na.omit(unlist(snoqualmie)) # remove NA's and flatten
snoq <- snoqualmie.vector[snoqualmie.vector > 0] # days where precipitation was greater than 0

In the exercise code, the instructor fits a mixture of 2 Gaussians to the data with the below code (the plot.normal.components function is given at the end of my question):

if(!require("mixtools")) { install.packages("mixtools");  require("mixtools") }
snoq.k2 <- normalmixEM(snoq, k=2, maxit=100, epsilon=0.01)
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,
     xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")
lines(density(snoq),lty=2)
sapply(1:2,plot.normal.components,mixture=snoq.k2)

The dotted line is the kernel density estimate (of the empirical pdf) and the two black curves are the fitted Gaussians. [Figure: histogram of snoq with the kernel density estimate and the two fitted Gaussian components]

Now, I thought the distribution of snoq looks similar to an exponential distribution, so my first instinct here was to log-transform the data and then investigate what happens if I fit a Mixture of Gaussians to the log data rather than to the raw data as the instructor did:

log_snoq <-  log(snoq)
log_snoq.k3 = normalmixEM(log_snoq, k=3, maxit=100, epsilon=0.01) # does not converge!
plot(hist(log_snoq,breaks=101),col="grey",border="grey",freq=FALSE,
     xlab="Log Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")
lines(density(log_snoq),lty=2)
sapply(1:3,plot.normal.components,mixture=log_snoq.k3)

I chose k=3 in this case, just because the kernel density estimator showed 3 bumps on the dotted line (in the figure below), which I interpreted as three local maxima that can be modelled by 3 univariate Gaussians. However, during EM I get a warning that says WARNING! NOT CONVERGENT!, and the resulting Gaussian components look as below: (I do realise that the log transform does not produce a beautifully Gaussian-looking distribution, but the raw data itself looks less suitable to be modelled with a (mixture of) Gaussian(s) to me anyway.)

[Figure: histogram of log_snoq with the kernel density estimate and the three fitted Gaussian components]

My question is, in this particular example is it wrong to log-transform? Does this in general complicate the application of Mixture of Gaussians? Any recommendations / comments are welcome!

The function plot.normal.components:

plot.normal.components <- function(mixture,component.number,...) {
  curve(mixture$lambda[component.number] *
          dnorm(x,mean=mixture$mu[component.number],
                sd=mixture$sigma[component.number]), add=TRUE, ...)
}
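
Since plot.normal.components draws each weighted component separately, the curve that is directly comparable to the kernel density estimate is their sum. A minimal sketch of that sum, assuming only that the mixture object exposes lambda, mu and sigma as in the helper above; toy.mixture and mixture.density are hypothetical names, and the parameter values are made up for illustration:

```r
# Hypothetical stand-in for a fitted mixture: a list with the same
# lambda / mu / sigma fields that plot.normal.components reads.
# The numbers are invented, not estimates from the snoq data.
toy.mixture <- list(lambda = c(0.6, 0.4), mu = c(10, 60), sigma = c(5, 25))

# Overall mixture density: sum over components of lambda_k * N(x; mu_k, sigma_k).
mixture.density <- function(x, mixture) {
  total <- numeric(length(x))
  for (k in seq_along(mixture$lambda)) {
    total <- total + mixture$lambda[k] *
      dnorm(x, mean = mixture$mu[k], sd = mixture$sigma[k])
  }
  total
}

# Sanity check: each weighted component has area lambda_k, so the sum has area 1.
# A simple Riemann sum on a fine grid is enough for smooth Gaussian curves.
grid <- seq(-200, 400, by = 0.1)
area <- sum(mixture.density(grid, toy.mixture)) * 0.1
```

With a real fit, something like curve(mixture.density(x, mixture = snoq.k2), add = TRUE, lwd = 2) after the histogram call should overlay the total fitted density on the same plot.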

Sorry, it was already there, but at the bottom so probably not very visible — Zhubarb, Feb 04 '15 at 08:33
For a similar analysis as part of EDA for my dissertation research software, I wanted to automate mixture analysis as much as possible. Therefore, I used the `mclust` package to analytically determine the number of mixture components and then pass it to `normalmixEM()`, etc. You can take a look at [that module](https://github.com/abnova/diss-floss/blob/master/analysis/mixDist.R). This worked well for my data, but when I tried similar code on yours, it failed to detect the correct number of components (2 or 3) and detected only one. Not sure why. — Aleksandr Blekh, Feb 04 '15 at 15:11
@AleksandrBlekh, did you try it on the raw or log(data)? Also, have you seen that in the original code on the link I gave, the instructor calculates the final loglikelihoods of different component numbers by 2-fold cross validation (fitting the model trained on half the dataset to the other half). — Zhubarb, Feb 04 '15 at 15:29
I've tried on both (making appropriate adjustments). As for the cross-validation, I don't think that it's necessary for the approach to work (as I said, it worked well for my data). — Aleksandr Blekh, Feb 04 '15 at 15:35

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

Some comments, advice and answers:

Unless I'm missing something, I don't understand why you've used a different number of mixture components in the second and third blocks of code (3 vs. 2). However, looking at the output of summary() for the mixture object, it seems that the 2nd component is negligible, so I assume that you decided to ignore it, hence the change.
The "NOT CONVERGENT!" warning can be alleviated in many cases by increasing the number of iterations (for example, I've used 500 instead of 100 for my similar analysis). I've tried your code and it converges after 124 iterations, so maxit=200 should be enough.
I don't think that it's wrong (harmful) to log-transform in this case and it seems that it is even beneficial, as it exposes the structure of the data. However, we definitely need to be careful with log transformations: https://stats.stackexchange.com/a/130275/31372.
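
To make the point about the iteration cap concrete, here is a minimal sketch of a hand-rolled EM loop for a one-dimensional Gaussian mixture with the same two stopping controls, maxit and epsilon. It is not the mixtools implementation: fit.gmm.em, its starting values and the simulated data are all assumptions made for illustration.

```r
# Illustrative EM for a 1-D Gaussian mixture (NOT the mixtools code).
# Stops when the log-likelihood gain drops below epsilon or maxit is reached.
fit.gmm.em <- function(x, k, maxit = 100, epsilon = 0.01) {
  n <- length(x)
  # Crude deterministic start: means at evenly spaced quantiles of x.
  mu     <- as.numeric(quantile(x, probs = (1:k) / (k + 1)))
  sigma  <- rep(sd(x), k)
  lambda <- rep(1 / k, k)
  loglik.old <- -Inf
  for (iter in seq_len(maxit)) {
    # E-step: weighted component densities, one column per component.
    dens   <- sapply(seq_len(k), function(j) lambda[j] * dnorm(x, mu[j], sigma[j]))
    loglik <- sum(log(rowSums(dens)))
    if (loglik - loglik.old < epsilon) {
      return(list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik,
                  iterations = iter, converged = TRUE))
    }
    loglik.old <- loglik
    resp <- dens / rowSums(dens)   # posterior membership probabilities
    # M-step: update weights, means and standard deviations.
    nk     <- colSums(resp)
    lambda <- nk / n
    mu     <- colSums(resp * x) / nk
    sigma  <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
  }
  # Iteration cap hit: report the last log-likelihood that was evaluated.
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik.old,
       iterations = maxit, converged = FALSE)
}

# Simulated three-component data (made up; not the snoq data).
set.seed(42)
x <- c(rnorm(400, -1, 1), rnorm(400, 2, 0.7), rnorm(400, 4, 0.5))

fit.short <- fit.gmm.em(x, k = 3, maxit = 2)     # stops at the cap
fit.long  <- fit.gmm.em(x, k = 3, maxit = 5000)  # stops on the epsilon rule
```

In this sketch the only difference between the two fits is the cap, which is the situation described above: with mixtools the analogous change would be rerunning normalmixEM(log_snoq, k=3, maxit=200, epsilon=0.01).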

answered Feb 04 '15 at 12:37

Aleksandr Blekh

Thanks Aleksandr, just to reconfirm then, you agree: 1) log-transforming the data (before running the mixture of Gaussians) is a better (or at least justifiable) design choice given the e-pdf of this dataset; 2) there is no side effect in terms of EM convergence when we operate on log data (and the fact that I get a non-convergence warning is kind of a red herring). – Zhubarb Feb 04 '15 at 12:47
@Zhubarb: You're welcome. Well, I can confirm #1 (log transformation) for this particular case (data set) and to the best of my knowledge. However, I cannot confirm #2 (the absence of side effects). It makes sense IMHO to take a look at the E-M algorithm, just to be sure. Or to wait for other people's feedback. – Aleksandr Blekh Feb 04 '15 at 12:55
Re #2: the first thing I thought was potential loss of numerical precision due to log transform and that causing problems in EM convergence - but then again log is a monotonic function and that should actually not be an issue. Practically, it should not be different than finding MLE using loglikelihood. – Zhubarb Feb 04 '15 at 12:57
@Zhubarb: I believe you :-). I was curious enough to do some brief research and it seems that you're right (for example, see [this paper](https://www.udc.edu/docs/dc_water_resources/technical_reports/report_n_187.pdf) and [this dissertation](http://arxiv.org/pdf/1411.6622.pdf)). Some researchers (see [this paper](http://download.springer.com/static/pdf/580/art%253A10.1007%252Fs11434-012-5485-4.pdf?auth66=1423055404_ad37a9fd84c486c08c7fbc5b20ea7818&ext=.pdf)) even introduce a log transformation into the EM algorithm in order to improve certain aspects (in this case, for better classification). – Aleksandr Blekh Feb 04 '15 at 13:24
That is cool, I will check the references out, thank you :) . – Zhubarb Feb 04 '15 at 13:27
@Zhubarb: My pleasure :). – Aleksandr Blekh Feb 04 '15 at 13:31

score 1 · Answer 2 · answered Jul 23 '20 at 19:51

Both fitting a mixture of Gaussians to the untransformed data and transforming the data prior to fitting appear to be valid approaches, but you may get very different results. Depending on the purpose of your analysis, this may be a problem.

Let's say we are interested in determining the order of a Gaussian mixture. In the example below, we will see that log transforming changes the results (in a dramatic and very boring way):

Let's create a two-dimensional empirical distribution that is normal on one margin and lognormal on the other:

set.seed(235)
dat<-data.frame(margin1=rnorm(n=10000,mean=21,sd=6),
                margin2=rlnorm(n=10000,meanlog=5))
dat<-dat[dat$margin1>0,] #drop negative values
sample<-dat[sample(rownames(dat),size=3000,replace=TRUE),]

Now let's model it as a mixture of Gaussians:

library(mclust)
sampleBIC<-mclustBIC(sample)

summary(sampleBIC)
             VVI,5        VVI,6        VVE,5
BIC      -58089.92 -58100.22383 -58102.78384
BIC diff      0.00    -10.30452    -12.86453

Using the standard BIC criterion we would select a 5 component mixture model. But look what happens if we log transform the lognormal variable first:

lSample<-data.frame(margin1=sample$margin1,margin2=log(sample$margin2)) # use = so the columns are named margin1 and margin2
lSampleBIC<-mclustBIC(lSample)
summary(lSampleBIC)
             EEI,1     EVI,1     VEI,1
BIC      -27781.84 -27781.84 -27781.84
BIC diff      0.00      0.00      0.00

Now BIC would lead us to select a 1 component model.

This makes sense, because by log-transforming the lognormal variable we've given ourselves a roughly bivariate normal joint PDF. At least, "bivariate normal enough" that mclust has no trouble fitting a single elliptical bivariate normal density to it. Before we log-transformed our data, mclust had to fit multiple bivariate Gaussians to approximate the observed data, because (even elliptical) bivariate Gaussians don't fit very well into the odd asymmetrical distribution of our observations.
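
The change in shape can be checked on the lognormal margin alone. A small sketch, assuming a fresh draw of 3000 lognormal values rather than the resampled data frame above; the skewness helper and the variable names are introduced here for illustration:

```r
set.seed(235)
raw    <- rlnorm(3000, meanlog = 5)  # lognormal draws, same meanlog as margin2
logged <- log(raw)                    # normal(5, 1) by construction

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew.raw <- skewness(raw)     # strongly right-skewed
skew.log <- skewness(logged)  # close to zero, as for a normal sample
```

A single Gaussian cannot reproduce the long right tail of raw, which is why several components are needed on that scale, while logged is symmetric enough for one.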

So it would make sense, here, to think hard about your goal and the interpretability of your results before making the decision to log transform.
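
One related caveat when weighing the two fits: likelihood-based scores computed on the raw scale and on the log scale are not directly comparable. If Y = log(X), the densities are related by f_X(x) = f_Y(log x) / x, so a log-likelihood computed on the log scale has to be reduced by sum(log(x)) before it can be set against a raw-scale fit. The sketch below checks this identity for a single lognormal; the simulated data and variable names are assumptions for illustration, and it does not recompute the mclust results above.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 5, sdlog = 1)  # simulated positive data
m <- mean(log(x))
s <- sd(log(x))

# Log-likelihood of a normal model evaluated on the log scale.
ll.log.scale <- sum(dnorm(log(x), mean = m, sd = s, log = TRUE))

# Same model expressed on the raw scale: subtract the Jacobian term sum(log(x)).
ll.raw.scale <- ll.log.scale - sum(log(x))

# Reference: the lognormal log-likelihood computed directly on the raw data.
ll.lognormal <- sum(dlnorm(x, meanlog = m, sdlog = s, log = TRUE))
```

Here ll.raw.scale and ll.lognormal agree, while ll.log.scale is much larger simply because of the change of scale. The same reasoning suggests that the two BIC tables above should not be compared with each other without this adjustment for margin2.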
