
I am using the bootstrap to compute standard errors of the parameter estimates from my normalmixEM output, but I am not sure whether these standard errors are reliable.

My code is (data here):

# load the package (install once beforehand if needed)
# install.packages("mixtools")
library(mixtools)


B = 1000 # Number of bootstrap samples
mu1sample <- mu2sample <- sigma1sample <- sigma2sample <- lambdasample <- numeric(B)

# Bootstrap
for(i in 1:B){
  print(i)
  subsample = sample(mydatatest, replace = TRUE)  # resample with replacement
  normalmix <- normalmixEM(subsample,
                           mu = c(-0.002294, 0.002866),
                           sigma = c(0.00836, 0.02196),
                           lambda = c(0.6746903, 1 - 0.6746903),
                           k = 2, fast = FALSE, maxit = 10000,
                           epsilon = 1e-16, maxrestarts = 1000)
  mu1sample[i]    = normalmix$mu[1]
  mu2sample[i]    = normalmix$mu[2]
  sigma1sample[i] = normalmix$sigma[1]
  sigma2sample[i] = normalmix$sigma[2]
  lambdasample[i] = normalmix$lambda[1]
}
# standard errors

sd(mu1sample)
sd(mu2sample)
sd(sigma1sample)
sd(sigma2sample)
sd(lambdasample)

# show distribution of the bootstrap samples
hist(mu1sample)
hist(mu2sample)
hist(sigma1sample)
hist(sigma2sample)
hist(lambdasample)

This gives the following histograms of the bootstrap distributions for mu1, mu2, sigma1, sigma2 and lambda (figures not shown).

EDIT: If you look at my variable mydatatest and use a kernel density estimate to show the distribution with the following code

plot(density(mydatatest),col="red",main="",lwd=2,cex.axis=1.2,cex.lab=1.2)

you can see what the data look like (figure not shown).

2nd EDIT: I have now fixed the mus and sigmas as starting values, and updated the code and the pictures accordingly. So again my question: what do you think about it now?

Jen Bohold
    What does "unreliable" mean here? Puzzling? I can't read your data but (for example) a bimodal bootstrap sample is very common when outliers are present: the outlier(s) will appear in some samples but not others. There are many relatives of this situation. Closer inspection of the raw data can often make much clearer what is going on. More to the point, it seems that you are fitting a mixture of normals. Is this exactly what you might expect with a mixture of normals? – Nick Cox Aug 10 '13 at 17:54
  • Sorry, you have to add an ".RData" after the filename to make it work. Yes I am fitting a mixture of normals, I want to have the standard errors of the estimates. I hope you can read my data now? – Jen Bohold Aug 10 '13 at 18:38
  • Too much like hard work for me if I have to read it into R first, not a language I use routinely. The real question is whether you take my point or can rebut it as not the right explanation. – Nick Cox Aug 10 '13 at 18:48
  • @NickCox Well, I am not 100% what you are aiming at? I am fitting a mixture of normals (two normals). So there is a market regime which represents the extreme "market scenario". I guess that's why the lambda estimate has two peaks? The results of the algorithm are not always the same if you run it without bootstrap. So sometimes the probability of pi (or lambda) (scenario 1) is e.g. 0.3 in the next time the vice versa way of 0.7. Sometimes even other numbers occur? I am not a professional on this topic, so maybe you can tell me what you need further to know? – Jen Bohold Aug 11 '13 at 08:58
  • @NickCox I added a kernel density of my data. Maybe not the typical data for a mixture of normals, but later on I do it with a different data set. And there it makes sense, because there is the big bell curve and thenn a small additional peak in the right tail, which I want to model with the second normal distribution. So the mixture will contain the big bell curve and the small additional peak. – Jen Bohold Aug 11 '13 at 09:03
  • You haven't explained what you mean by "unreliable". My point is that bimodality here need not be surprising if you think that each bootstrap sample cannot be an exact replica of the data. You need to think about your results in relation to your _raw_ data, which we can't see, experts or otherwise. A kernel density plot does not make completely clear exactly what is happening in the tails. – Nick Cox Aug 11 '13 at 09:04
  • Well I think bimodality is to be expected, since these are stock returns. There is a big bell curve representing the normal market environment and another somehow representing a market crash regime. That's how I explain it. But I wanted to ask, if these values of the standard errors are reliable? It seems, especially the distribution of the lambda to be quite asymmetric, even if one knows the two market regimes? So what do I have to conclude? – Jen Bohold Aug 11 '13 at 09:08
  • "A kernel density plot does not make completely clear exactly what is happening in the tails" - what should I give you instead? – Jen Bohold Aug 11 '13 at 09:09
  • I'd most want to see a graph that shows individual data points far out in the tails. Thinking about your results further, the data seem to show a kind of labelling problem as which distribution is labelled 1 and which is labelled 2 is arbitrary. – Nick Cox Aug 11 '13 at 09:54
  • @NickCox what command do I have to use? I don't think that there is a fixed labelling? I get a pi (or lambda) which gives the probability of occurrence from the first normal, and 1-pi (or 1-lambda) is the probability of occurrence from the second normal? – Jen Bohold Aug 11 '13 at 09:56
  • I can't add much more, as I am not experienced with this kind of modelling and don't use R routinely. Also, remember that someone making initial comments doesn't mean that they can stick with you all the way until you no longer want help. But I think of it this way: you are saying two normals with different means and SDs, but it's not built in that distribution 1 is the one with the lower SD and 2 the other. So, sometimes (often) they flip around in the output. In turn I may be misunderstanding what's going on, but that is my latest (and last) guess. – Nick Cox Aug 11 '13 at 10:13
  • Yeah, that is exactly what I also wanted to say: They flip around. But I cannot do anything against it? That's why I tried to used a fixed lambda in the command. – Jen Bohold Aug 11 '13 at 11:32

2 Answers


As explained by Nick Cox and an anonymous user, what you think of as instability is just what the mixture models do: they don't care about labels unless you make it very clear that you know what your modes look like, roughly.

In terms of what you can do about fixing the labels where you need them to be, you would want to feed the full sample estimates of everything (both $\mu$s, both $\sigma$s, not just the $\lambda$ that you are feeding in now) as starting values. One can argue that this violates the spirit of maximum likelihood, but that may be the best you can do. If that does not really work, you may have to force even more information in, like insisting that $\mu_1 < \mu_2 - \delta$ and $\sigma_1 < \sigma_2 - \Delta$ and $\lambda > \frac12$. If normalmixEM() does not support that kind of cruelty to the parameter space, you would need to write your own likelihood with your own parameterization that accounts for such relations.
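One base-R way to carry out the label-fixing idea above is post-hoc relabeling: after each bootstrap fit, reorder the components so that $\sigma_1 < \sigma_2$ always holds before storing the draws. This is a sketch, not mixtools functionality; the helper name `relabel` is my own.

```r
# Hypothetical helper (not part of mixtools): enforce sigma[1] < sigma[2]
# so the two components cannot trade labels across bootstrap replicates.
relabel <- function(mu, sigma, lambda) {
  if (sigma[1] > sigma[2]) {
    mu     <- rev(mu)      # swap the means along with the sds
    sigma  <- rev(sigma)
    lambda <- rev(lambda)  # lambda[1] must follow component 1
  }
  list(mu = mu, sigma = sigma, lambda = lambda)
}

# Inside the bootstrap loop, before storing the draws:
# fix <- relabel(normalmix$mu, normalmix$sigma, normalmix$lambda)
# mu1sample[i] <- fix$mu[1]; sigma1sample[i] <- fix$sigma[1]; ...
```

Whether to sort on sigma, mu, or lambda is a modeling choice; sigma is natural here because the question describes a low-volatility and a high-volatility regime.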

StasK
  • ok thanks for the answer and how can I fix the mus and sigmas and not only the lambda? What would the code look like? – Jen Bohold Aug 13 '13 at 08:28
  • I don't really know -- study the `normalmixEM` documentation. I wrote my own Gaussian mixture code in Stata some 15 years ago. Writing code to solve specific problems helps a lot in figuring things out about the model, in general. – StasK Aug 13 '13 at 15:14
  • I updated my answer according to your answer. What do you think about it now? – Jen Bohold Aug 14 '13 at 07:54
  • If sigma2 comes out bigger, swap the sigmas and the means you save from each run. – Nick Cox Aug 14 '13 at 08:00
  • @Jen, I think that's much better now. – StasK Aug 14 '13 at 12:47
  • Another way to visualize the data would be to plot mus against sigmas. I am guessing that they would fall into two very distinct classes. Then you can make a decision to swap the labels if you are to see that they fall into the wrong cluster on that space. – StasK Aug 14 '13 at 18:17
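StasK's scatter-plot suggestion above can be sketched as follows. The four vectors stand in for the bootstrap output from the question's loop (here filled with placeholder draws purely for illustration):

```r
set.seed(1)
# Stand-ins for the bootstrap vectors produced by the question's loop
mu1sample    <- rnorm(100, -0.002, 0.001)
sigma1sample <- rnorm(100,  0.008, 0.001)
mu2sample    <- rnorm(100,  0.003, 0.002)
sigma2sample <- rnorm(100,  0.022, 0.003)

# Plot the bootstrap (mu, sigma) pairs for both components; replicates
# falling in the "wrong" cluster are candidates for label swapping.
plot(c(mu1sample, mu2sample), c(sigma1sample, sigma2sample),
     col = rep(c("blue", "red"), each = length(mu1sample)),
     pch = 16, xlab = "mu", ylab = "sigma")
legend("topleft", legend = c("component 1", "component 2"),
       col = c("blue", "red"), pch = 16)
```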

My hunch is that your approach might not be reliable due to label switching: each time you fit the mixture model, the roles of the two normal distributions may be reversed.

That is, across different runs of the EM algorithm, (mu1, sigma1) and (mu2, sigma2) might switch places.

It looks like the boot.se function provided by mixtools tries to account for this issue.
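For reference, a minimal call might look like the sketch below, using the question's mydatatest. The argument `B` and the returned `*.se` components are as I recall the mixtools API; check `?boot.se` before relying on them.

```r
library(mixtools)

# Fit the two-component mixture once on the full sample, then let
# boot.se() handle the bootstrap resampling and refitting internally.
fit <- normalmixEM(mydatatest, k = 2, maxit = 10000)
se  <- boot.se(fit, B = 100)

se$mu.se      # bootstrap SEs of the component means
se$sigma.se   # bootstrap SEs of the component sds
se$lambda.se  # bootstrap SE of the mixing proportion
```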

Max S.
  • thanks for your answer, but the problem is that boot.se leads to very different results every time I run the procedure? – Jen Bohold Aug 13 '13 at 08:29