Splitting of bimodal distribution, use in regression models

Question

I have a bimodal length-frequency distribution for the females of a species with a one-year life span. This pattern is not observed in the males.

I suspect that the bimodality is due to different hatching times and the associated environmental conditions. I would like to separate these distributions and see if the environmental variables can explain the variability of the population.

From reading this question, I thought I could do this using finite mixture modelling, more specifically the normalmixEM function from the mixtools package in R.

Sample of my data:

  Year Period Sex Length.int
1 2000      E   F         28
2 2000      E   F         26
3 2000      E   F         21
4 2000      E   F         25
5 2000      E   F         23
6 2000      E   F         24

The output of the model:

summary of normalmixEM object:
          comp 1    comp 2
lambda  0.553918  0.446082
mu     24.039199 29.286508
sigma   1.515261  2.027977
loglik at estimate:  -1511.065

My question is, how do I use the parameters from the output to actually split the bimodal distribution into two unimodal distributions?

Is this what I should be doing in the first place? Is it statistically sound? Some of the comments here lead me to think that it is not necessary to be performing this splitting, especially since I will most likely be using a regression model. However, I have data from the last twenty years, and I want to try to explain the variation in length of this species, both within and between different years. In this case does it matter if my length variable is bimodal for some of the years I will be comparing to one another?

The classic reason for bimodality in a variable representing the size of an organism is the organism's sex. In the case of humans, males are on average larger than females. — Maarten Buis, Oct 06 '19 at 18:28
I should have clarified, this is the females of the population, the sexes are already separated, we don’t see this type of bimodality in the males, they usually have a pretty normal bell curve distribution — watermineporcupine, Oct 06 '19 at 20:08
Re the hatching time, not yet, I am hoping to age some of the animals later on in order to be able to groundtruth this. — watermineporcupine, Oct 06 '19 at 20:09
Once you have the hatching time, I would split based on that (if I split at all). — Peter Flom, Oct 07 '19 at 11:06

score 1 · Accepted Answer · answered Oct 07 '19 at 11:12

The comments you refer to in your last paragraph are correct, but perhaps misleading. It is true that regression does not make an assumption about the distribution of the dependent variable (it assumes things about the errors).
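A small simulation can make this point concrete. The sketch below is illustrative only: the binary predictor `group` (for example, early versus late hatching), the group means and the error standard deviation are all made-up assumptions, not values from the question's data.

```r
# Simulated illustration; every name and number here is hypothetical.
set.seed(1)
n <- 500

# Hypothetical binary predictor, e.g. 0 = early hatching, 1 = late hatching
group <- rbinom(n, size = 1, prob = 0.5)

# Response with group means of 24 and 29 and normal errors (sd = 1)
len <- 24 + 5 * group + rnorm(n, sd = 1)

# Ordinary linear regression of length on the predictor
fit <- lm(len ~ group)
res <- residuals(fit)

# The response is far more spread out than the residuals, because most of
# its spread comes from the two group means rather than from the errors
sd(len)
sd(res)
```

Plotting `hist(len)` and `hist(res)` for this simulated data should show a bimodal response next to roughly bell-shaped residuals, which is the situation in which the regression assumptions hold despite a bimodal dependent variable.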

But just because a model doesn't violate assumptions doesn't mean it is a good model. Remember that the usual regression models are models of the mean. Often, with a bimodal or multimodal response, the mean is not interesting. Often you would not use it as a measure of location -- in fact, there might not be a single good measure of location. So, if you aren't interested in the mean, why model it?

One way around this is quantile regression. Here you could model the quantiles that correspond to the peaks of your combined data.
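As a rough sketch of how this could look, the code below first uses the mixture parameters reported in the question to work out which quantile levels sit at the two component means (taken here as an approximation to the two peaks), and then fits a quantile regression at those levels with `rq` from the quantreg package. The covariate `temp`, its effect size and the simulated data are hypothetical and stand in for a real environmental variable.

```r
# Mixture parameters as reported in the question's normalmixEM summary
lambda <- c(0.553918, 0.446082)
mu     <- c(24.039199, 29.286508)
sigma  <- c(1.515261, 2.027977)

# CDF of the fitted two-component normal mixture
mix_cdf <- function(x) {
  lambda[1] * pnorm(x, mu[1], sigma[1]) +
    lambda[2] * pnorm(x, mu[2], sigma[2])
}

# Quantile levels at the two component means (about 0.28 and 0.78)
taus <- mix_cdf(mu)
round(taus, 2)

# Simulated data; 'temp' is a hypothetical environmental covariate
set.seed(1)
n    <- 400
temp <- runif(n, 10, 20)
late <- rbinom(n, 1, lambda[2])            # 1 = second mixture component
len  <- rnorm(n, mean = mu[1 + late], sd = sigma[1 + late]) +
  0.2 * (temp - 15)
dat  <- data.frame(len = len, temp = temp)

# Quantile regression at both levels, if quantreg is installed
if (requireNamespace("quantreg", quietly = TRUE)) {
  fit <- quantreg::rq(len ~ temp, tau = taus, data = dat)
  print(coef(fit))   # one column of coefficients per quantile level
}
```

With real data you would replace `dat` with your own data frame and could equally pick the quantile levels by eye from the density plot instead of from a fitted mixture.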

Thank you, this makes sense, reading up on quantile regression has clarified this! — watermineporcupine, Oct 08 '19 at 17:36
Also I found this review useful A gentle introduction to quantile regression for ecologists http://www.sortie-nd.org/lme/Course_Schedule_2011/Day_4/Quantile%20Regression%20for%20Ecologists.pdf — watermineporcupine, Oct 08 '19 at 18:05

score 1 · Answer 2 · answered Mar 27 '20 at 10:25

Using the ggpmisc package (together with ggplot2) might help:

library(ggplot2)

# density plot of the variable (df and MyVariable stand in for your own data)
densityCurve <- ggplot(df, aes(x = MyVariable)) + geom_density()
# extract the data from the graph
densityCurveData <- ggplot_build(densityCurve)
# get the indices of the local minima (peaks of the negated density)
localMinIdx <- which(ggpmisc:::find_peaks(-densityCurveData$data[[1]]$density))
# get the x values of the local minima, padded so the breaks cover the whole range
localMins <- densityCurveData$data[[1]]$x[localMinIdx]
localMins <- c(-Inf, localMins, +Inf)

You can now split your data at the dashed red lines in the following graph (using the cut function):

ggplot(df, aes(x=MyVariable)) + geom_density() + geom_vline(xintercept = localMins, color="red", linetype = "dashed")
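The split itself could then look something like the sketch below. The simulated `df` and the single cut point of 26.5 are placeholders so that the example runs on its own; in practice `localMins` would come from the density minima found above.

```r
# Hypothetical data and cut point, for illustration only
set.seed(1)
df <- data.frame(MyVariable = c(rnorm(200, mean = 24, sd = 1.5),
                                rnorm(200, mean = 29, sd = 2)))
localMins <- c(-Inf, 26.5, Inf)

# Assign each observation to the interval between consecutive minima;
# labels = FALSE returns integer group codes (1, 2, ...)
df$group <- cut(df$MyVariable, breaks = localMins, labels = FALSE)

# Group sizes and group means
table(df$group)
aggregate(MyVariable ~ group, data = df, FUN = mean)
```

Note that a hard cut like this assigns every observation to exactly one group, so values near the cut point are the ones most likely to be misclassified.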

Hope this helps.
