15

Suppose we have height data for men and women.

R code:

    set.seed(1)
    Women=rnorm(80, mean=168, sd=6)
    Men=rnorm(120, mean=182, sd=7)
    par(mfrow=c(2,1))
    hist(Men, xlim=c(150, 210), col="skyblue")
    hist(Women, xlim=c(150, 210), col="pink")

Unfortunately, something happened and we lost the information about who is a woman and who is a man.

R code:

    heights=c(Men, Women)
    par(mfrow=c(1,1))
    hist(heights, col="gray70")
    rm(Women, Men)

Could we somehow estimate the mean and standard deviation of women's and men's heights using the maximum likelihood method?

We know that both men's and women's heights are normally distributed.

kjetil b halvorsen

2 Answers

19

This is a classic unsupervised learning problem that has a simple maximum likelihood solution. The solution is a motivating example for the expectation maximization algorithm. The process is:

  1. Initialize the group assignments.
  2. Estimate the group-wise means and standard deviations.
  3. Calculate the likelihood of each observation under each group.
  4. Reassign each observation to the group under which its likelihood is highest.

Repeat steps 2-4 until convergence, i.e. until no observation is reassigned.

For the initialization, I assume it is known that 80 of the 200 people are women. Note also that unless we build in the assumption that women are shorter than men, a clustering algorithm has no way to tell which group should carry which label, so the cluster labels can come out reversed.

    set.seed(1) 
    Women=rnorm(80, mean=168, sd=6) 
    Men=rnorm(120, mean=182, sd=7) 
    AllHeight <- c(Women, Men)
    trueMF <- rep(c('F', 'M'), c(80, 120))
    
    ## case 1: assume women are shorter, so label the
    ## 80 lowest heights as 'F' (rank() gives each value's position in sorted order)
    MF <- ifelse(rank(AllHeight) <= 80, 'F', 'M')
    
    ## case 2 try randomly allocating 
    # MF <- sample(trueMF, replace = F)
    
    steps <- 0
    
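    repeat {
      steps <- steps + 1
      ## group-wise estimates under the current labels
      mu <- tapply(AllHeight, MF, mean)
      sigma <- tapply(AllHeight, MF, sd)  # named sigma so that sd() is not masked
      ## 200 x 2 matrix of log-densities, one column per group (F, M)
      logLik <- mapply(dnorm, mean=mu, sd=sigma,
                       MoreArgs=list(x=AllHeight, log=TRUE))
      ## relabel each observation with its most likely group
      MFnew <- c('F', 'M')[apply(logLik, 1, which.max)]
      if (all(MF==MFnew)) break
      else MF <- MFnew
    }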
    
    ## case 1: 
    # 85% correct
    # 2 steps
    # Means
    # F        M 
    # 168.7847 183.5424 
    
    ## case 2:
    ## 15% correct
    ## 7 steps
    # F        M 
    # 183.5424 168.7847 
    
    ## what else?
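The reversed labels in case 2 can be repaired after the fact. Below is a minimal sketch, assuming the convention that the group with the smaller mean should be called 'F'. The helper `relabel_by_mean` and the toy data are hypothetical and only illustrate the idea; they are not part of the code above.

```r
# Relabel a two-group clustering so that the group with the smaller mean
# gets the first name in `ordered_names` (assumed convention: 'F' is shorter).
relabel_by_mean <- function(x, labels, ordered_names = c("F", "M")) {
  group_means <- tapply(x, labels, mean)
  # Current group names, sorted from smallest to largest group mean.
  by_mean <- names(sort(group_means))
  # Lookup table: current label -> canonical label.
  lookup <- setNames(ordered_names, by_mean)
  unname(lookup[as.character(labels)])
}

# Toy data: the clusters are right, but the names are swapped.
x      <- c(160, 162, 165, 180, 183, 185)
labels <- c("M", "M", "M", "F", "F", "F")
fixed  <- relabel_by_mean(x, labels)
print(fixed)
```

Applied to the output of the loop above, this would amount to `MF <- relabel_by_mean(AllHeight, MF)`; it relies entirely on the outside assumption that women are shorter on average, which the data alone cannot supply.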
kjetil b halvorsen
AdamO
15

What you are describing is a mixture of two Gaussians.

$$ f(x) = \pi \, \mathcal{N}(\mu_1, \sigma_1^2) + (1 - \pi) \, \mathcal{N}(\mu_2, \sigma_2^2) $$

where $\pi \in (0, 1)$ is the mixing proportion. Notice that to estimate the means and standard deviations of both groups, you would need to know the group assignments. Conversely, to recover the group assignments, you would need the parameters, so that each observation could be assigned to the group it most plausibly came from (roughly, the one whose mean it is closest to). This is a chicken-and-egg problem. It can be solved with the Expectation-Maximization (EM) algorithm, which starts by assigning the groups randomly, calculates the parameters given those assignments, re-classifies the observations given the parameters, and repeats until convergence. There are other algorithms, but this is the most popular one. You may know the idea from $k$-means clustering, which can be seen as a limiting special case of fitting a Gaussian mixture with equal variances and hard assignments.
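To make the iteration concrete, here is a minimal sketch of EM for the two-component mixture, applied to the simulated heights from the question. It is an illustration under stated assumptions rather than a reference implementation: the function name `em_two_gaussians`, the median-split starting values, and the stopping tolerance are all my own choices. It also uses soft assignments (posterior membership probabilities) instead of hard re-classification.

```r
# EM for a mixture p * N(mu1, sigma1^2) + (1 - p) * N(mu2, sigma2^2).
# Hypothetical helper; starting values and tolerance are assumptions.
em_two_gaussians <- function(x, max_iter = 500, tol = 1e-8) {
  # Crude starting values: split the data at the median.
  lower <- x <= median(x)
  p     <- mean(lower)                       # mixing proportion of component 1
  mu    <- c(mean(x[lower]), mean(x[!lower]))
  sigma <- c(sd(x[lower]), sd(x[!lower]))
  loglik_old <- -Inf

  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that each point belongs to component 1.
    d1 <- p * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
    r  <- d1 / (d1 + d2)

    # Log-likelihood at the current parameters; stop once it no longer improves.
    loglik <- sum(log(d1 + d2))
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik

    # M-step: weighted maximum likelihood updates.
    p     <- mean(r)
    mu    <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
    sigma <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
               sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
  }
  list(p = p, mu = mu, sigma = sigma, loglik = loglik, iterations = iter)
}

# Same simulated data as in the question.
set.seed(1)
Women   <- rnorm(80, mean = 168, sd = 6)
Men     <- rnorm(120, mean = 182, sd = 7)
heights <- c(Men, Women)

fit <- em_two_gaussians(heights)
print(fit)
```

Because the two groups overlap considerably, the estimates need not land exactly on the simulation parameters, and a different starting point can lead to a different local maximum or to swapped component labels.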

Tim