Finding category with maximum likelihood method

Question

Let's say that we had height data for men and women.

R code:

set.seed(1) 
Women=rnorm(80, mean=168, sd=6) 
Men=rnorm(120, mean=182, sd=7) 
par(mfrow=c(2,1)) 
hist(Men, xlim=c(150, 210), col="skyblue") 
hist(Women, xlim=c(150, 210), col="pink")

Unfortunately, something happened and we lost the information about who is a woman and who is a man.

R code:

heights=c(Men, Women) 
par(mfrow=c(1,1)) 
hist(heights, col="gray70") 
rm(Women, Men)

Could we somehow estimate the mean height and standard deviation for women and for men using the maximum likelihood method?

We know that men's and women's heights are normally distributed.

Though of course you'd need to add some additional information, external to the data (since it lacks labels) to know which mixture component was which... — Glen_b, Jan 01 '22 at 06:17

score 19 · Accepted Answer · edited Dec 07 '21 at 02:14

This is a classic unsupervised learning problem that has a simple maximum likelihood solution. The solution is a motivating example for the expectation-maximization (EM) algorithm. The process is:

1. Initialize the group assignment.
2. Estimate the group-wise means and standard deviations.
3. Calculate the likelihood of membership in either group for each observation.
4. Assign each observation to the group under which its likelihood is highest.

Repeat steps 2-4 until convergence, i.e. until no observation is reassigned.

Without loss of generality, I assume it is known that 80 of the 200 people are women. Note also that unless we build in the assumption that women are shorter than men, a clustering algorithm has no way to tell which group should carry which label, so the cluster labels can come out reversed.

    set.seed(1) 
    Women=rnorm(80, mean=168, sd=6) 
    Men=rnorm(120, mean=182, sd=7) 
    AllHeight <- c(Women, Men)
    trueMF <- rep(c('F', 'M'), c(80, 120))
    
    ## case 1: assume women are shorter, so assign the
    ## 80 lowest heights to 'F'
    MF <- ifelse(rank(AllHeight) <= 80, 'F', 'M')
    
    ## case 2 try randomly allocating 
    # MF <- sample(trueMF, replace = F)
    
    steps <- 0
    
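    repeat {
      steps <- steps + 1
      ## group-wise parameter estimates under the current labels
      mu <- tapply(AllHeight, MF, mean)
      sigma <- tapply(AllHeight, MF, sd)
      ## 200 x 2 matrix of log-densities, one column per group
      logLik <- mapply(dnorm, x=list(AllHeight), mean=mu, sd=sigma, 
                        log=TRUE)
      ## relabel each observation by its most likely group
      MFnew <- c('F', 'M')[apply(logLik, 1, which.max)]
      if (all(MF==MFnew)) break
      else MF <- MFnew
    }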
    
    ## case 1: 
    # 85% correct
    # 2 steps
    # Means
    # F        M 
    # 168.7847 183.5424 
    
    ## case 2:
    ## 15% correct
    ## 7 steps
    # F        M 
    # 183.5424 168.7847 
    
    ## what else?
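
The label reversal discussed in this answer can be checked with a small wrapper. What follows is a minimal sketch, not the code that produced the numbers above: `hard_em` is a hypothetical helper name, the deliberately reversed start (tallest 80 labelled 'F') is an assumption chosen so that the label switch is reproducible, and the final swap relies on the external assumption that women are shorter on average.

```r
# Hypothetical helper: hard-assignment loop with a guard against empty groups
hard_em <- function(x, init, max_steps = 100) {
  labels <- init
  for (step in seq_len(max_steps)) {
    groups <- factor(labels, levels = c("F", "M"))
    if (any(table(groups) < 2)) stop("a group collapsed; try another start")
    # Group-wise estimates under the current labels (named vectors F, M)
    mu    <- vapply(split(x, groups), mean, numeric(1))
    sigma <- vapply(split(x, groups), sd, numeric(1))
    # Log-density of every observation under each group
    ll_F <- dnorm(x, mu[["F"]], sigma[["F"]], log = TRUE)
    ll_M <- dnorm(x, mu[["M"]], sigma[["M"]], log = TRUE)
    new_labels <- ifelse(ll_F >= ll_M, "F", "M")
    if (all(new_labels == labels)) break
    labels <- new_labels
  }
  list(labels = labels, mu = mu, sigma = sigma, steps = step)
}

set.seed(1)
Women <- rnorm(80, mean = 168, sd = 6)
Men   <- rnorm(120, mean = 182, sd = 7)
AllHeight <- c(Women, Men)
trueMF <- rep(c("F", "M"), c(80, 120))

# Deliberately reversed start: label the 80 tallest people 'F'
fit <- hard_em(AllHeight, ifelse(rank(AllHeight) > 120, "F", "M"))
raw_accuracy <- mean(fit$labels == trueMF)

# Resolve label switching using the assumption that women are shorter
if (fit$mu[["F"]] > fit$mu[["M"]]) {
  fit$labels <- ifelse(fit$labels == "F", "M", "F")
  fit$mu     <- setNames(rev(fit$mu), c("F", "M"))
  fit$sigma  <- setNames(rev(fit$sigma), c("F", "M"))
}
accuracy <- mean(fit$labels == trueMF)
```

Comparing `raw_accuracy` with `accuracy` shows how much of the apparent error comes purely from the arbitrary naming of the two clusters rather than from the partition itself.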

score 15 · Answer 2 · answered Dec 06 '21 at 17:53

What you are describing is a mixture of two Gaussians.

$$ f(x) = \pi \, \mathcal{N}(\mu_1, \sigma_1^2) + (1 - \pi) \, \mathcal{N}(\mu_2, \sigma_2^2) $$

where $\pi \in (0, 1)$ is the mixing proportion. Notice that to estimate the means and standard deviations of both groups, you would need to know the group assignments. Conversely, to find the group assignments, you would need the parameters: a natural way would be to assign each observation to the group whose mean it is closest to. This is a chicken-and-egg problem. It can be solved with the expectation-maximization (EM) algorithm, which starts from some initial assignment (for example, a random one), estimates the parameters given that assignment, re-classifies the observations given the parameters, and repeats until convergence. There are other algorithms as well, but this is the most popular one. You may know the idea from $k$-means clustering, which can be viewed as a special case of fitting a Gaussian mixture (hard assignments and equal, spherical variances).
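
To make the alternation concrete, here is a minimal sketch of EM for the two-component density above, using soft (probabilistic) assignments instead of hard labels. The function name `em_two_gaussians`, the median-split starting values, and the stopping tolerance are all assumptions made for illustration; `p` plays the role of $\pi$.

```r
# Minimal EM sketch for a two-component univariate Gaussian mixture
em_two_gaussians <- function(x, max_iter = 500, tol = 1e-8) {
  # Crude start (an assumption): split the data at the median
  lower <- x <= median(x)
  p   <- mean(lower)                        # mixing proportion of component 1
  mu  <- c(mean(x[lower]), mean(x[!lower]))
  sig <- c(sd(x[lower]), sd(x[!lower]))
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # Weighted component densities at the current parameters
    d1 <- p * dnorm(x, mu[1], sig[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sig[2])
    loglik <- sum(log(d1 + d2))
    # Stop once the log-likelihood no longer improves noticeably
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik
    # E-step: posterior probability of belonging to component 1
    r <- d1 / (d1 + d2)
    # M-step: weighted maximum likelihood updates
    p   <- mean(r)
    mu  <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    sig <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
             sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
  }
  list(p = p, mu = mu, sigma = sig, loglik = loglik, iterations = iter)
}

# Same simulated data as in the question, with the labels discarded
set.seed(1)
Women <- rnorm(80, mean = 168, sd = 6)
Men   <- rnorm(120, mean = 182, sd = 7)
heights <- c(Men, Women)

fit <- em_two_gaussians(heights)
fit$mu      # estimated component means
fit$sigma   # estimated component standard deviations
fit$p       # estimated share of the component with the lower starting mean
```

The estimates say nothing about which component is "women"; as noted in the comment under the question, that label has to come from outside information such as the assumption that women are shorter on average.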
