0

Good day everyone

At the moment I am attempting to write code in R to calculate the following.

$$ \tau_{k j}^{(m)}=\frac{\pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right)}{f\left(x_{j} ; \Theta^{(m)}\right)}=\frac{\pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right)}{\sum_{k=1}^{K} \pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right)} $$

$f_k$ is defined as follow $$ f_{k}\left(x ; \theta_{k}\right)=\frac{1}{(2 \pi)^{P / 2}|V|} \exp \left(-\frac{1}{2}\left(x-\mu_{k}\right)^{\prime} V^{-1}\left(x-\mu_{k}\right)\right), $$ where $V=\operatorname{diag}\left(\sigma_{1}, \sigma_{2}, \ldots, \sigma_{P}\right)$, and $|V|=\prod_{p=1}^{P} \sigma_{p} .$

$j$ is the number of observation, $j=1,....100$

$k$ is the number of clusters, $k=1,2$

$p$ is the number of variables, $p=1,...,1000$

Below is the code for what I have done so far

#packeges

library(reticulate)
library(ggplot2)
library(mvtnorm)
library(matlib)


############## Simulate the data #############

c1 <- 85
c2 <- 15
n=100 # observations (85 in c1 and 15 in c2)
inf <- 150
uninf <- 850
p <- 1000 # variables
#150 informative variables and 850 uninformative
# first 150: for c1 from N(0,1) and c2 from N(1.5,1) 
# last 850 from N(0,1) for all n=100

Data <- matrix(0,n,p)

for (i in 1:inf) {
  Data[1:c1,i] <- rnorm(c1,0,1)
  Data[(c1+1):n,i] <- rnorm(c2,5,1)
}



for (i in (inf+1):p) {
  Data[,i] <- rnorm(n,0,1)
}


# #STANDARDIZE YOUR DATA
 df = as.data.frame(Data)
 funs <- function(x) { c(scl=scale(x))}
 Data <- sapply(df, funs)

 
############# Step 1#############
#Initialize values

k <- 2 #There are K clusters
pi_vec <- matrix(c(0.7,0.3),1,k)
lambda <- 1.5

df = as.data.frame(Data)
funs <- function(x) { c(var=var(x))}   
col_var <- sapply(df, funs)
V <- diag(col_var)

#mean
mu <- matrix(rnorm((p*k),0,1),p,k,byrow = FALSE)


############# Step 2#############
 
  #E-STEP
  top_posterior = numeric(0)

  for(j in 1:k){

    top_posterior = cbind(top_posterior,pi_vec[j]*dmvnorm(Data,mu[,j],V))
  }
  

  tau_kj = top_posterior/rowSums(top_posterior)


Problem is all the values in my top_posterior variable are 0. I'm pretty sure this is because the multivariate normal has such a high dimension that all the values generated by dmvnorm are so small that r simply cannot store it, and hence makes it 0.

Here is what I've tried so far.

  1. I tried calculating $log(\tau_{k j}^{(m)})$ but I still run into problems when calculating the denominator because I have $ log(\sum_{k=1}^{K} \pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right))$ and $ f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right) $ still causes and underflow issue.
  2. Also tried just taking $log(f_{k}\left(x; \theta_{k}\right))$ but once again the denominator causes a problem since I have to take the exponential again before summing and so this causes yet another underflow issue.

It seems the denominator is really making life difficult for me. Am I missing something here? Any help would be greatly appreciated.

Susan-l3p
  • 83
  • 5
  • When computing $\log f_k(x ; \theta_k))$ are you doing this as `log( dmvnorm(Data,mu[,j],V))` or via `dmvnorm(Data,mu[,j],V, log = TRUE)`? The latter expression will likely be much more numerically stable than the first. – jcken Sep 03 '21 at 10:12
  • I actually did it from first principles. So I used `log((1/((2*pi)^(p/2)*det(V))))-(0.5*(t(Data[i,])-mu[,j])%*%inv(V)%*%t(t(Data[i,])-mu[,j])) `. – Susan-l3p Sep 03 '21 at 11:54
  • computing `log(det(V))` can be problematic when it is performed naively. Replace it with `ldetV – jcken Sep 03 '21 at 12:17

0 Answers0