Good day everyone
At the moment I am attempting to write code in R to calculate the following.
$$ \tau_{k j}^{(m)}=\frac{\pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right)}{f\left(x_{j} ; \Theta^{(m)}\right)}=\frac{\pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right)}{\sum_{k=1}^{K} \pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right)} $$
$f_k$ is defined as follow $$ f_{k}\left(x ; \theta_{k}\right)=\frac{1}{(2 \pi)^{P / 2}|V|} \exp \left(-\frac{1}{2}\left(x-\mu_{k}\right)^{\prime} V^{-1}\left(x-\mu_{k}\right)\right), $$ where $V=\operatorname{diag}\left(\sigma_{1}, \sigma_{2}, \ldots, \sigma_{P}\right)$, and $|V|=\prod_{p=1}^{P} \sigma_{p} .$
$j$ is the number of observation, $j=1,....100$
$k$ is the number of clusters, $k=1,2$
$p$ is the number of variables, $p=1,...,1000$
Below is the code for what I have done so far
#packeges
library(reticulate)
library(ggplot2)
library(mvtnorm)
library(matlib)
############## Simulate the data #############
c1 <- 85
c2 <- 15
n=100 # observations (85 in c1 and 15 in c2)
inf <- 150
uninf <- 850
p <- 1000 # variables
#150 informative variables and 850 uninformative
# first 150: for c1 from N(0,1) and c2 from N(1.5,1)
# last 850 from N(0,1) for all n=100
Data <- matrix(0,n,p)
for (i in 1:inf) {
Data[1:c1,i] <- rnorm(c1,0,1)
Data[(c1+1):n,i] <- rnorm(c2,5,1)
}
for (i in (inf+1):p) {
Data[,i] <- rnorm(n,0,1)
}
# #STANDARDIZE YOUR DATA
df = as.data.frame(Data)
funs <- function(x) { c(scl=scale(x))}
Data <- sapply(df, funs)
############# Step 1#############
#Initialize values
k <- 2 #There are K clusters
pi_vec <- matrix(c(0.7,0.3),1,k)
lambda <- 1.5
df = as.data.frame(Data)
funs <- function(x) { c(var=var(x))}
col_var <- sapply(df, funs)
V <- diag(col_var)
#mean
mu <- matrix(rnorm((p*k),0,1),p,k,byrow = FALSE)
############# Step 2#############
#E-STEP
top_posterior = numeric(0)
for(j in 1:k){
top_posterior = cbind(top_posterior,pi_vec[j]*dmvnorm(Data,mu[,j],V))
}
tau_kj = top_posterior/rowSums(top_posterior)
Problem is all the values in my top_posterior
variable are 0. I'm pretty sure this is because the multivariate normal has such a high dimension that all the values generated by dmvnorm
are so small that r simply cannot store it, and hence makes it 0.
Here is what I've tried so far.
- I tried calculating $log(\tau_{k j}^{(m)})$ but I still run into problems when calculating the denominator because I have $ log(\sum_{k=1}^{K} \pi_{k}^{(m)} f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right))$ and $ f_{k}\left(x_{j} ; \theta_{k}^{(m)}\right) $ still causes and underflow issue.
- Also tried just taking $log(f_{k}\left(x; \theta_{k}\right))$ but once again the denominator causes a problem since I have to take the exponential again before summing and so this causes yet another underflow issue.
It seems the denominator is really making life difficult for me. Am I missing something here? Any help would be greatly appreciated.