
Assume $X$ is a binary treatment variable, $Y$ is a continuous variable measured pre- and post-treatment ($Y_{pre}$, $Y_{post}$), and $Z$ represents the remaining covariates. Both $Y_{pre}$ and $Y_{post}$ are subject to left-censoring.

Does anyone know an approach to analyze the effect of $X$ on $Y_{post}$ adjusted for $Y_{pre}$ and $Z$? Tobit regression, for example, can handle the situation where only $Y_{post}$ is censored, and various methods have been suggested for the case where only $Y_{pre}$ (and/or $Z$) is censored. But what if both $Y_{pre}$ and $Y_{post}$ are subject to left-censoring?

andrjens
    I usually think of time-to-event variables with left-censoring. Can you explain what kind of variable or setting has $Y_{pre}$ and $Y_{post}$ subject to left censoring? – AdamO Jul 27 '21 at 05:39
  • Thanks for your reply. I acknowledge that the terminology in the literature is ambiguous. What I mean is that for some observations of Y it is only known that the value is below a certain number. Should I change the original post? – andrjens Jul 27 '21 at 07:23
  • Since we usually condition on $Y_{pre}$ we don't need as much to think of its distribution (unlike what we do for $Y_{post}$). It's just that when we condition on $Y_{pre} = y$ and $y$ is at the lower detectable limit we are really conditioning on $Y_{pre} \leq y$ and this incomplete conditioning will result in a larger residual variance for $Y_{post}$. So you might posit a flexible nonlinear model in $Y_{pre}$ with even a spike at the lower limit, but having two variance terms for $Y_{post}$. I'm not sure how this would play with categorical $Y_{post}$. – Frank Harrell Jul 27 '21 at 12:33
  • I think if you let $\Delta = Y_{post} - Y_{pre}$ be your outcome measure you can derive that this is interval censored and derive the interval bounds based on which of the two are below the limit of detection. – bdeonovic Jul 27 '21 at 20:51
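For intuition, bdeonovic's suggestion can be sketched concretely. Assuming the underlying measurements are positive and left-censored at a detection limit `lod` (an assumption not stated in the thread), the interval for $\Delta = Y_{post} - Y_{pre}$ follows from which of the two values is below the limit; the `delta_bounds` helper below is purely illustrative:

```r
## Interval bounds for Delta = Y_post - Y_pre under left-censoring at a
## detection limit `lod`, assuming the true values are positive.
## A censored value is only known to lie in (0, lod].
delta_bounds <- function(ypre, ypost, lod) {
  cpre  <- ypre  <= lod
  cpost <- ypost <= lod
  # lower bound: smallest possible ypost minus largest possible ypre
  lower <- ifelse(cpost, -ifelse(cpre, lod, ypre),
                  ypost - ifelse(cpre, lod, ypre))
  # upper bound: largest possible ypost minus smallest possible ypre
  upper <- ifelse(cpost, lod, ypost) - ifelse(cpre, 0, ypre)
  cbind(lower = lower, upper = upper)
}

# neither censored / pre censored / post censored / both censored
delta_bounds(ypre = c(2, 1, 2, 1), ypost = c(3, 3, 1, 1), lod = 1)
```

When neither value is censored the interval collapses to a point, so the usual observed $\Delta$ falls out as a special case, which is what makes an interval-censored likelihood a natural fit here.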

1 Answer


If $Y_{pre}$ and $Y_{post}$ are, for instance, lab analytes where the assay can only detect down to a lower limit of detection, you can use an EM algorithm to impute the values according to some distributional assumption.

To illustrate with a univariate case, consider the M step first. In the censored likelihood, each censored observation contributes the CDF evaluated at the detection limit, while each observed value contributes the density. For instance:

set.seed(123)
n <- 100
ypre <- rlnorm(n = n, meanlog = -3, sdlog = 3)
ypre[ypre < 1] <- 1  # left-censor at a detection limit of 1

# Censored-normal negative log-likelihood on the log scale:
# censored values contribute the log CDF, observed values the log density.
negloglik <- function(param, x, cens) {
  -sum(pnorm(x[cens], mean = param[1], sd = param[2], log.p = TRUE)) -
    sum(dnorm(x[!cens], mean = param[1], sd = param[2], log = TRUE))
}

nlm(f = negloglik, p = c(0, 1), x = log(ypre), cens = ypre == 1)

This gives (with many warnings, because the nlm algorithm is a bit rough):

$estimate
[1] -2.509127  2.642844

We can then carry out the E-step, replacing each censored value with its conditional expectation given that it lies below the detection limit:

paramest <- nlm(f = negloglik, p = c(0, 1), x = log(ypre), cens = ypre == 1)$estimate
lod <- log(1)  # detection limit on the log scale
# E[X | X < lod]: integrate x * density up to the limit, then normalize
# by the censoring probability to get the truncated-normal mean.
num <- integrate(function(x) x * dnorm(x, mean = paramest[1], sd = paramest[2]),
                 lower = -Inf, upper = lod)$value
expval <- num / pnorm(lod, mean = paramest[1], sd = paramest[2])
ypre_e <- ifelse(ypre == 1, exp(expval), ypre)
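As a sanity check, this conditional expectation has a closed form: for a normal variable truncated above at $b$, $E[X \mid X < b] = \mu - \sigma\,\phi(\alpha)/\Phi(\alpha)$ with $\alpha = (b - \mu)/\sigma$. A quick comparison against numerical integration (the values of `mu` and `sigma` below are just stand-ins for the fitted estimates):

```r
mu <- -2.51; sigma <- 2.64   # roughly the estimates obtained above
b  <- log(1)                 # detection limit on the log scale

# closed-form mean of a normal truncated above at b
alpha  <- (b - mu) / sigma
etrunc <- mu - sigma * dnorm(alpha) / pnorm(alpha)

# numerical version: integrate x * density below b, normalize by P(X < b)
enum <- integrate(function(x) x * dnorm(x, mu, sigma), -Inf, b)$value /
  pnorm(b, mu, sigma)

abs(etrunc - enum) < 1e-3  # TRUE
```

Using the closed form avoids a numerical integration inside each EM iteration.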

Now this is considerably more complicated in the multivariate case. If we can assume normality for the $Y$s (perhaps after transformation), we can use linear regression as a maximum likelihood procedure: factor the joint likelihood of $Y_{pre}$ and $Y_{post}$ given $X$ into the likelihood of $Y_{pre}$ and that of $Y_{post} \mid X, Y_{pre}$. I think we can thus develop the following algorithm:

  1. Initialize parameter estimates for $Y_{pre}$, the regression parameters for the mean $Y_{post}$ response adjusting for $Y_{pre}$ and treatment, and the residual standard error of $Y_{post}$.
  2. Calculate the expected value of the censored $Y_{pre}$ values.
  3. Estimate the regression by maximum likelihood, with the conditional mean of $Y_{post}$ given by the current regression estimates and the current residual variance, handling censored $Y_{post}$ values through a Tobit-type likelihood.
  4. Update the regression estimates and residual variance according to the MLE.
  5. Repeat steps 3 and 4 until convergence.
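A rough sketch of these steps on simulated data follows. The detection limit, the effect sizes, the `nll_uni`/`nll_reg` helper names, and the one-pass structure (censoring of $Y_{post}$ enters the likelihood directly, so a single maximization stands in for the iteration) are all illustrative assumptions rather than a definitive implementation:

```r
set.seed(123)
n   <- 500
lod <- 1                                    # lower limit of detection
x   <- rbinom(n, 1, 0.5)                    # binary treatment
lpre  <- rnorm(n, mean = 0.5, sd = 1)       # true log Y_pre
lpost <- 0.2 + 0.8 * x + 0.6 * lpre + rnorm(n, sd = 0.5)  # true log Y_post
ypre  <- pmax(exp(lpre), lod)               # left-censor both outcomes
ypost <- pmax(exp(lpost), lod)

## Steps 1-2: censored MLE for the marginal of log Y_pre, then impute
## censored values with the truncated-normal conditional mean.
nll_uni <- function(p, x, cens) {
  s <- exp(p[2])  # optimize log(sd) so the scale stays positive
  -sum(pnorm(x[cens], p[1], s, log.p = TRUE)) -
    sum(dnorm(x[!cens], p[1], s, log = TRUE))
}
cens_pre <- ypre == lod
fit_pre  <- nlm(nll_uni, c(0, 0), x = log(ypre), cens = cens_pre)$estimate
mu <- fit_pre[1]; sigma <- exp(fit_pre[2])
alpha  <- (log(lod) - mu) / sigma
lpre_e <- ifelse(cens_pre, mu - sigma * dnorm(alpha) / pnorm(alpha), log(ypre))

## Steps 3-5: Tobit-type censored regression of log Y_post on treatment
## and the imputed log Y_pre.
nll_reg <- function(p, y, X, cens) {
  mu <- drop(X %*% p[1:3]); s <- exp(p[4])
  -sum(pnorm(y[cens], mu[cens], s, log.p = TRUE)) -
    sum(dnorm(y[!cens], mu[!cens], s, log = TRUE))
}
X <- cbind(1, x, lpre_e)
cens_post <- ypost == lod
fit <- nlm(nll_reg, c(0, 0, 0, 0), y = log(ypost), X = X,
           cens = cens_post)$estimate
c(coef = fit[1:3], sd = exp(fit[4]))  # second coefficient is the treatment effect
```

Because all censored $Y_{pre}$ values receive the same imputed mean, the $Y_{pre}$ slope is attenuated and the residual variance inflated, which is the extra variance term Frank Harrell alludes to in the comments; a full EM treatment would propagate that uncertainty rather than plug in a single value.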
AdamO