4

For reasons I won't go into I need to calculate parameter estimates from several imputed datasets. Based on this CV post about Rubin's rules I have determined how to manually calculate both the pooled coefficient and standard error. However the method for the p-value eludes me. If it were up to me I would make do with the pooled coefficient and se, but I know my boss will want the p-value.

Here is the toy analysis.

First the imputation.

require(mice)
set.seed(123)
nhimp <- mice(nhanes)
v <- sapply(1:5, function(i) {
  fit <- lm(chl ~ bmi, data=complete(nhimp, i))
  print(c('coef'=coef(fit)[2], 'var'=vcov(fit)[2, 2], 'p'=summary(fit)$coefficients["bmi",4]))
})

Create a matrix of extracted estimates from the model applied to each of the five complete imputed datasets. 1st column are the coefficients, 2nd column are the variances, 3rd column are the p-values.

(mat <- t(v))

#      coef.bmi      var          p
# [1,] 2.195180 5.467482 0.35758491
# [2,] 4.231113 3.603410 0.03587456
# [3,] 3.470647 5.475586 0.15159999
# [4,] 2.937763 4.776171 0.19198054
# [5,] 2.208305 2.972943 0.21304488

Now for the pooled estimates. The pooled estimate of the regression coefficients is easy: it's just the mean of the coefficients (first column

(pooledMean <- mean(mat[,1]))

# [1] 3.008602

Calculating the pooled estimate of the standard error is a bit more tricky, but still relatively simple. Total variance is the sum of within-variance and between-variance*degrees-of-freedom-correction.

Within variance is the average of the imputation specific point estimate variances

(withinVar <- mean(mat[,2])) # mean of variances
# [1] 4.459118

Between variance is the variance of the coefficients (variance of first column, or sd of first column squared.

(betweenVar <- sd(mat[,1])^2) # variance of coefficients
# [1] 0.7537916

The degrees of freedom correction is (m+1)/m where m is the number of imputations

(dfCorrection <- (nrow(mat)+1)/(nrow(mat))) # dfCorrection
# [1] 1.2

Now we can calculate total variance

(totVar <- withinVar + betweenVar*dfCorrection) 
# [1] 5.363668

The pooled standard error is just the square root of the total variance

(pooledSE <- sqrt(totVar)) # standard error
# [1] 2.315959

Now is the part I don't know: how to get the pooled estimate of the p-value

(pooledP <- mean(mat[,3])) #??????
# [1] 0.190017

Put them all together

(pooledEstimates <- round(c(pooledMean, pooledSE, pooledP),5))
# [1] 3.00860 2.31596 0.19002

These should be exactly the same as the pooled values for these parameters returned by mice

fit <- with(data=nhimp,exp=lm(chl~bmi))
summary(pool(fit))
#          term   estimate std.error statistic       df   p.value
# 1 (Intercept) 111.958092 61.373512  1.824209 15.93028 0.0869345
# 2         bmi   3.008602  2.315959  1.299074 15.68225 0.2126945

The manually calculated pooled coefficient and se are the same as those yielded by the pool() function; but not the p-value. Can anyone explain simply the way mice calculates the pooled p-value? This post explains how to do it with software but I need to calculate it manually.

llewmills
  • 1,429
  • 12
  • 26
  • 1
    The trick is the degrees of freedom, right? Because it's just the upper tail probability of getting more than the absolute value of a t statistic `pt(q = pooledMean / pooledSE, df = 18.21792, lower.tail = FALSE) * 2`. So you just need to know where 18.21792 comes from? The help file for ?pool says it uses the "Barnard-Rubin adjusted degrees of freedom". – Peter Ellis Feb 07 '18 at 01:46
  • Thank you @Peter Ellis. That was surprisingly simple. Every answer begs another question (as it should), so now I need to work out how to calculate the Barnard-Rubin adjusted degrees of freedom, and that should get me to my final destination. – llewmills Feb 07 '18 at 02:09
  • 1
    This should help: https://github.com/stefvanbuuren/mice/blob/master/R/mice.df.r – Peter Ellis Feb 07 '18 at 02:13
  • Thanks @Peter Ellis. Just had my Eureka moment, and have posted it below. Even though I quit smoking years ago I feel like a cigarette. – llewmills Feb 07 '18 at 03:34
  • 1
    Might betweenVar and withinVar been switched in the opening post? Also asking because of the solution you mentioned. – user297944 Oct 02 '20 at 14:16
  • @llewmills do you have source for the formula Total variance is the sum of between-``` variance and within-variance*degrees-of-freedom-correction``` ? It seems to me that the between and within variances should be exchanged, but am probably mistaken: https://www.missingdata.nl/missing-data/missing-data-methods/multiple-imputation/ – Samuel Saari Feb 18 '22 at 13:40
  • Yes @Samuel Saari you are absolutely right, the formula for pooled se should be as you say `withinVar + betweenVar*dfCorrection`. I mixed up the formula for the within and between variance from @AdamO's answer to the post I linked to. So I got the right answer but just with switch between and within variance labels. I have corrected this error in the original post and my answer below. Thank you for alerting me to this. – llewmills Feb 20 '22 at 11:27
  • Good to hear @Ilewmills. Related, the same question for residual standard error https://stats.stackexchange.com/q/564870/138594 . Would appreciate if you could have a look, especially the betweenVar-variable. Might be of interest to others grabbling with pooling multiple imputation parameters, too. – Samuel Saari Feb 21 '22 at 08:25
  • Sorry I'm not clear on what you want, the source? – llewmills Feb 21 '22 at 22:45
  • @Ilewmills, In the 1st comment was asking for the source for the total variance, and epecially the ```dfCorrection```, but that seems to be sorted out now. In my second comment I wanted to draw attention to my new post that is similar to this one. I thought that you or others dealing with pooling could either help me w or find that thread useful. – Samuel Saari Mar 01 '22 at 11:35

1 Answers1

4

This is for anyone who is interested, after reading pp. 37-43 in Flexible Imputation of Missing Data by Stef van Buuren. If we call the adjusted degrees of freedom nu

  m <- nrow(mat)
  lambda <- (betweenVar + (betweenVar/m))/totVar
  n <- nrow(nhimp$data)
  k <- length(coef(lm(chl~bmi,data = complete(nhimp,1))))
  nu_old <- (m-1)/lambda^2  
  nu_com <- n-k
  nu_obs <- (nu_com+1)/(nu_com+3)*nu_com*(1-lambda)
  (nu_BR <- (nu_old*nu_obs)/(nu_old+nu_obs))
  # [1] 15.68225

nu_BR, the Barnard_Rubin adjusted degrees of freedom, matches up with the degrees of freedom for the bmi variable yielded from the the summary(pool(fit)) call above: 15.68225. So we can pass this value into degrees of freedom argument in the pt() function in order to obtain the two-tailed p-value for the imputed model.

pt(q = pooledMean / pooledSE, df = nu_BR, lower.tail = FALSE) * 2
# [1] 0.2126945

And this manually calculated p-value now matches the p-value from the mice function output.

llewmills
  • 1,429
  • 12
  • 26