Calculating pooled p-values from cross validation folds

Question

I want to calculate the pooled p-value of a regression coefficient across K fold cross validation. I have a model

$$Y \sim \mathrm{Intercept} + \mathrm{Cov}_1 + \mathrm{Cov}_2 + \mathrm{Cov}_3 + X$$

and I'm interested in the pooled p value estimate of the variable X, after adjusting for the 3 covariates. To this end I perform cross validation, and I fit this logistic regression for each fold.

Following the procedure described in Calculating pooled p-values manually, I get a K×3 matrix whose first column holds the coefficients, the second the variances, and the third the p-values. However, I am not sure this procedure is applicable here: I'm not doing imputation, I just have K folds, so there is no variation due to imputation, only sampling variation. The main problem is that for some variables X I obtain p-values > 1 (after multiplying the t_test result by 2).

Based on the link above I have a few questions:

1. Is this procedure applicable to this task at all, even without imputation? If not, is there another way to pool the p-value estimates?
2. I am only using the p-values to test for significance, so is it really a problem that some are > 1 (since anything > 0.05 is considered not significant anyway)?
3. For n, should I use the complete sample size (N), or the sample size in each fold (~ N * ((K-1) / K))?

score 0 · Accepted Answer · answered Oct 12 '20 at 14:41

I was able to resolve the problem with p-values > 1 by going to the original R code (https://github.com/amices/mice/blob/master/R/mipo.R#L71), which shows that the linked post omitted the abs() call around the test statistic. The final test should look like this:

pt(q = abs(pooledMean / pooledSE), df = nu_BR, lower.tail = FALSE) * 2
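To see why the missing abs() produces p-values above 1, here is a minimal sketch with made-up numbers (t_stat and df are hypothetical, not taken from our data). With a negative test statistic, the upper-tail probability is already above 0.5, so doubling it exceeds 1:

```r
# Hypothetical values, for illustration only
t_stat <- -1.5   # a negative pooled test statistic (pooledMean / pooledSE)
df     <- 20     # some degrees of freedom

# Without abs(): upper tail of a negative statistic is > 0.5, so 2 * tail > 1
p_without_abs <- 2 * pt(q = t_stat, df = df, lower.tail = FALSE)

# With abs(): a proper two-sided p-value in [0, 1]
p_with_abs <- 2 * pt(q = abs(t_stat), df = df, lower.tail = FALSE)

print(c(without_abs = p_without_abs, with_abs = p_with_abs))
```

For a positive statistic the two versions agree, which is why the problem only showed up for some variables.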

After talking to some statisticians, we've come to the conclusion that this is a correct approach, though the degrees of freedom might need adjusting. In his write-up accompanying the mice package (https://stefvanbuuren.name/fimd/sec-whyandwhen.html), Stef van Buuren says the following:

The calculation of the degrees of freedom cannot be the same as for the complete data because part of the data is missing.

Since the data in our CV case are complete, it should be fine to just use the original (complete-data) degrees of freedom. Testing this on our data, we got very similar results with both approaches, though the adjusted degrees of freedom generally yielded larger p-values, i.e. a more conservative test.
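For anyone who wants to compare the two choices of degrees of freedom, below is a rough sketch of the pooling step, treating the K fold-wise fits like m imputed datasets. The function name pool_folds, the argument df_complete, and the example numbers are all hypothetical; df_complete stands for whatever complete-data residual degrees of freedom you decide on, and the small floor on lambda is only there to avoid dividing by zero when the fold estimates are identical.

```r
# Sketch: pool one coefficient across K folds with Rubin's rules.
# Assumption: folds are treated like m imputations; all names are illustrative.
pool_folds <- function(est, var_est, df_complete) {
  m <- length(est)
  pooledMean <- mean(est)        # pooled coefficient
  withinVar  <- mean(var_est)    # average within-fold variance
  betweenVar <- var(est)         # variance of the estimates across folds
  totalVar   <- withinVar + (1 + 1 / m) * betweenVar
  pooledSE   <- sqrt(totalVar)

  # Barnard-Rubin adjusted degrees of freedom
  lambda <- max((1 + 1 / m) * betweenVar / totalVar, 1e-04)
  nu_old <- (m - 1) / lambda^2
  nu_obs <- (df_complete + 1) / (df_complete + 3) * df_complete * (1 - lambda)
  nu_BR  <- nu_old * nu_obs / (nu_old + nu_obs)

  t_stat <- abs(pooledMean / pooledSE)   # abs() keeps the p-value <= 1
  c(
    estimate   = pooledMean,
    se         = pooledSE,
    p_adjusted = 2 * pt(q = t_stat, df = nu_BR, lower.tail = FALSE),
    p_complete = 2 * pt(q = t_stat, df = df_complete, lower.tail = FALSE)
  )
}

# Hypothetical coefficients and variances of X from K = 5 folds
est     <- c(-0.42, -0.35, -0.51, -0.40, -0.46)
var_est <- c(0.040, 0.038, 0.043, 0.041, 0.039)

res <- pool_folds(est, var_est, df_complete = 195)
print(res)
```

Because the adjusted degrees of freedom are never larger than df_complete, p_adjusted is at least as large as p_complete for the same statistic, which matches what we saw on our data.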
