I have a dataset where I have the outcome data, and need to do a power analysis based on a missing covariate.

E.g.:

Surv(time, event) ~ X1 + X2 + ... + Xn + Z

I have all variables (time, event, X1 to Xn) but I am missing Z.

However, I have a previous dataset with the same covariates, including Z, which I'd like to use to simulate Z and perform a power analysis for its significance.

How can I properly simulate Z?

Measuring Z is expensive, and if there's very little chance of statistical significance of Z, I'd like to know that ahead of time.

thc
  • What does analysis of your previous data set say about the magnitude and standard error of the coefficient for Z? How many events were there in the previous data set, and how many in your present data set? – EdM Dec 16 '20 at 15:40
  • I expect the new dataset to behave very similarly in all ways to the previous one. You can think of the old and new dataset as sampling from the same population. – thc Dec 16 '20 at 17:26

1 Answer

You say in a comment:

I expect the new dataset to behave very similarly in all ways to the previous one. You can think of the old and new dataset as sampling from the same population.

Thus the best way to "simulate" Z will be to use the information you already have available from analysis of the previous data set.

The most important first step is to assure yourself that omitting Z does not unduly bias the coefficients for the other predictors or degrade the predictive performance of the model. Omitted-variable bias is even more of a problem in survival analysis than in ordinary linear regression: as in logistic regression, omitting any predictor associated with outcome can bias the other coefficients toward zero, even coefficients of predictors that are not correlated with the omitted one. With your previous data set, see what happens to the survival model if you omit Z; make sure that skipping the expensive tests for Z isn't a false economy that leads to problems with the model.

As an extra test about the importance of Z, try repeating the modeling with and without Z on multiple bootstrap samples from the previous data, and evaluating model performance in each case against the full original data set. That will both help evaluate sampling-related variability in the modeling (for samples of the size of the original data set) and show the magnitude of any biases or other problems that removing Z might induce.
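The bootstrap comparison described here can be sketched as a generic resampling loop. This is only a minimal outline, not a definitive implementation: `bootstrap_compare`, `fit_full`, `fit_reduced`, and `score` are hypothetical names introduced for illustration. In practice the two fit functions would fit the survival model with and without Z on the previous data, and `score` would be whatever performance measure you use (for example, a concordance index).

```python
import random

def bootstrap_compare(data, fit_full, fit_reduced, score, n_boot=200, seed=0):
    """Refit two models on bootstrap samples and score each on the original data.

    data        : list of observations from the previous data set
    fit_full    : hypothetical callable, fits the model including Z
    fit_reduced : hypothetical callable, fits the model omitting Z
    score       : hypothetical callable, score(model, data) -> float (higher is better)
    Returns one score difference (full minus reduced) per bootstrap replicate.
    """
    rng = random.Random(seed)
    n = len(data)
    diffs = []
    for _ in range(n_boot):
        # Resample rows with replacement, keeping the original sample size.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        model_full = fit_full(sample)
        model_reduced = fit_reduced(sample)
        # Evaluate both fitted models against the full original data set.
        diffs.append(score(model_full, data) - score(model_reduced, data))
    return diffs
```

The spread of the returned differences gives a rough picture of sampling variability, and their typical size shows how much performance is lost by dropping Z.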

If you're OK in those respects with omitting Z, and you don't care about the coefficient for Z itself, it's not really clear what you gain by further "simulation" to estimate the power to detect a non-0 coefficient of hypothesized magnitude for Z. If you are curious about how big a coefficient for Z you might be missing in your new data set, note that the standard error of a coefficient estimate (and thus the ability to distinguish its value from 0) is approximately inversely proportional to the square root of the number of events in the data set. You can thus take the standard error of the estimate for the Z coefficient from your previous data and analyses, and easily correct for any difference in the number of events between the previous and new data.
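As a rough illustration of that correction, the sketch below rescales a standard error from the previous data to the new number of events and then computes the approximate power of a two-sided Wald test. It assumes the usual large-sample approximation that the standard error scales with one over the square root of the number of events, and that the Z coefficient in the new data equals the previous estimate. The function names and all numeric inputs are hypothetical placeholders, not values from the question.

```python
from math import sqrt
from statistics import NormalDist

def scaled_se(se_old, events_old, events_new):
    """Rescale a standard error, assuming SE is proportional to 1/sqrt(events)."""
    return se_old * sqrt(events_old / events_new)

def wald_power(beta, se, alpha=0.05):
    """Approximate power of a two-sided Wald test of H0: coefficient = 0."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = 0.05
    shift = abs(beta) / se            # expected |z| statistic under the alternative
    # Probability of rejecting in either tail.
    return z.cdf(shift - crit) + z.cdf(-shift - crit)

# Hypothetical inputs: replace with the estimate and event counts from your data.
beta_z = 0.30        # log hazard ratio for Z from the previous analysis (assumed)
se_old = 0.15        # its standard error in the previous analysis (assumed)
events_old = 200     # events in the previous data set (assumed)
events_new = 400     # events in the new data set (assumed)

se_new = scaled_se(se_old, events_old, events_new)
print(f"projected SE: {se_new:.3f}, approximate power: {wald_power(beta_z, se_new):.2f}")
```

This ignores uncertainty in the previous estimate of the coefficient itself, so it is best read as an optimistic first approximation.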

EdM