I'm doing a multiple regression analysis in R on an epidemiological data set with five predictor variables as follows:
y ~ x1 + x2 + x3 + x4 + x5
With my data (n = 240), I get a highly significant p value (< 10^-8) for x1, and p values in the range of 0.09 to 0.35 for the remaining four variables.
Power analysis using pwr.f2.test() in the pwr library in R shows that the power is near 1. (Incidentally, allowing factor interactions does not significantly improve the fit. The residuals are Gaussian.)
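For reference, the overall power calculation was along these lines. This is only a minimal sketch: the R^2 value is a placeholder chosen for illustration (not my actual fit), and f2 = R^2 / (1 - R^2) is Cohen's effect size for the full model.

```r
library(pwr)

n  <- 240    # sample size from the question
p  <- 5      # number of predictors in the model
r2 <- 0.30   # placeholder R^2 (assumed for illustration, not my actual value)

# Cohen's f^2 for the overall model
f2 <- r2 / (1 - r2)

# u = numerator df (number of predictors tested),
# v = denominator df (n - p - 1)
overall <- pwr.f2.test(u = p, v = n - p - 1, f2 = f2, sig.level = 0.05)
overall$power
```

Note that this is the power of the overall F test for all five predictors jointly, which says nothing directly about the power for any single coefficient.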
I need to determine whether we should collect more data and, if so -- this is the crux of the problem -- what the sample size should be in order to provide a power of >= 0.90 for each individual predictor variable. For instance, the p value for x2 is 0.091. How much additional data do we need to collect to avoid a Type II error (at the >= 0.90 power level) for this variable? Or, more generally, how can I determine these requisite individual sample sizes (preferably in R)?
I'm considering the following two approaches:
- Use simple linear regression: Model each of the remaining variables individually (e.g., y ~ xi, where xi refers to x2, x3, x4 or x5 by itself), and use the results as inputs to pwr.f2.test().
- Work off the above multiple regression model: Calculate the partial regression values corresponding to each individual variable from the multiple regression model, and use them as inputs to pwr.f2.test().
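To make the second approach concrete, here is a rough sketch for x2 using only the numbers quoted above (n = 240, five predictors, p = 0.091). It recovers the t statistic implied by the two-sided p value, converts it to a partial f2 via f2 = t^2 / (n - p - 1), and asks pwr.f2.test() for the denominator degrees of freedom that would give 0.90 power. The variable names are mine, and the calculation assumes the observed effect for x2 equals the true effect, which may well be the weak point of this approach.

```r
library(pwr)

n    <- 240     # current sample size
p    <- 5       # predictors in the full model
p_x2 <- 0.091   # observed two-sided p value for x2 in the full model

v_now <- n - p - 1                      # residual df of the current fit
t_x2  <- qt(1 - p_x2 / 2, df = v_now)   # |t| implied by the p value
f2_x2 <- t_x2^2 / v_now                 # partial f^2 for a single coefficient

# Leave v unspecified so that it is solved for;
# u = 1 because a single coefficient is being tested.
res <- pwr.f2.test(u = 1, f2 = f2_x2, sig.level = 0.05, power = 0.90)

# Convert the required denominator df back to a total sample size
n_needed <- ceiling(res$v) + p + 1
n_needed
```

With the fitted models at hand, the same partial f2 could instead be computed as (R2_full - R2_reduced) / (1 - R2_full), where the reduced model drops x2. The first approach would replace f2_x2 with R2 / (1 - R2) from the simple regression y ~ x2 and use p = 1.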
Is either of these approaches statistically sound or is there a better way for doing this? Thank you very much in advance.