I am trying to predict age as a function of a set of DNA methylation markers. These predictors are continuous between 0 and 100. When performing OLS regression, I can see that variance increases with age.
Thus, I decided to fit a weighted regression model. However, I am having trouble deciding how to define the weights for my model. I have used the fGLS method, like so:
OLSressq <- OLSres^2 # Square residuals
lnOLSressq <- log(OLSressq) # Take natural log of squared residuals
aux <- lm(lnOLSressq~X) # Run auxillary model
ghat <- fitted(aux) # Predict g^
hhat <- exp(ghat) # Create h^
fGLS <- lm(Y~X, weights = 1/hhat) # Weight is 1/h^
And these were my results:
Call:
lm(formula = Y ~ X, weights = 1/hhat)
Weighted Residuals:
Min 1Q Median 3Q Max
-4.9288 -1.2491 -0.1325 1.2626 5.1452
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.1009494 5.2299867 4.417 1.64e-05 ***
XASPA -0.1441404 0.0474738 -3.036 0.00271 **
XPDE4C 0.6421385 0.0812891 7.899 1.83e-13 ***
XELOVL2 -0.2040382 0.0866564 -2.355 0.01951 *
XELOVL2sq 0.0088532 0.0009381 9.438 < 2e-16 ***
XEDARADD -0.1965472 0.0348989 -5.632 5.98e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.762 on 200 degrees of freedom
Multiple R-squared: 0.9687, Adjusted R-squared: 0.9679
F-statistic: 1239 on 5 and 200 DF, p-value: < 2.2e-16
However, before figuring out how to perform the fGLS method, I was playing around with different weights just to see what would happen. I used 1/(squared residuals of OLS model) as weights and ended up with this:
Call:
lm(formula = Y ~ X, weights = 1/OLSressq)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.0893 -0.9916 -0.7855 0.9998 2.0238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.8756737 1.1355861 27.19 <2e-16 ***
XASPA -0.1956188 0.0116329 -16.82 <2e-16 ***
XPDE4C 0.6168490 0.0102149 60.39 <2e-16 ***
XELOVL2 -0.1596969 0.0116723 -13.68 <2e-16 ***
XELOVL2sq 0.0078459 0.0001593 49.26 <2e-16 ***
XEDARADD -0.2492048 0.0068751 -36.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1 on 200 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.133e+06 on 5 and 200 DF, p-value: < 2.2e-16
Since the residual standard error is smaller, R² equals 1 (is that even possible?) and the F statistic is a lot higher, I am tempted to assume this model is better than what I achieved through the fGLS method. However, it seems to me that randomly picking weights through trial and error should always yield worse results than when you actually mathematically try to estimate the correct weights.
Can someone give me some advice on which weights to use for my model? I have also read here and there that you cannot interpret R² in the same way you would when performing OLS regression. But then how should it be interpreted and can I still use it to somehow compare my WLS model to my OLS model?