
I am trying to predict age as a function of a set of DNA methylation markers. These predictors are continuous and range between 0 and 100. When I fit an OLS regression, I can see that the residual variance increases with age.

Thus, I decided to fit a weighted regression model. However, I am having trouble deciding how to define the weights for my model. I have used the feasible GLS (fGLS) method, like so:

OLSressq <- OLSres^2                 # Square residuals
lnOLSressq <- log(OLSressq)          # Take natural log of squared residuals
aux <- lm(lnOLSressq~X)              # Run auxiliary model
ghat <- fitted(aux)                  # Predict g^
hhat <- exp(ghat)                    # Create h^
fGLS <- lm(Y~X, weights = 1/hhat)    # Weight is 1/h^

And these were my results:

Call:
lm(formula = Y ~ X, weights = 1/hhat)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-4.9288 -1.2491 -0.1325  1.2626  5.1452 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.1009494  5.2299867   4.417 1.64e-05 ***
XASPA       -0.1441404  0.0474738  -3.036  0.00271 ** 
XPDE4C       0.6421385  0.0812891   7.899 1.83e-13 ***
XELOVL2     -0.2040382  0.0866564  -2.355  0.01951 *  
XELOVL2sq    0.0088532  0.0009381   9.438  < 2e-16 ***
XEDARADD    -0.1965472  0.0348989  -5.632 5.98e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.762 on 200 degrees of freedom
Multiple R-squared:  0.9687,    Adjusted R-squared:  0.9679 
F-statistic:  1239 on 5 and 200 DF,  p-value: < 2.2e-16

However, before figuring out how to perform the fGLS method, I was playing around with different weights just to see what would happen. I used 1/(squared residuals of OLS model) as weights and ended up with this:

Call:
lm(formula = Y ~ X, weights = 1/OLSressq)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-1.0893 -0.9916 -0.7855  0.9998  2.0238 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.8756737  1.1355861   27.19   <2e-16 ***
XASPA       -0.1956188  0.0116329  -16.82   <2e-16 ***
XPDE4C       0.6168490  0.0102149   60.39   <2e-16 ***
XELOVL2     -0.1596969  0.0116723  -13.68   <2e-16 ***
XELOVL2sq    0.0078459  0.0001593   49.26   <2e-16 ***
XEDARADD    -0.2492048  0.0068751  -36.25   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 200 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.133e+06 on 5 and 200 DF,  p-value: < 2.2e-16

Since the residual standard error is smaller, R² equals 1 (is that even possible?), and the F-statistic is a lot higher, I am tempted to conclude that this model is better than the one I obtained through the fGLS method. However, it seems to me that randomly picking weights through trial and error should always yield worse results than actually trying to estimate the correct weights mathematically.

Can someone give me some advice on which weights to use for my model? I have also read here and there that you cannot interpret R² in the same way as in OLS regression. But then how should it be interpreted, and can I still use it to somehow compare my WLS model to my OLS model?

I. Smeers
    I'd be very cautious about that `R-squared = 1`. Do you have an idea of what the potential weights should be? Sounds like `1/(squared residuals of OLS model)` was just a guess. If you're in the dark about the weights, I suggest using GLS or Iterative Weighted Least Squares. – Jon Nov 15 '16 at 17:01
  • I have to add that when fitting the same model to a training set (half of my original data), the R-squared went down from 1 to 0.9983. It was indeed just a guess, which is why I eventually used fGLS as described above. Is that what you mean by "I suggest using GLS"? I am just confused as to why the model I made by just guessing the weights seems to be a better fit than the one I made by estimating the weights through fGLS. I have not yet heard of Iterative Weighted Least Squares, but I will look into it. Thank you. – I. Smeers Nov 15 '16 at 17:27
  • @Jon, feasible GLS requires you to specify the weights (while infeasible GLS which uses theoretically optimal weights is not a feasible estimator, i.e. it cannot be used in practice). – Richard Hardy Nov 15 '16 at 18:11
  • Yes, that's correct. They could however specify the correlation structure in the `nlme::gls` function. The `nlme::corClasses` provides a list of different correlation structures. – Jon Nov 19 '16 at 20:42

2 Answers


There are two issues here

  1. You would, ideally, use weights inversely proportional to the variance of the individual $Y_i$. So says the Gauss-Markov Theorem.

  2. You don't know the variance of the individual $Y_i$.

If you have deterministic weights $w_i$, you are in the situation that WLS/GLS are designed for. One traditional example is when each observation is an average of multiple measurements, and $w_i$ is the number of measurements.
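As a concrete illustration, here is a minimal R sketch of that situation; the data are simulated and the numbers are assumptions for illustration, not anything from the question:

set.seed(2)
n_i <- sample(1:10, 50, replace = TRUE)       # measurements behind each average
x   <- runif(50)
y   <- 2 + 3*x + rnorm(50, sd = 1/sqrt(n_i))  # Var(y_i) = sigma^2 / n_i
fit <- lm(y ~ x, weights = n_i)               # weights proportional to 1/Var(y_i)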

If you have weights that depend on the data through a small number of parameters, you can treat them as fixed and use them in WLS/GLS even though they aren't fixed. For example, you could estimate $\sigma^2(\mu)$ as a function of the fitted $\mu$ and use $w_i=1/\sigma^2(\mu_i)$ -- this seems to be what you are doing in the first example. This is also what happens in linear mixed models, where the weights for the fixed-effects part of the model depend on the variance components, which are estimated from the data.
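For concreteness, a minimal R sketch of that variance-function approach on simulated data (the data-generating step and the log-scale variance model are assumptions for illustration):

set.seed(3)
x    <- runif(200, 20, 80)
y    <- 10 + 0.8*x + rnorm(200, sd = 0.1*x)        # spread grows with the mean
fit0 <- lm(y ~ x)                                  # preliminary OLS fit
vmod <- lm(log(residuals(fit0)^2) ~ fitted(fit0))  # model log sigma^2(mu) on fitted mu
w    <- 1/exp(fitted(vmod))                        # w_i = 1/sigma^2(mu_i)
fit1 <- lm(y ~ x, weights = w)                     # treat the estimated weights as fixed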

In this scenario it is possible to prove that although there is some randomness in the weights, it does not affect the large-sample distribution of the resulting $\hat\beta$. It's ok to treat the $w_i$ as if they were known in advance.

If you have weights that are not nearly deterministic, the whole thing breaks down and the randomness in the weights becomes important for both bias and variance. That's what happens in your second example, when you use $w_i=1/r_i^2$. It's an obvious thing to think of, but it doesn't work. The estimating equations (normal equations, score equations) for $\hat\beta$ are $$\sum_i x_iw_i(y_i-x_i\beta)=0$$ With that choice of weights, you get $$\sum_i x_i\frac{(y_i-x_i\beta)}{(y_i-x_i\hat\beta^*)^2}=0$$ where $\hat\beta^*$ is the unweighted estimate. If the new estimate is close to the old one (which should be true for large data sets, because both are consistent), you'd end up with equations like $$\sum_i x_i\frac{1}{(y_i-x_i\beta)}=0$$ which divides by a variable with mean zero, a bad sign.
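A minimal simulation (an illustration on made-up data, not the question's) makes the breakdown visible: with $w_i=1/r_i^2$ the weighted fit reports an $R^2$ near 1 even when the predictor is pure noise.

set.seed(1)
x   <- rnorm(200)
y   <- rnorm(200)                               # y is unrelated to x
ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1/residuals(ols)^2)
summary(ols)$r.squared                          # near 0, as it should be
summary(wls)$r.squared                          # near 1, an artifact of the weights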

So:

It's ok to estimate the weights if you have a good mean model (so that the squared residuals are approximately unbiased for the variance) and as long as you don't overfit them. If you do overfit them, you will get a bad estimate of $\beta$ and inaccurate standard errors.

Thomas Lumley

Why are you using fGLS? Do you have heteroscedasticity and correlation between the residuals, and is the variance-covariance matrix unknown? Try `bptest(your_model)`: if the p-value is less than alpha (e.g., 0.05), there is heteroscedasticity. Then check whether there is correlation between the residuals with a Durbin-Watson test, `dwtest(your_model)`: if the statistic is between 1 and 3 (a value near 2 indicates no autocorrelation), there isn't correlation. So if you have only heteroscedasticity, you should use WLS, like this:

mod_lin <- lm(Price ~ Weight + HP + Disp., data = df)                 # original OLS fit
wts     <- 1/fitted(lm(abs(residuals(mod_lin)) ~ fitted(mod_lin)))^2  # inverse of squared fitted |residuals|
mod2    <- lm(Price ~ Weight + HP + Disp., data = df, weights = wts)  # same model, refit with WLS

So mod2 is the old model, now refit with WLS.
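For the diagnostic tests mentioned above, a minimal sketch; note that both functions come from the `lmtest` package, not base R:

library(lmtest)
bptest(mod_lin)   # Breusch-Pagan test for heteroscedasticity
dwtest(mod_lin)   # Durbin-Watson test for autocorrelation in the residuals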

An R-squared of 1 is too weird; maybe there is collinearity.

gung - Reinstate Monica
    Why would a D-W test be appropriate? I think of it as only used for autocorrelation, and I don't see how that would apply in this case. – meh Mar 21 '19 at 14:57
    Welcome to xvalidated! Please specify which package the functions `bptest` and `dwtest` come from, as they are not part of the standard R distribution. – Helix123 Mar 21 '19 at 17:18
  • Because you need to understand which estimator is best: WLS, fGLS, OLS, etc. – Lorenzo Famiglini Mar 22 '19 at 15:12