
In ordinary least squares, regressing a target vector $y$ against a set of predictors $X$, the hat matrix is computed as

$$H = X (X^tX)^{-1} X^t$$

and the PRESS (predicted residual sum of squares) is calculated by

$$SS_P = \sum_i \left( \frac{e_i}{1-h_{ii}}\right)^2$$

where $e_i$ is the $i$th residual and the $h_{ii}$ are the diagonal elements of the hat matrix.
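
As a quick numerical check, here is a minimal numpy sketch on synthetic data showing that this shortcut agrees with explicit leave-one-out refits (the variable names and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # include an intercept column
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n)

# PRESS via the hat-matrix shortcut
H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y                                   # ordinary residuals
press = np.sum((e / (1 - np.diag(H))) ** 2)

# PRESS by explicit leave-one-out refits
press_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press_loo += (y[i] - X[i] @ beta_i) ** 2

print(press, press_loo)  # the two agree to numerical precision
```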

In ridge regression with penalty coefficient $\lambda$, the hat matrix is modified to be

$$H = X (X^t X + \lambda I)^{-1} X^t$$

Can the PRESS statistic be calculated in the same way, using the modified hat matrix?

Chris Taylor

2 Answers


Yes, I use this method a lot for kernel ridge regression, and it is a good way of selecting the ridge parameter (see e.g. this paper [doi,preprint]).

A search for the optimal ridge parameter can be made very efficient if the computations are performed in canonical form (see e.g. this paper), where the model is re-parameterised so that only the inverse of a diagonal matrix is required.
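
As a concrete illustration, a minimal sketch on synthetic data (no intercept, all coefficients penalised, and a fixed λ that is not rescaled when an observation is dropped) shows that PRESS computed from the ridge hat matrix matches explicit leave-one-out ridge refits:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 5, 2.0            # lam is an arbitrary penalty value
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Ridge coefficients (all columns penalised, no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# PRESS from the ridge hat matrix, exactly as in the question
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
e = y - H @ y
press = np.sum((e / (1 - np.diag(H))) ** 2)

# Explicit leave-one-out ridge refits with the same (fixed) lambda
press_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta_i = ridge_fit(X[mask], y[mask], lam)
    press_loo += (y[i] - X[i] @ beta_i) ** 2

print(press, press_loo)  # agree to numerical precision
```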

Dikran Marsupial
  • Thanks. In your experience, if you use PRESS to select the ridge parameter, how does your actual prediction error on a test set compare with your measured PRESS on the training set? Presumably (PRESS / n) is an underestimate of the prediction error, but is it reliable in practice? – Chris Taylor Jul 18 '12 at 13:27
  • PRESS is approximately unbiased; the real problem with it is the variance, which means that there is a lot of variability depending on the particular sample of data on which it is evaluated. This means that if you optimise PRESS in model selection, you can over-fit the model selection criterion and end up with a poor model. However, for the type of model I am interested in (kernel learning methods) it is pretty effective, and the variance problem doesn't seem to be much worse than for other criteria that might be expected to work better. – Dikran Marsupial Jul 18 '12 at 13:32
  • If in doubt, you can always use bagging in addition to ridge regression as a sort of "belt-and-braces" approach to avoiding over-fitting. – Dikran Marsupial Jul 18 '12 at 13:33
  • Thanks for your help! I was under the impression that bagging didn't give any improvement in linear models, e.g. as claimed in the [Wikipedia article](http://en.wikipedia.org/wiki/Bootstrap_aggregating)? Can you clarify? – Chris Taylor Jul 18 '12 at 14:27
  • No problem. I suspect the Wikipedia article is incorrect; subset selection in linear regression is one of the examples Breiman uses in the original paper on bagging. It is possible that least-squares linear regression without subset selection is asymptotically unaffected by bagging, but even then I doubt that applies to linear models more generally (such as logistic regression). – Dikran Marsupial Jul 18 '12 at 16:03
  • Is it possible to use PRESS just as a validation method for a recursive least squares solution? I'm not even sure anymore whether I can use it purely as a validation method without also using it for parameter selection. My plan is: use recursive least squares (in a simplified scheme with lambda = 1) to estimate the parameters, then use PRESS on the output to validate the model. Is this possible, or complete nonsense? @DikranMarsupial – Verena Jun 15 '16 at 12:46
  • Do you know the PRESS formula for ridge regression assuming that intercept is not penalized? I see that you derive something similar (but more complicated) in your linked paper so I am wondering if you know the result for the simple linear ridge regression case. – amoeba Feb 26 '18 at 21:35
  • @amoeba No, I don't think so; if it extends to the kernel case, I am always looking for simpler/better ways to do things! – Dikran Marsupial Feb 27 '18 at 13:15
  • @DikranMarsupial Sorry, I probably gave the wrong impression in my comment: I did not mean that I know this formula. It's just that it *seems* that you are deriving a more general result in your paper so I was wondering if this result is applicable to the non-kernel case and perhaps the formula becomes even simpler then. I am also wondering if this is derived somewhere in the literature. I tried to do my own derivation in the morning and I think I have it but it's a bit messy and I did not test it yet. – amoeba Feb 27 '18 at 13:24
  • @amoeba I haven't seen a version for unregularised bias, but I suspect it can be done. I'll have a think about it (will need to remember the kernel version first! ;o) – Dikran Marsupial Feb 27 '18 at 13:53

The following data-augmentation approach can be used to apply L2 regularisation and obtain the PRESS statistic.

Assume you have N samples of Y and K explanatory variables X1, X2, ..., Xk, ..., XK.

  1. Add an additional variable X0 that equals 1 for each of the N samples (an intercept column)
  2. Augment with K additional samples where:
    • Y value is 0 for each of the K samples
    • X0 value is 0 for each of the K samples
    • Xk value for the kth additional sample is SQRT(Lambda * N) * [STDEV(Xk) over the N samples], and 0 in every other column (so the K extra rows form a diagonal block)
  3. There are now N+K samples and K+1 variables. A normal linear regression can be solved with these inputs.
  4. As this is a single ordinary regression fitted in one step, the PRESS statistic can be calculated as normal (see the sketch after this list).
  5. The Lambda regularisation input still has to be chosen; reviewing the PRESS statistic for a range of values of Lambda can help determine a suitable one.
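
A minimal numpy sketch of this recipe on synthetic data (the value of Lambda is arbitrary, and summing the PRESS contributions over the N real rows only is an assumption about step 4):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, lam = 60, 4, 0.5            # lam (Lambda) is an arbitrary penalty value
X = rng.normal(size=(N, K))
y = X @ rng.normal(size=K) + rng.normal(scale=0.5, size=N)

# Step 1: add the intercept column X0 = 1 for the N real samples
X_aug = np.column_stack([np.ones(N), X])

# Step 2: append K pseudo-samples with Y = 0 and X0 = 0, where the kth
# pseudo-sample has sqrt(lam * N) * std(Xk) in column k and zeros elsewhere
penalty_rows = np.zeros((K, K + 1))
penalty_rows[:, 1:] = np.diag(np.sqrt(lam * N) * X.std(axis=0))
X_aug = np.vstack([X_aug, penalty_rows])
y_aug = np.concatenate([y, np.zeros(K)])

# Steps 3-4: one ordinary least-squares fit on the N+K rows, then PRESS from
# its hat matrix, summing over the N real observations only
H = X_aug @ np.linalg.solve(X_aug.T @ X_aug, X_aug.T)
e = y_aug - H @ y_aug
press = np.sum((e[:N] / (1 - np.diag(H)[:N])) ** 2)
print(press)
```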
James65
  • May I ask you to please take a look at https://stats.stackexchange.com/questions/476935/the-best-way-to-compute-the-press-statistic and at https://stats.stackexchange.com/questions/476913/what-assumptions-must-be-satisfied-to-use-r2-to-compute-the-f-statistic – BillB Jul 13 '20 at 23:03