
Consider a good old regression problem with $p$ predictors and sample size $n$. The usual wisdom is that the OLS estimator will overfit and will generally be outperformed by the ridge regression estimator: $$\hat\beta = (X^\top X + \lambda I)^{-1}X^\top y.$$ It is standard to use cross-validation to find an optimal regularization parameter $\lambda$; here I use 10-fold CV. Clarification update: when $n<p$, by "OLS estimator" I mean the "minimum-norm OLS estimator" given by $$\hat\beta_\text{OLS} = (X^\top X)^+X^\top y = X^+ y.$$
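
For concreteness, both estimators can be computed from the SVD $X=USV^\top$; here is a minimal NumPy sketch of the two formulas above (an illustration only, not the exact code used for the figures below):

import numpy as np

def ridge_beta(X, y, lam):
    # (X'X + lam*I)^{-1} X'y via the SVD of X; works for n >= p and n < p
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ (s / (s**2 + lam) * (U.T @ y))

def min_norm_ols_beta(X, y):
    # minimum-norm OLS estimator X^+ y, i.e. the lam -> 0 limit of ridge
    return np.linalg.pinv(X) @ y

With $\lambda$ tiny compared to the squared singular values (e.g. `ridge_beta(X, y, 1e-12)` for standardized data), the ridge estimate is numerically indistinguishable from `min_norm_ols_beta(X, y)`.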

I have a dataset with $n=80$ and $p>1000$. All predictors are standardized, and there are quite a few that (alone) can do a good job predicting $y$. If I randomly select a small-ish number of predictors, say $p=50<n$, I get a reasonable CV curve: large values of $\lambda$ yield zero R-squared, small values of $\lambda$ yield negative R-squared (because of overfitting), and there is some maximum in between. For $p=100>n$ the curve looks similar. However, for $p$ much larger than that, e.g. $p=1000$, I do not get any maximum at all: the curve plateaus, meaning that OLS with $\lambda\to 0$ performs as well as ridge regression with optimal $\lambda$.

[Figure: cross-validation curves for different values of $p$.]

How is this possible and what does it say about my dataset? Am I missing something obvious, or is it indeed counter-intuitive? How can there be any qualitative difference between $p=100$ and $p=1000$, given that both are larger than $n$?

Under what conditions does the minimum-norm OLS solution for $n<p$ not overfit?


Update: There was some disbelief in the comments, so here is a reproducible example using glmnet. I use Python but R users will easily adapt the code.

%matplotlib notebook

import numpy as np
import pylab as plt
import seaborn as sns; sns.set()

import glmnet_python    # from https://web.stanford.edu/~hastie/glmnet_python/
from cvglmnet import cvglmnet; from cvglmnetPlot import cvglmnetPlot

# 80x1112 data table; first column is y, rest is X. All variables are standardized
mydata = np.loadtxt('../q328630.txt')   # file is here https://pastebin.com/raw/p1cCCYBR
y = mydata[:,:1]
X = mydata[:,1:]

# select p here (try 1000 and 100)
p = 1000

# randomly selecting p variables out of 1111
np.random.seed(42)
X = X[:, np.random.permutation(X.shape[1])[:p]]

fit = cvglmnet(x = X.copy(), y = y.copy(), alpha = 0, standardize = False, intr = False, 
               lambdau=np.array([.0001, .001, .01, .1, 1, 10, 100, 1000, 10000, 100000]))
cvglmnetPlot(fit)
plt.gcf().set_size_inches(6,3)
plt.tight_layout()

[Figures: `cvglmnet` cross-validation plots for $p=1000$ and $p=100$.]

amoeba
  • You're kidding, right? With a sample of 80 observations and 10-fold CV doesn't that result in 8 observations per fold? Even with *p=10* that's still way too many predictors for traditional OLS. Even PLS would find that challenging. How are these k-folds being created? – Mike Hunter Feb 14 '18 at 18:08
  • @DJohnson No kidding. Usual 10-fold CV, meaning each training set has n=72 and each test set has n=8. – amoeba Feb 14 '18 at 18:18
  • @BenoitSanchez My own implementation: I write down the formula for the ridge estimator in terms of the SVD of X. It's only a couple of lines of code. – amoeba Feb 14 '18 at 18:19
  • That is far from a *usual* CV. Given that, how could one expect anything like a detectable result? – Mike Hunter Feb 14 '18 at 19:35
  • What is the rank of $X$ (p=1000)? 80 or less? – Benoit Sanchez Feb 14 '18 at 21:43
  • @DJohnson I don't understand why you are saying this is far from usual. This is what 10-fold CV is. – amoeba Feb 14 '18 at 22:13
  • That is indeed weird. I can only see two possibilities: (1) as @BenoitSanchez suggested, the rank is very low, or (2) there are exact (or almost exact) copies of the training set samples in the test set. The latter can explain how a model that totally overfits can still perform reasonably well on the test set. The third option is (as always) that there is a bug somewhere in the code. – George Feb 14 '18 at 23:18
  • @George Thanks. Regarding how weird it is -- can you actually find any example of cross-validation for high-dimensional ($n\ll p$) ridge regression on some real data? It would be interesting to see what CV curves "typically" look like in this setting. – amoeba Feb 14 '18 at 23:27
  • Regarding possible bugs, I will post a `cv.glmnet` output tomorrow. – amoeba Feb 14 '18 at 23:37
  • is it possible that the first 500 p are pretty noisy; and the next 500 are very un-noisy? What happens if you shuffle the p before you partition into p < 500, p >= 500? – Hugh Perkins Feb 15 '18 at 02:08
  • @HughPerkins For this experiment, I am selecting $p$ features **randomly** from a larger set (of around 1500 in this case). I always get a plateau for >>500 features and never get a plateau for ~100 features. It's very consistent. – amoeba Feb 15 '18 at 09:40
  • @BenoitSanchez Thanks for your answer attempt anyway. Actually by now I do have a hunch. I am thinking that this might be happening when $y$ is well predicted by the leading PCs of $X$, let's say by the 1st. Then $\beta_0$ will be close to the PC1 direction and PC1 will tend to be preserved between training and test set... – amoeba Feb 15 '18 at 10:02
  • I agree with @DJohnson that the problem's dimensions are worrying. I would strongly suggest using bootstrap for such a small sample. The variability of the CV procedure would be quite significant. (Great question obviously!) – usεr11852 Feb 15 '18 at 10:36
  • What exactly do you mean by bootstrap in this case @usεr11852? (The problem's dimensions are *completely standard* for genomic data btw). – amoeba Feb 15 '18 at 10:37
  • Use bootstrap to estimate the error rather than 10-fold CV. (That said, the consistency of the shape you report probably shows that the variability is not a huge problem) – usεr11852 Feb 15 '18 at 10:38
  • What is bootstrap estimate of the error? Do you mean holding out random test set of size 8 many times, more than 10? (I wouldn't call it "bootstrap" though.) If so, then I agree: I usually use 100x repeated 10-fold CV, which is similar. @usεr11852 – amoeba Feb 15 '18 at 10:39
  • Cool then. I thought it was a single 10-fold CV. – usεr11852 Feb 15 '18 at 10:41
  • @usεr11852 Yeah, in this Q it *is* actually a single 10-fold CV, but I ran this several times and the general effect that I am talking about was consistent. But in general I would always use repeated CV. – amoeba Feb 15 '18 at 10:42
  • random comment: might be worth editing the questions and answers from the comments into the original question (this comment is directed at amoeba ) – Hugh Perkins Feb 15 '18 at 14:03
  • @HughPerkins I have added the data and the code to the Q. I don't think I can clarify it any further but if you have any suggestions for what I should add, let me know or even better - feel free to edit the Q. – amoeba Feb 15 '18 at 14:25
  • random comment: the 0-regularisation solution <> the lambda->0 solution. As mentioned by @jonnyLomond, any regularised average of the 1000 inputs works equally well (remembering that the L2 norm encourages lots of small weights). This assumes, as he did, correlated inputs... – seanv507 Feb 15 '18 at 23:41
  • @seanv507 What do you mean by "0 regularisation solution <> lambda-> 0 solution"? Why is that? – amoeba Feb 15 '18 at 23:49
  • I meant that with lambda = 0, there is no unique solution and large opposite-sign coefficients are valid, but as soon as lambda>0 all those crazy oscillating solutions disappear and you are left with a unique solution, having smallest L2 norm, which in the case of $x_i=y+ \sigma$ (i.e. noisy copies of the target) will be the average of all inputs. – seanv507 Feb 16 '18 at 07:16
  • @seanv507 I see. Well, I suggest defining "solution with lambda=0" as the "minimal-norm solution with lambda=0". I guess my question can be reformulated as follows: *Under what conditions will the minimal-norm OLS solution with $n<p$ not overfit?* – amoeba Feb 16 '18 at 08:03
  • @amoeba: Thank you for this question. It has been extremely instructive and interesting so far. – usεr11852 Feb 16 '18 at 22:51
  • I wouldn't have pegged you as a Python user :-) – DeltaIV Feb 17 '18 at 23:38
  • @DeltaIV I've been using Matlab for many years but switched to Python last year... Am still feeling more comfortable with Matlab to be honest (and am occasionally using it for some quick and dirty tasks, like here in my answer below), but am doing all the real work in Python now. – amoeba Feb 17 '18 at 23:46
  • I have this expectation that all CV experts are R gurus, which doesn't make sense of course :-) do you have 5 minutes to chat on Ten Fold? I'd like to ask you what resources you used to learn Python – DeltaIV Feb 18 '18 at 00:01
  • @amoeba I'm perhaps veering heavily into off-topic, but as you mentioned you are also a Matlab user, have you considered Julia for numerical computing? It shares many features of Matlab, but I find it way easier to use and deploy, and it's usually faster (as it's compiled and enforces some type stability). I'd consider it on par with Cython or Numba for numerical computation (way more optimized than vanilla Python), but it's written mostly in Julia, so you can debug all the way down to the most basic functions, instead of interfacing C or Fortran for faster code. – Firebug Feb 18 '18 at 12:00
  • Now, on topic, I feel the new title gives new meaning to the question, and I'd use completely theoretical arguments to try to answer it, as opposed to the empirical evidence brought up. – Firebug Feb 18 '18 at 14:45
  • @Firebug Re Julia: thanks, this is interesting. I've never seriously considered it to be honest; I had the impression that it's still rather "niche". Perhaps this is changing. Re title: not sure what you mean - are you suggesting that I change the title back? I'd certainly prefer theoretical answers (and that was my intention from the beginning). I changed the title because it seemed that the thread is becoming more general than I expected from the beginning. – amoeba Feb 18 '18 at 15:44
  • @amoeba Hmm, no, it was just an observation. It's just that the first question in the title is easily answered from a theoretical point of view, and the second ties to the first. Given your expertise, I think one could possibly answer both in a few sentences. Should answers try to delve into your empirical experiments? – Firebug Feb 18 '18 at 15:55
  • @Firebug Hmm. There might be some misunderstanding. The new title asks exactly the same thing as before: why does ridge not help when $n\ll p$, in the sense that the OLS solution is as good as the best ridge solution? How can OLS fail to overfit in this scenario? If you can answer this in a few sentences then please do. – amoeba Feb 18 '18 at 16:28
  • This is a very interesting question! I am not sure I understand the set up. Forgetting for the moment about ridge regression (or lambda = 0), the first diagram suggests that this is a problem where OLS gets better (higher test R2 is good?) the larger we make p, even though n is small. That also seems rather counter-intuitive - have I misunderstood? – Dikran Marsupial Feb 19 '18 at 13:20
  • @DikranMarsupial Yes, the larger the $p$, the better the performance of OLS ("minimum-norm OLS"). I agree that is counter-intuitive, and this is exactly the main issue here: how can OLS be so good when $p$ is much larger than $n$. That's what the second sentence in my title is supposed to convey... My expectation 1 week ago would have been that OLS will necessarily produce absolute garbage whenever $p$ is larger than $n$. – amoeba Feb 19 '18 at 13:42
  • I suspect use of the normal equations would result in garbage in those circumstances, but the "minimum-norm" is effectively adding an implicit ridge regularisation already (see my answer below) which should at least ameliorate the garbage, but it is unclear why models with fewer inputs require less regularisation. What does LASSO say about the problem (are all features informative)? – Dikran Marsupial Feb 19 '18 at 14:02
  • @DikranMarsupial Yes, I think the idea that **minimum-norm OLS is effectively performing some shrinkage similar to the ridge** is exactly the key. This I think is what several answers here are converging onto, but I am still confused about the details of how this happens. It seems only to work when $p$ is MUCH larger than $n$. To be honest, I am baffled that this is not described in the literature (I've failed to find anything so far). Re lasso: in my actual application I am using elastic net (and most features are *not* informative); this question has arguably only academic interest :-) – amoeba Feb 19 '18 at 14:10
  • Just off the top of my head, I think what you are observing is related to one of the original motivations for compressive sensing. Supposedly what happened is that Candes and his student fit an underdetermined system which perfectly fit the data, but constrained the solution to minimize a total variation norm, and were shocked this worked as well as it did. In this case you are constraining the L2 norm instead, essentially getting a different type of minimum-complexity estimator of the coefficients. L2 doesn’t detect sparse signals so I guess this is less interesting as regularization. – guy Feb 19 '18 at 15:21
  • Adding to my previous comment, in their case they were considering, I believe, a noiseless setting, so a solution that fit the data perfectly makes sense as a constraint from their view. Accounting for noise leads to the Dantzig selector. – guy Feb 19 '18 at 15:27
  • @guy That's a very interesting perspective! Thanks a lot. Do you have any citations for this story (about Candes et al being shocked etc.)? – amoeba Feb 19 '18 at 15:48
  • @amoeba Candes said something to this effect during some talk. I think I saw it on youtube, where he is going through the history of compressive sensing. You might be able to find it there. I also recall Terry Tao saying something to the effect that he was also very surprised when Candes came to him with the problem and results. – guy Feb 19 '18 at 16:48
  • One of my Twitter followers posted the following two resources, which look relevant and helpful: (1) http://dx.doi.org/10.2307/1267500 (2) http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf – Jake Westfall Feb 22 '18 at 21:31
  • @Jake These links are about ridge regression via augmenting X and y by fake samples. It is indeed very well-known and standard (and is covered in many threads on our site). This thread is about ridge regression via augmenting X [alone] with pure noise predictors. Completely different story. BTW, I found your tweet and the response; like how your follower dismissively says "well-known". Tell her, come on :) – amoeba Feb 22 '18 at 21:41
  • @amoeba You should reply to the Twitter thread :) – Jake Westfall Feb 22 '18 at 22:44
  • @JakeWestfall I can't: don't have a Twitter :-/ – amoeba Feb 22 '18 at 22:49
  • why are you using $R^2$? Just use L2 loss on test and train. No need to use $R^2$. – Charlie Parker Feb 27 '18 at 21:17
  • @Pinocchio This is just normalized L2 loss. If you prefer expected error, you can look at the `glmnet` plots that show mean squared error on the y-axis. – amoeba Feb 27 '18 at 22:37
  • see https://www.stat.cmu.edu/~ryantibs/papers/lsinter.pdf – user257566 Sep 19 '19 at 22:47
  • @amoeba How does your paper compare with the theory presented in Hastie et al.? Of specific interest is how they argue that CV-tuned ridge is better than minimum-norm OLS, while you guys argued the opposite. In your view, what causes the discrepancy? – user257566 Sep 27 '19 at 18:40
  • @Baer Do you refer to the paper you posted earlier? They consider beta pointing in a random direction; we consider beta pointing in the PC1 direction of $\Sigma$. Their results do not necessarily hold for any fixed beta. – amoeba Sep 27 '19 at 18:57
  • @amoeba Yes, I was. Thank you for explaining! I see the discussion now at the bottom of section 2.3 in your paper. It seems like there's also an interesting argument that real data examples could roughly have this property. – user257566 Sep 27 '19 at 19:03
  • @Baer Thanks. Yes, exactly. I will eventually include a reference to the Hastie et al. preprint. But what we say about Dobriban & Wager 18 paper should apply to the Hastie et al. too. – amoeba Sep 27 '19 at 19:27
  • Apparently this thread inspired this recent paper: https://arxiv.org/abs/1805.10939 – Sycorax Apr 17 '20 at 16:52
  • @SycoraxsaysReinstateMonica Yes it did ;-) I will eventually update the Q to mention that; I've been waiting for the paper to get accepted somewhere first... – amoeba Apr 17 '20 at 18:47

6 Answers


A natural regularization happens because of the presence of many small components in the theoretical PCA of $x$. These small components are implicitly used to fit the noise using small coefficients. When using minimum norm OLS, you fit the noise with many small independent components, and this has a regularizing effect equivalent to Ridge regularization. This regularization is often too strong, and it is possible to compensate for it using "anti-regularization", known as negative Ridge. In that case, you will see the minimum of the MSE curve appear at negative values of $\lambda$.

By theoretical PCA, I mean:

Let $x\sim N(0,\Sigma)$ be a multivariate normal distribution. There is a linear isometry $f$ such that $u=f(x)\sim N(0,D)$ where $D$ is diagonal: the components of $u$ are independent. $D$ is simply obtained by diagonalizing $\Sigma$.

Now the model $y=\beta.x+\epsilon$ can be written $y=f(\beta).f(x)+\epsilon$ (a linear isometry preserves dot product). If you write $\gamma=f(\beta)$, the model can be written $y=\gamma.u+\epsilon$. Furthermore $\|\beta\|=\|\gamma\|$ hence fitting methods like Ridge or minimum norm OLS are perfectly isomorphic: the estimator of $y=\gamma.u+\epsilon$ is the image by $f$ of the estimator of $y=\beta.x+\epsilon$.

Theoretical PCA transforms non-independent predictors into independent predictors. It is only loosely related to empirical PCA, where you use the empirical covariance matrix (which differs a lot from the theoretical one at small sample sizes). Theoretical PCA is not practically computable but is only used here to interpret the model in an orthogonal predictor space.
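
As a quick sanity check of the isometry argument (an illustration added here, not part of the original answer; it assumes a known population covariance $\Sigma$ and uses its eigenvectors as the map $f$), the ridge fit is unchanged by the rotation, up to the same rotation of the coefficients:

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 3.0

# an assumed example: some population covariance Sigma and data drawn from it
A = rng.standard_normal((p, p))
Sigma = A @ A.T
w, W = np.linalg.eigh(Sigma)                  # eigenvectors W define the isometry f
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = rng.standard_normal(n)
U_rot = X @ W                                 # u = f(x): independent components (in population)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta  = ridge(X, y, lam)                      # fit in the original coordinates
gamma = ridge(U_rot, y, lam)                  # fit in the rotated coordinates

print(np.allclose(gamma, W.T @ beta))         # True: gamma = f(beta)
print(np.allclose(X @ beta, U_rot @ gamma))   # True: identical fitted values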

Let's see what happens when we append many small variance independent predictors to a model:

Theorem

Ridge regularization with coefficient $\lambda$ is equivalent (when $p\rightarrow\infty$) to:

  • adding $p$ fake independent predictors (centred and identically distributed) each with variance $\frac{\lambda}{p}$
  • fitting the enriched model with minimum norm OLS estimator
  • keeping only the parameters for the true predictors

(sketch of) Proof

We are going to prove that the cost functions are asymptotically equal. Let's split the model into real and fake predictors: $y=\beta x+\beta'x'+\epsilon$. The cost function of Ridge (for the true predictors) can be written:

$$\mathrm{cost}_\lambda=\|\beta\|^2+\frac{1}{\lambda}\|y-X\beta\|^2$$

When using minimum norm OLS, the response is fitted perfectly: the error term is 0. The cost function is only about the norm of the parameters. It can be split into the true parameters and the fake ones:

$$\mathrm{cost}_{\lambda,p}=\|\beta\|^2+\inf\{\|\beta'\|^2 \mid X'\beta'=y-X\beta\}$$

In the right expression, the minimum norm solution is given by:

$$\beta'=X'^+(y-X\beta )$$

Now using SVD for $X'$:

$$X'=U\Sigma V$$

$$X'^{+}=V^\top\Sigma^{+} U^\top$$

We see that the norm of $\beta'$ essentially depends on the singular values of $X'^+$, which are the reciprocals of the singular values of $X'$. The normalized version of $X'$ is $\sqrt{p/\lambda} X'$. I've looked at the literature: the singular values of large random matrices are well known. For $p$ and $n$ large enough, the minimum $s_\min$ and maximum $s_\max$ singular values are approximated by (see theorem 1.1):

$$s_\min(\sqrt{p/\lambda}X')\approx \sqrt p\left(1-\sqrt{n/p}\right)$$ $$s_\max(\sqrt{p/\lambda}X')\approx \sqrt p \left(1+\sqrt{n/p}\right)$$

Since, for large $p$, $\sqrt{n/p}$ tends towards 0, we can just say that all singular values are approximated by $\sqrt p$. Thus:

$$\|\beta'\|\approx\frac{1}{\sqrt\lambda}\|y-X\beta\|$$

Finally:

$$\mathrm{cost}_{\lambda,p}\approx\|\beta\|^2+\frac{1}{\lambda}\|y-X\beta\|^2=\mathrm{cost}_\lambda$$

Note: it does not matter if you keep the coefficients of the fake predictors in your model. The variance introduced by $\beta'x'$ is $\frac{\lambda}{p}\|\beta'\|^2\approx\frac{1}{p}\|y-X\beta\|^2\approx\frac{n}{p}MSE(\beta)$. Thus you increase your MSE only by a factor of $1+n/p$, which tends towards 1 anyway. Somehow you don't need to treat the fake predictors differently from the real ones.
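
Here is a small numerical check of the theorem (an illustration added here, not part of the original answer): as the number of fake predictors grows, the first $p$ coefficients of the minimum-norm OLS fit on the augmented matrix approach the ridge fit on the true predictors.

import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 10, 5.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# ridge on the true predictors
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# minimum-norm OLS on [X, X'] with p_fake fake predictors of variance lam / p_fake
for p_fake in [100, 1000, 10000, 100000]:
    X_fake = rng.standard_normal((n, p_fake)) * np.sqrt(lam / p_fake)
    beta_aug = np.linalg.pinv(np.hstack([X, X_fake])) @ y
    print(p_fake, np.linalg.norm(beta_aug[:p] - beta_ridge))  # distance shrinks with p_fake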

Now, back to @amoeba's data. After applying theoretical PCA to $x$ (assumed to be normal), $x$ is transformed by a linear isometry into a variable $u$ whose components are independent and sorted in decreasing variance order. The problem $y=\beta x+\epsilon$ is equivalent to the transformed problem $y=\gamma u+\epsilon$.

Now imagine that the variances of the components look like this:

[Figure: component variances sorted in decreasing order, with a long tail of many small-variance components.]

Consider many of the last components, say $p$ of them, and call the sum of their variances $\lambda$. They each have a variance approximately equal to $\lambda/p$ and are independent. They play the role of the fake predictors in the theorem.

This fact is clearer in @jonny's model: only the first component of theoretical PCA is correlated to $y$ (it is proportional to $\overline{x}$) and has huge variance. All the other components (proportional to $x_i-\overline{x}$) have comparatively very small variance (write the covariance matrix and diagonalize it to see this) and play the role of fake predictors. I calculated that the regularization here corresponds (approximately) to the prior $N(0,\frac{1}{p^2})$ on $\gamma_1$, while the true $\gamma_1^2=\frac{1}{p}$. This definitely over-shrinks. This is visible in the fact that the final MSE is much larger than the ideal MSE. The regularization effect is too strong.
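
To make the "write the covariance matrix and diagonalize it" step concrete, here is a tiny sketch (added for illustration, not part of the original answer) using a simplified version of @jonny's model: each predictor is the target plus noise, the target has unit variance, and the noise variance is $s^2$. The population covariance then has one large eigenvalue $p+s^2$ and $p-1$ small eigenvalues equal to $s^2$:

import numpy as np

# simplified jonny-style model: x_i = y + noise_i, Var(y) = 1, noise variance s2,
# so the population covariance of x is Sigma = J + s2 * I (J = all-ones matrix)
p, s2 = 1000, 100.0
Sigma = np.ones((p, p)) + s2 * np.eye(p)
eigvals = np.linalg.eigvalsh(Sigma)
print(eigvals[-1], p + s2)   # largest eigenvalue: p + s2 (direction of x-bar)
print(eigvals[:3], s2)       # all the others: equal to s2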

It is sometimes possible to improve on this natural regularization with Ridge. First, you sometimes need the $p$ in the theorem to be really big (1000, 10000...) to seriously rival Ridge, so the finiteness of $p$ acts like an imprecision. But it also shows that Ridge is an additional regularization on top of a naturally existing implicit regularization and can thus have only a very small effect. Sometimes this natural regularization is already too strong and Ridge may not even be an improvement. More than that, it is then better to use anti-regularization: Ridge with a negative coefficient. This shows the MSE for @jonny's model ($p=1000$), using $\lambda\in\mathbb{R}$:

[Figure: MSE as a function of $\lambda\in\mathbb{R}$ for @jonny's model with $p=1000$; the minimum occurs at a negative value of $\lambda$.]
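
For reference, a minimal sketch (added for illustration, not the author's code) of how such a negative-Ridge estimator can be computed via the SVD, as discussed in the comments below; it stays well defined as long as $\lambda > -s_\min^2$, where $s_\min$ is the smallest non-zero singular value of $X$:

import numpy as np

def ridge_svd(X, y, lam):
    # V diag(s / (s^2 + lam)) U^T y: standard ridge for lam > 0,
    # and still well defined for negative lam as long as lam > -min(s)^2
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    assert lam > -np.min(s) ** 2, "lambda must exceed -s_min^2"
    return Vt.T @ (s / (s**2 + lam) * (U.T @ y))

One can then scan $\lambda$ over $(-s_\min^2,\infty)$ and evaluate the test MSE, as in the figure above.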

Benoit Sanchez
  • +1 Very nice, thanks for writing this up. I think it's important to clarify that when you say "regularization" you mean $L_2$ (i.e. ridge) regularization. One would hope that lasso or elastic net can behave better and indeed that is what people are using in $p\gg n$ situations. Nobody uses pure ridge in such a setting and the standard advice is to use regularizations enforcing sparsity; so the behaviour of pure ridge might only have an academic interest. Still, it's pretty amazing that we seem to be discovering it here. Why is this not well-known?? – amoeba Feb 17 '18 at 22:02
  • Also, I don't understand your last two sentences. Do you mean cross-validated R squared? I have an example in my post with n=80 and p=1000 and r-squared around 0.4. I don't see how your claims about r-squared follow from everything above. – amoeba Feb 17 '18 at 22:04
  • Also, see my last comment under Jonny's answer. I am wondering if adding pure noise predictors to the dataset can serve as a regularization strategy similar to ridge regression. This would be SO WEIRD if true. – amoeba Feb 17 '18 at 22:07
  • I am only talking about Ridge, sure. Beware I assume all predictors are independent, which means there is A LOT of noisy information. This is not the same at all as in Jonny's example and probably not yours. Random predictors can't have the effect of regularizing. They lower the values of min-norm OLS $\|\hat\beta\|$ but don't improve the estimation. They degrade it. – Benoit Sanchez Feb 17 '18 at 22:20
  • Still don't understand last two sentences. If they relate to random X, then out-of-sample R-squared should be 0, while in-sample R-squared will be 1. – amoeba Feb 17 '18 at 22:29
  • Regarding random predictors, I am not so sure. If they lower the OLS beta, then it's exactly what shrinkage is supposed to do. – amoeba Feb 17 '18 at 22:30
  • Actually it does work. See an update to my answer. – amoeba Feb 17 '18 at 23:38
  • I did some experiments: $y$ depends on 40 first predictors and I add $p$ white noise predictors. As predicted, min-norm OLS and Ridge on all predictors perform roughly the same for large $p$. They sometimes reach some optimal $R^2$ for some $p$. It always tends towards 0 for large $p$ as predicted. The optimal $R^2$ is still far below Ridge on 40 first predictors. This is not like yours. Only difference: my 40 first predictors are independent. I'm still struggling to understand why this makes such a difference. – Benoit Sanchez Feb 18 '18 at 18:05
  • Sorry I don't fully understand your experiment from this description. Do you maybe want to update your answer with it? How exactly do you generate your `y` and `X`? When you say R^2, do you mean cross-validated R^2? – amoeba Feb 19 '18 at 09:07
  • Updated. I don't use CV since I'm not familiar with it (and I'm a bit paranoid about it in this extreme scenario :-). I use theoretical MSE/R2 on a "random new observation". It is possible to compute it easily since the "true" beta is known. – Benoit Sanchez Feb 19 '18 at 11:44
  • Thanks, I will think about it when I have a moment. But when you compute your beta-ridge and beta-ridge-40, what lambda do you use? – amoeba Feb 19 '18 at 11:45
  • Poorly hard-coded $\lambda$ proportional to $\sigma^2$. Sorry, I didn't have time for something proper. My main focus was the behaviour of min-norm OLS, to see that it differs from your example and that "some not too bad regularization" on the first 40 was violently better. – Benoit Sanchez Feb 19 '18 at 11:50
  • I did some experiments. It seems some disagreement between our results is due to two aspects. First, you are using very low noise so that actually OLS solution is about as good as the optimal ridge on the 40 predictors. Try increasing the noise e.g. to $\sigma^2=0.5$. Second, you are computing $\beta-\hat\beta$ over all $p$ dimensions, whereas only 40 dimensions are of interest (the remaining dimensions were "added" for regularization purposes). I have it wrong in the code in my answer now (will fix), but the idea was to always reconstruct $y$ using only 40 predictors. – amoeba Feb 19 '18 at 16:10
  • I can kind of replicate your results (using my cross-validation code) if I use low noise and the whole $\hat\beta$ for reconstruction, but I get much better agreement between "fake predictors" and "ridge" once I increase the noise and use only "truncated" $\hat\beta$. Could you try if it works the same for you? – amoeba Feb 19 '18 at 16:12
  • I'll check. I also got a positive result: instead of using variance 1 fake predictors, I use variance 1E-10 or 1E-20 (results are almost totally insensitive to this value as long as small). Suddenly, for large $p$, the min-norm OLS performs as good as (my) Ridge! It's crazy because this regularization method seems to have essentially no hyper parameter... – Benoit Sanchez Feb 19 '18 at 17:37
  • I think I've understood the mystery: Ridge regularization with coefficient $\lambda$ is equivalent to min-norm OLS adding $p$ fake predictors each with variance $\lambda/p$ (asymptotically for large $p$). In your data and Jonny's model it happens without doing anything, thanks to the lowest-variance components from PCA. Now I need time to find a way to explain this clearly... – Benoit Sanchez Feb 19 '18 at 21:06
  • Looking forward! I can confirm that the smaller the variance of fake predictors, the more of them I need to add to match the best ridge performance (before the performance decreases again; it does decrease if one adds too many fake predictors -- see my updated Part II). Your second statement, about what happens in my/Jonny's data, I did not quite understand. – amoeba Feb 19 '18 at 23:57
  • The proof is really great. But regarding PCA - wait a second. In my actual data X is 80x1000 so there are no 1000 PCs (like on your figure), there is only 80 of them. And all of them have relatively high variance: if I do SVD of X, then the first singular value is around 70, and the rest of the spectrum is between 10 and 60. In the Jonny's simulated data (I take the simulation as implemented in my answer), in the 80x1000 case the first singular value is 66, the rest is between 20 and 40. So yes, $y$ is predicted mainly by the leading PC(s?) but the remaining ones do not have tiny variances. – amoeba Feb 20 '18 at 11:45
  • Yes. I see what you mean. Actually what I refer to is theoretical PCA working on the distribution of $x$ and diagonalizing the theoretical covariance matrix. Somehow it's a theoretical assumption about the distribution of $x$. I have no idea how to test this assumption experimentally on 80 samples. The argument is crucially about the 1000 theoretical components: the variance is expected to drop after, say, 40 or 100 components, not necessarily before. For a model like @jonny's, you can really check this is true because the theoretical covariance matrix is known. – Benoit Sanchez Feb 20 '18 at 12:41
  • I see what you mean, but I think we still need an argument that operates on the given data matrix X which has only 80 PCs. I did a simple test: in my "Part I" simulation I reversed the order of the 80 singular values of X after doing SVD (in the 80x1000 case). I expected it to have little effect and was going to tell you that this shows that your PCA explanation is incomplete. However, it totally removed the effect! After flipping the order of singular values, the minimum-norm OLS asymptote has error > 1 (instead of < 1). – amoeba Feb 21 '18 at 16:14
  • [cont.] I experimented a bit more with rearranging S and the error of min-norm OLS increases gradually, it's not only the position of the 1st PC that matters. If I put the first 10 PCs in the end, the error is still <1 but closer to 1, etc. I feel that this is still something that needs to be understood, e.g. there is something about the PCs of the actual sample data X that influences what happens. – amoeba Feb 21 '18 at 16:17
  • Jonny's model shows that the empirical covariance matrix is very different from the theoretical one, which has one eigenvalue equal to $p+s^2$ ($s^2$ is his noise) and all the $p-1$ others equal to $s^2$. I agree it would be nice to be able to observe the cause of the phenomenon on the empirical covariance matrix (i.e. the eigenvalues/singular values). Maybe it's just about the smallest eigenvalue: if it is rather high, regularization is not needed. – Benoit Sanchez Feb 21 '18 at 20:20
  • I clarified a small point: the coefficients of the fake predictors do not increase the error much (see note at the end of the proof). It's important because in your data/jonny's they are inevitably retained. – Benoit Sanchez Feb 21 '18 at 21:03
  • Yes, this is an important point. Meanwhile, I updated my Part I with some analysis based on singular values. Unfortunately it's still quite hand-wavy... But I really don't think it's relevant to think about $p-n$ eigenvalues in case of Jonny's model: we explicitly consider the case with $n\ll p$ so the whole setting *guarantees* that we only have $n$ eigenvalues. What my experiments do show, is that your intuition that it's important that $y$ is correlated with the PC1 is correct. But what matters are the remaining $n-1$ PCs, not $p-1$ PCs. – amoeba Feb 21 '18 at 22:28
  • [cont.] Maybe one can save/adapt your argument by saying that there is PC1 and if we regress it out from the remaining $p$ predictors, we will have $p-1$ pure noise predictors left. If $p$ is large then your Theorem applies. Here remaining $p$ predictors are not PCs, but the residuals from PC1 (or maybe from several leading PCs). But I am not quite sure why it does not work the same way if $y$ is instead correlated with one of the last PCs. – amoeba Feb 21 '18 at 22:32
  • I clarified that I don't talk of classical (empirical) PCA in my answer. It can't be used as an argument as far as I understand the problem. Theoretical PCA provides a proof for jonny's model. But the relationship between empirical and theoretical PCA is very complex (https://arxiv.org/pdf/0911.3827.pdf) and goes too far for me for the moment. I just need a few more years of thinking about it :-) – Benoit Sanchez Feb 22 '18 at 12:02
  • Well, your "theoretical" PCA is just eigenanalysis of the population covariance matrix instead of the sample covariance matrix. The asymptotic theory might be complicated but the setting is pretty clear. I need to think a bit more about your PCA argument. Meanwhile, I put a new 100 bounty that will go to your answer (which remains very undervoted IMHO). Thanks for the link btw, I did not know this paper. – amoeba Feb 22 '18 at 13:00
  • Actually I like your population PCA argument now :) I think it makes sense. I am now thinking about the last paragraph of your answer. Your theorem shows that "noise regularization" will be optimal only in the limit with the variance of fake predictors going to zero. In real life (in my data as well as in Jonny's model) the variance of the "fake" predictors is not zero, e.g. in Jonny's model it is constant. Can it be that this "overshrinks" the beta and *that is* why additional ridge does not help anymore? Beta is already shrunk too much. I think you had a similar argument in an earlier edit. – amoeba Feb 23 '18 at 10:28
  • Yes, but then it seems that in this regime additional ridge penalty will not be able to help at all, which is not what you say in the last paragraph of the answer. In any case I think that this is an important aspect of this phenomenon because it explains the absence of any noticeable minimum on the MSE($\lambda$) curve which is one of the striking features of my original question. Do you maybe want to add something along the lines of your last comment to your answer? – amoeba Feb 23 '18 at 10:52
  • I tried negative Ridge. I can't believe it, but it works!!! (and not only on Jonny's model...) – Benoit Sanchez Feb 26 '18 at 15:10
  • Hahaha that's great @Benoit. How exactly did you compute expected MSE for Jonny's model? – amoeba Feb 26 '18 at 15:18
  • Empirical validation on a big generated test set. Close to theoretical MSE (that can be computed exactly, I will do it soon). Note that for the moment, this is done on a single training. I will try avg. on multiple trainings but the minimum seems to always be around -3000. – Benoit Sanchez Feb 26 '18 at 15:23
  • I am wondering how this can possibly be consistent with the theorem from the original 1970 ridge regression paper that, as far as I understand, states there is always some non-zero (positive) lambda that yields smaller MSE than lambda=0 (see my question from some years ago: https://stats.stackexchange.com/questions/122936). It might be that the theorem only holds when covariance matrix is full rank... – amoeba Feb 26 '18 at 15:28
  • How did you implement negative ridge? Formulation via $(X^\top X + \lambda I)$ does not make sense because this is not invertible when $n<p$. – amoeba Feb 27 '18 at 20:02
  • Just write the diagonal version of $X^\top X$: $n$ positive numbers, then 0s. For example, if you take $|\lambda|$ smaller than the minimal positive eigenvalue, then $X^\top X+\lambda I$ is invertible. – Benoit Sanchez Feb 27 '18 at 20:39
  • No it isn't because 0s will become $-|\lambda|<0$. – amoeba Feb 27 '18 at 20:43
  • Hum... A diagonal matrix is invertible iff all diagonal coefficients are non 0! – Benoit Sanchez Feb 27 '18 at 20:46
  • Oops :-/ Indeed. Anyway, this does not matter because these negative eigenvalues will get multiplied by zeros outside of the inverse. It only matters what happens with non-zero eigenvalues. – amoeba Feb 27 '18 at 20:50
  • Yep. Call $m$ the smallest eigenvalue. A singularity happens at $\lambda=-m$. I think the "reasonable" range for $\lambda$ is $]-m;+\infty[$. – Benoit Sanchez Feb 28 '18 at 14:24

Thanks everybody for the great ongoing discussion. The crux of the matter seems to be that minimum-norm OLS is effectively performing shrinkage that is similar to the ridge regression. This seems to occur whenever $p\gg n$. Ironically, adding pure noise predictors can even be used as a very weird form of regularization.


Part I. Demonstration with artificial data and analytical CV

@Jonny (+1) came up with a really simple artificial example that I will slightly adapt here. $X$ of $n\times p$ size and $y$ are generated such that all variables are Gaussian with unit variance, and correlation between each predictor and the response is $\rho$. I will fix $\rho=.2$.

I will use leave-one-out CV because there is an analytical expression for the squared error, known as PRESS, "predicted residual sum of squares": $$\text{PRESS} = \sum_i \left( \frac{e_i}{1-H_{ii}}\right)^2,$$ where $e_i$ are the residuals $$e = y - \hat y = y - Hy,$$ and $H$ is the hat matrix $$H = X (X^\top X + \lambda I)^{-1} X^\top=U\frac{S^2}{S^2+\lambda} U^\top$$ in terms of the SVD $X=USV^\top$. This allows me to replicate @Jonny's results without using glmnet and without actually performing cross-validation (I am plotting the ratio of PRESS to the sum of squares of $y$):

[Figure: PRESS divided by $\lVert y\rVert^2$ as a function of $\log\lambda$, for $p=10$, $100$, and $1000$.]

This analytical approach allows us to compute the limit at $\lambda\to 0$. Simply plugging $\lambda=0$ into the PRESS formula does not work: when $n<p$ and $\lambda=0$, the residuals are all zero and the hat matrix is the identity matrix with ones on the diagonal, meaning that the fractions in the PRESS equation are undefined. But if we compute the limit at $\lambda \to 0$, then it will correspond to the minimum-norm OLS solution with $\lambda=0$.

The trick is to do Taylor expansion of the hat matrix when $\lambda\to 0$: $$H=U\frac{1}{1+\lambda/S^2} U^\top\approx U(1-\lambda/S^2) U^\top = I - \lambda US^{-2}U^\top = I-\lambda G^{-1}.$$ Here I introduced Gram matrix $G=XX^\top = US^2U^\top$.

We are almost done: $$\text{PRESS} = \sum_i\Big( \frac{\lambda [G^{-1}y]_i}{\lambda G^{-1}_{ii}}\Big)^2 = \sum_i\Big( \frac{ [G^{-1}y]_i}{G^{-1}_{ii}}\Big)^2.$$ The $\lambda$ cancels out, so here we have the limiting value. I plotted it with a big black dot on the figure above (on the panels where $p>n$), and it matches perfectly.
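
As a sanity check (added here for illustration, not part of the original derivation), both the PRESS shortcut and its $\lambda\to 0$ limit can be compared against brute-force leave-one-out CV on a small random example; they should agree up to numerical error:

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 100, 1e-2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
U, s, _ = np.linalg.svd(X, full_matrices=False)

# PRESS via the hat-matrix shortcut (ridge with penalty lam)
H = U @ np.diag(s**2 / (s**2 + lam)) @ U.T
press = np.sum(((y - H @ y) / (1 - np.diag(H))) ** 2)

# its lambda -> 0 limit via the Gram matrix
Ginv = U @ np.diag(s**-2.0) @ U.T                  # (X X^T)^{-1}
press0 = np.sum((Ginv @ y / np.diag(Ginv)) ** 2)

# brute-force LOO CV for ridge and for minimum-norm OLS
loo_ridge, loo_ols = 0.0, 0.0
for i in range(n):
    m = np.arange(n) != i
    b_ridge = X[m].T @ np.linalg.solve(X[m] @ X[m].T + lam * np.eye(n - 1), y[m])
    b_ols = np.linalg.pinv(X[m]) @ y[m]
    loo_ridge += (y[i] - X[i] @ b_ridge) ** 2
    loo_ols += (y[i] - X[i] @ b_ols) ** 2

print(press, loo_ridge)    # agree
print(press0, loo_ols)     # agree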

Update Feb 21. The above formula is exact, but we can gain some insight by doing further approximations. It looks like $G^{-1}$ has approximately equal values on the diagonal even if $S$ has very unequal values (probably because $U$ mixes up all the eigenvalues pretty well). So for each $i$ we have that $G^{-1}_{ii}\approx \langle S^{-2} \rangle$ where angular brackets denote averaging. Using this approximation, we can rewrite: $$\text{PRESS}\approx \Big\lVert \frac{S^{-2}}{\langle S^{-2} \rangle}U^\top y\Big\rVert^2.$$ This approximation is shown on the figure above with red open circles.

Whether this will be larger or smaller than $\lVert y \rVert^2 = \lVert U^\top y \rVert^2$ depends on the singular values $S$. In this simulation $y$ is correlated with the first PC of $X$ so $U_1^\top y$ is large and all other terms are small. (In my real data, $y$ is also well predicted by the leading PCs.) Now, in the $p\gg n$ case, if the columns of $X$ are sufficiently random, then all singular values will be rather close to each other (rows approximately orthogonal). The "main" term $U_1^\top y$ will be multiplied by a factor less than 1. The terms towards the end will get multiplied by factors larger than 1 but not much larger. Overall the norm decreases. In contrast, in the $p\gtrsim n$ case, there will be some very small singular values. After inversion they will become large factors that will increase the overall norm.

[This argument is very hand-wavy; I hope it can be made more precise.]
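
A rough illustration of the singular-value part of this argument (added here as an illustration; it uses i.i.d. Gaussian predictors, i.e. it ignores the leading component): for $p\gg n$ the singular values of $X$ concentrate, whereas for $p$ only slightly larger than $n$ the smallest ones are tiny.

import numpy as np

rng = np.random.default_rng(0)
n = 80
for p in [100, 1000, 10000]:
    s = np.linalg.svd(rng.standard_normal((n, p)), compute_uv=False)
    print(p, round(s.max() / s.min(), 1))   # ratio shrinks towards 1 as p grows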

As a sanity check, if I swap the order of the singular values with `S = diag(flipud(diag(S)));`, then the predicted MSE is above $1$ everywhere on the 2nd and the 3rd panels.

figure('Position', [100 100 1000 300])
ps = [10, 100, 1000];

for pnum = 1:length(ps)
    rng(42)
    n = 80;
    p = ps(pnum);
    rho = .2;
    y = randn(n,1);
    X = repmat(y, [1 p])*rho + randn(n,p)*sqrt(1-rho^2);

    lambdas = exp(-10:.1:20);
    press = zeros(size(lambdas));
    [U,S,V] = svd(X, 'econ');
    % S = diag(flipud(diag(S)));   % sanity check

    for i = 1:length(lambdas)
        H = U * diag(diag(S).^2./(diag(S).^2 + lambdas(i))) * U';
        e = y - H*y;
        press(i) = sum((e ./ (1-diag(H))).^2);
    end

    subplot(1, length(ps), pnum)
    plot(log(lambdas), press/sum(y.^2))
    hold on
    title(['p = ' num2str(p)])
    plot(xlim, [1 1], 'k--')

    if p > n
        Ginv = U * diag(diag(S).^-2) * U';
        press0 = sum((Ginv*y ./ diag(Ginv)).^2);
        plot(log(lambdas(1)), press0/sum(y.^2), 'ko', 'MarkerFaceColor', [0,0,0]);

        press0approx = sum((diag(diag(S).^-2/mean(diag(S).^-2)) * U' * y).^2);
        plot(log(lambdas(1)), press0approx/sum(y.^2), 'ro');
    end
end

Part II. Adding pure noise predictors as a form of regularization

Good arguments were made by @Jonny, @Benoit, @Paul, @Dikran, and others that increasing the number of predictors will shrink the minimum-norm OLS solution. Indeed, once $p>n$, any new predictor can only decrease the norm of the minimum-norm solution. So adding predictors will push the norm down, somewhat similar to how ridge regression is penalizing the norm.

So can this be used as a regularization strategy? We start with $n=80$ and $p=40$ and then keep adding $q$ pure noise predictors as a regularization attempt. I will do LOOCV and compare it with LOOCV for the ridge (computed as above). Note that after obtaining $\hat\beta$ on the $p+q$ predictors, I am "truncating" it at $p$ because I am only interested in the original predictors.

[Figure: LOOCV error for ridge regression with $n=80$, $p=40$ (left) and for minimum-norm OLS with $q$ extra pure-noise predictors (right).]

IT WORKS!!!

In fact, one does not need to "truncate" the beta; even if I use the full beta and the full $p+q$ predictors, I can get good performance (dashed line on the right subplot). This, I think, mimics my actual data in the question: only a few predictors are truly predicting $y$, most of them are pure noise, and they serve as a regularization. In this regime additional ridge regularization does not help at all.

rng(42)
n = 80;
p = 40;
rho = .2;
y = randn(n,1);
X = repmat(y, [1 p])*rho + randn(n,p)*sqrt(1-rho^2);

lambdas = exp(-10:.1:20);
press = zeros(size(lambdas));
[U,S,V] = svd(X, 'econ');

for i = 1:length(lambdas)
    H = U * diag(diag(S).^2./(diag(S).^2 + lambdas(i))) * U';
    e = y - H*y;
    press(i) = sum((e ./ (1-diag(H))).^2);
end

figure('Position', [100 100 1000 300])
subplot(121)
plot(log(lambdas), press/sum(y.^2))
hold on
xlabel('Ridge penalty (log)')
plot(xlim, [1 1], 'k--')
title('Ridge regression (n=80, p=40)')
ylim([0 2])

ps = [0 20 40 60 80 100 200 300 400 500 1000];
error = zeros(n, length(ps));
error_trunc = zeros(n, length(ps));
for fold = 1:n
    indtrain = setdiff(1:n, fold);
    for pi = 1:length(ps)
        XX = [X randn(n,ps(pi))];
        if size(XX,2) < size(XX,1)
            beta = XX(indtrain,:) \ y(indtrain,:);
        else
            beta = pinv(XX(indtrain,:)) * y(indtrain,:);
        end
        error(fold, pi) = y(fold) - XX(fold,:) * beta;
        error_trunc(fold, pi) = y(fold) - XX(fold,1:size(X,2)) * beta(1:size(X,2));
    end
end

subplot(122)
hold on
plot(ps, sum(error.^2)/sum(y.^2), 'k.--')
plot(ps, sum(error_trunc.^2)/sum(y.^2), '.-')
legend({'Entire beta', 'Truncated beta'}, 'AutoUpdate','off')
legend boxoff
xlabel('Number of extra predictors')
title('Extra pure noise predictors')
plot(xlim, [1 1], 'k--')
ylim([0 2])
amoeba
  • @MartijnWeterings In this experiment, I start with n=80 and p=40. As the total number of predictors (p+q) approaches n=80, the problem becomes ill-conditioned and the OLS solution overfits drastically. There is an enormous peak in the error around q=40. As soon as p+q>n, the "minimum-norm" constraint kicks in and the error starts decreasing but it takes some time until it gets back to where it was with q=0. It happens around q=70, i.e. p+q=130. After that, the error is decreasing even further and this part of the plot is similar to the ridge regression plot. Does it make sense? – amoeba Feb 20 '18 at 11:17
  • @MartijnWeterings On the 1st comment: we are on the same page. On the 2nd comment: in my question I am not truncating beta, that's right. But actually if I don't truncate beta in my simulation (use `y(fold) - XX(fold,:) * beta` instead of `XX(fold,1:size(X,2)) * beta(1:size(X,2))`), then the results don't change too much. I guess I should add this to my answer. I think my original data shows this kind of behaviour. – amoeba Feb 20 '18 at 11:50
  • (1/2): I am still working my way through all of the comments and code to understand, but a thought occurs to me: is there a relationship between this phenomenon we are observing, and the relationship between ridge regression and random effects? – Ryan Simmons Feb 20 '18 at 15:23
  • (2/2): Per Randel's answer here (https://stats.stackexchange.com/questions/122062/unified-view-on-shrinkage-what-is-the-relation-if-any-between-steins-paradox), we see an estimation equivalent between random effects and ridge regression, where lambda is equal to the ratio of the residuals to the variance of the random effect. Here, per Benoit Sanchez's answer, we see that ridge regression is equivalent to adding an arbitrary number of of fake independent predictors each with variance equal to a function of lambda and the number of parameters. It seems to me there is a conceptual relationship. – Ryan Simmons Feb 20 '18 at 15:25
  • @amoeba it was a mistake. adding a scaled vector y to the matrix X does regularize somewhat but not the same as ridge regression or noise vectors. It does however make me wonder what happens when we subtract a bit of $y$ from each x in order to make every variable slightly negatively correlated (or less positive) with the y vector. This in order to perform some 'negative' regularization. That in order to 'undo' the regularization of the 1000 vectors (at some point it may become too much, as you see with the peak/optimum regularization coefficient being now almost out of range). – Sextus Empiricus Feb 23 '18 at 10:08

Here is an artificial situation where this occurs. Suppose each predictor variable is a copy of the target variable with a large amount of gaussian noise applied. The best possible model is an average of all predictor variables.

library(glmnet)
set.seed(1846)
noise <- 10
N <- 80
num.vars <- 100
target <- runif(N,-1,1)
training.data <- matrix(nrow = N, ncol = num.vars)
for(i in 1:num.vars){
  training.data[,i] <- target + rnorm(N,0,noise)
}
plot(cv.glmnet(training.data, target, alpha = 0,
               lambda = exp(seq(-10, 10, by = 0.1))))

[Figure: MSE for various lambda with 100 predictors]

100 variables behave in a "normal" way: Some positive value of lambda minimizes out of sample error.

But increase `num.vars` in the above code to 1000, and here is the new MSE path. (I extended to log(Lambda) = -100 to convince myself.)

[Figure: MSE for various lambda with 1000 predictors]

What I think is happening

When fitting a lot of parameters with low regularization, the coefficients are randomly distributed around their true value with high variance.

As the number of predictors becomes very large, the "average error" tends towards zero, and it becomes better to just let the coefficients fall where they may and sum everything up than to bias them toward 0.

I'm sure this situation of the true prediction being an average of all predictors isn't the only time this occurs, but I don't know how to begin to pinpoint the biggest necessary condition here.

EDIT:

The "flat" behavior for very low lambda will always happen, since the solution is converging to the minimum-norm OLS solution. Similarly the curve will be flat for very high lambda as the solution converges to 0. There will be no minimum iff one of those two solution is optimal.

Why is the minimum-norm OLS solution so (comparably) good in this case? I think it is related to the following behavior that I found very counter-intuitive, but on reflection makes a lot of sense.

max.beta.random <- function(num.vars){
  num.vars <- round(num.vars)
  set.seed(1846)
  noise <- 10
  N <- 80
  target <- runif(N,-1,1)
  training.data <- matrix(nrow = N, ncol = num.vars)

  for(i in 1:num.vars){
    training.data[,i] <- rnorm(N,0,noise)
  }
  udv <- svd(training.data)

  U <- udv$u
  S <- diag(udv$d)
  V <- udv$v

  beta.hat <- V %*% solve(S) %*% t(U) %*% target

  max(abs(beta.hat))
}


curve(Vectorize(max.beta.random)(x), from = 10, to = 1000, n = 50,
      xlab = "Number of Predictors", y = "Max Magnitude of Coefficients")

abline(v = 80)

[Figure: max magnitude of coefficients as the number of predictors increases]

With randomly generated predictors unrelated to the response, as p increases the coefficients become larger, but once p is much bigger than N they shrink toward zero. This also happens in my example. So very loosely, the unregularized solutions for those problems don't need shrinkage because they are already very small!

This happens for a trivial reason. $y$ can be expressed exactly as a linear combination of columns of $X$. $\hat{\beta}$ is the minimum-norm vector of coefficients. As more columns are added the norm of $\hat{\beta}$ must decrease or remain constant, because a possible linear combination is to keep the previous coefficients the same and set the new coefficients to $0$.
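
This monotonicity is easy to verify directly; here is a quick NumPy sketch (added for illustration, not part of the original answer): with nested sets of random columns, the norm of the minimum-norm solution never increases.

import numpy as np

rng = np.random.default_rng(3)
n = 80
y = rng.standard_normal(n)
Xp = rng.standard_normal((n, 100))                        # start with p = 100 > n random predictors

norms = []
for _ in range(5):
    norms.append(np.linalg.norm(np.linalg.pinv(Xp) @ y))  # ||minimum-norm solution||
    Xp = np.hstack([Xp, rng.standard_normal((n, 200))])   # append 200 more columns
print(norms)                                              # non-increasing sequence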

Jonny Lomond
  • +1. That's a really nice artificial example!! I am not really convinced by your explanation (second paragraph from the last), but I hope that this simple toy example will provide controlled conditions to study this phenomenon and eventually lead to some insights. In my experience, a lot of things about regression and ridge regression can be better understood via SVD of X, so that's the direction in which I've been thinking (but haven't really figured it out yet). – amoeba Feb 15 '18 at 22:53
  • (+1). The phenomenon thus seems to happen when predictors are correlated. It does not mean formally that the error curve has no minimum for positive $\lambda$, nor that the limit at 0 is not large. It just means that the curve tends to become flat, and that the threshold for how small $\lambda$ must be for regularization to stop working tends towards 0 for large $p$. Here this threshold goes beyond the computational limit, but Firebug's answer suggests it may always exist. – Benoit Sanchez Feb 16 '18 at 10:06
  • @BenoitSanchez I don't see any indication in this thread that there is a "threshold". One could use $\lambda=0$ (exactly zero), compute minimum-norm OLS solution, and use CV to estimate MSE. It would be instructive to supplement Jonny's answer with this direct computation, but I am very certain that there will be no difference with what `cv.glmnet` yields for 1e-10 or 1e-100. – amoeba Feb 16 '18 at 10:46
  • It was in Firebug's test with 1E-100 that he removed after your conversation. Yes, minimum norm OLS is the way to know for sure. It can be done by projecting 0 orthogonally onto the space of solutions. Unrelated: I wonder how well CV estimates the true theoretical MSE. That can be known if the true $\beta$ is known, like in this example. – Benoit Sanchez Feb 16 '18 at 12:23
  • @BenoitSanchez There may be a computational limit, but it could also be that the minimum-norm OLS solution is for some reason optimal. I made an edit with a new graph, I wonder if you think it is more or less likely that it is a computational issue. – Jonny Lomond Feb 16 '18 at 12:48
  • Why do you need `glmnet` in your update? If you only need minimum norm OLS solution then there is a direct formula (see the 2nd formula in my question) and if one computes SVD of $X=USV^\top$ then this formula becomes simply $\hat\beta=VS^{-1}U^\top y$. There is also probably a function in R that computes this solution but I don't really know R :) – amoeba Feb 16 '18 at 13:42
  • Thank you, I edited my code and graph to show the explicit solution. So it's not a computational issue. – Jonny Lomond Feb 16 '18 at 14:25
  • Cool. Very interesting discovery. I need to think about it. – amoeba Feb 16 '18 at 14:37
  • Thinking about it some more it is not surprising at all. $y$ can be expressed exactly as a linear combination of vectors in $X$. $\hat{\beta}$ is the vector of coefficients with the smallest norm. When you add a vector the norm of $\hat{\beta}$ must decrease or stay the same size, because you could keep the old coefficients the same and set the new ones to zero. – Jonny Lomond Feb 16 '18 at 15:24
  • Makes sense. Meanwhile, I remembered that there is an explicit formula for ridge regression MSE with leave-one-out CV (https://stats.stackexchange.com/questions/32542) and managed to derive the limiting expression for $\lambda\to 0$. I will probably post it as a separate answer a bit later. – amoeba Feb 16 '18 at 16:01
  • An interesting direction would be to see whether the act of choosing the minimum norm OLS solution is "too much" regularization for this problem. You could look at the performance of solutions chosen from a distribution in "OLS solution space" centered on the minimum-norm solution with a CV-chosen variance or boundary. – Jonny Lomond Feb 16 '18 at 16:23
  • Jonny's example is a good one because it has already been analyzed: see [James-Stein estimator](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). When estimating the mean of a fixed vector $\mathbf{\theta}$ with dimension 3 or greater, we can always improve upon simple averaging by biasing towards zero, which is more or less what ridge regression does. I wonder if perhaps the improvement is too slight to be observed in this case? – Paul Feb 16 '18 at 18:48
  • @Paul You are right: even for $p=1000$ the minimum is actually at some non-zero lambda (see my answer that I just posted). But it's almost indistinguishable from the minimum-norm OLS solution (asymptote to the left). We are trying to understand the conditions under which this is happening. The last part of Jonny's answer seems to suggest that it might happen *always* when $p\gg n$ which would be a rather extreme statement that I have NEVER seen in the literature. On the other hand, I don't think I've ever seen this issue discussed at all. – amoeba Feb 16 '18 at 22:27
  • No way it happens always, otherwise you could just duplicate columns of $X$ enough times to make an OLS that beats ridge. – Paul Feb 16 '18 at 22:41
  • @Paul but the statement would be that OLS beats ridge on the set of duplicated columns, not that it would outperform ridge regression on the original set of predictors. That isn't absurd. – Jonny Lomond Feb 17 '18 at 00:07
  • Whether or not it's absurd, I promise you it's not true. Go ahead and try to get OLS to beat ridge by duplicating columns of a fixed model matrix. You won't go from ridge winning to OLS winning by this method. – Paul Feb 17 '18 at 00:28
  • @Paul, the claim here is not that OLS "wins" or "beats" ridge; the claim is that OLS performs [almost] as good as ridge. That said, I agree with you that duplicating columns of X should not change the OLS performance. The coefficients *will* go down to zero though because they will "spread" across duplicates, but I don't think they will redistribute across the original X variables. I have not tested it though. – amoeba Feb 17 '18 at 08:30
  • I have tested it, Paul is right. However, if the added columns are not quite duplicate but instead correlated (I found good behaviour with correlation ~ 0.98), the behaviour arises. It does not seem to matter if there is any relationship between the target and the large number of correlated variables. – Jonny Lomond Feb 17 '18 at 13:06
  • Thanks for the new (last) chart. I realized that for the case of independent predictors (correlated or not to the response), the behaviour for large $p$ can be proven. (I did it in my own answer). The chart somehow shows the range of $p$ where regularization is useful. – Benoit Sanchez Feb 17 '18 at 17:54
  • 3
    It is a well-known fact that ridge regression is equivalent to adding $p$ extra "fake" samples to the dataset with each sample having $\sqrt{\lambda}$ value in one feature and zeros everywhere else, and all corresponding responses being zero. (It's trivial to rewrite the RR cost function in this form; a quick numerical check is sketched right after this comment thread.) I am now wondering if there is a way to add extra *features* (e.g. pure noise?) that would have a similar effect. Of course adding extra features means that $\hat\beta$ will increase its dimensionality but one could only look at its values at the "original" predictors. @Paul – amoeba Feb 17 '18 at 21:54
  • @amoeba I was planning on making a similar remark! Adding many noise features will push up the data matrix's lowest singular values by roughly a constant multiple, which is similar to what ridge regression does, but not exactly the same. It's possible that under certain circumstances this odd form of regularization performs slightly better than ridge regression. – Paul Feb 17 '18 at 22:22
  • There are several ridge-like regularization strategies for least squares that work by modifying singular values, and ridge is not always the best one. See for example this paper on truncated SVD: http://infolab.stanford.edu/pub/cstr/reports/na/m/86/36/NA-M-86-36.pdf – Paul Feb 17 '18 at 22:35
  • @Paul This does begin to look plausible, yes. Even though it sounds completely crazy: adding pure noise predictors is a textbook example of increasing overfitting. Of course if we start with $n>p$ then adding pure noise predictors *will* increase overfitting at first; but our hypothesis seems to be that it will start playing regularizing role once there is enough predictors. I think we should test it directly (take a small dataset, use $k$ added pure noise predictors as regularization parameter, and run cross-validation). I am still not quite sure what will happen. – amoeba Feb 17 '18 at 22:40
  • @Paul Regarding your link: isn't it simply [principal component regression](https://en.wikipedia.org/wiki/Principal_component_regression) (PCR)? – amoeba Feb 17 '18 at 22:41
  • Yes, it appears so, although I didn't know it by that name before! I'm still puzzled because this random-added-predictors technique looks like crude ridge regression to me, not something I'd expect to occasionally outperform ridge, like PCR/Truncated SVD. – Paul Feb 17 '18 at 22:57
  • @Paul Not sure why you are puzzled; who says that it will outperform ridge? Personally, I am still not convinced it will increase the performance at all. It might be that it won't, but that after lots of pure noise predictors are added ridge becomes useless. Or maybe it indeed will (but possibly less than ridge regression on the *original* number of features). – amoeba Feb 17 '18 at 23:05
  • My understanding was that Jonny's experiments were showing this outperformance for his particular toy problem, but perhaps I misunderstood. – Paul Feb 17 '18 at 23:10
  • 1
    @Paul I don't think Jonny claimed that. Anyway, I've implemented it. See an update to my answer. It works!! Pretty amazing. – amoeba Feb 17 '18 at 23:37
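A minimal numerical check of the data-augmentation identity mentioned in the comments above (a sketch on simulated data with arbitrary $n$, $p$ and $\lambda$, not the dataset from the question): ridge on $(X, y)$ coincides with plain OLS on $X$ stacked with $\sqrt{\lambda}\,I_p$ and $y$ padded with $p$ zeros.

# Ridge on (X, y) equals OLS on the augmented data (rbind(X, sqrt(lambda)*I_p), c(y, rep(0, p)))
set.seed(1)
n <- 30; p <- 5; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)

Xa <- rbind(X, sqrt(lambda) * diag(p))   # p extra "fake" rows, one per feature
ya <- c(y, rep(0, p))                    # with zero responses
beta_aug <- solve(t(Xa) %*% Xa, t(Xa) %*% ya)

max(abs(beta_ridge - beta_aug))          # ~ 1e-15, identical up to rounding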
7

So I decided to run nested cross-validation using the specialized mlr package in R to see what's actually coming from the modelling approach.

Code (it takes a few minutes to run on an ordinary notebook)

library(mlr)
daf = read.csv("https://pastebin.com/raw/p1cCCYBR", sep = " ", header = FALSE)

tsk = list(
  tsk1110 = makeRegrTask(id = "tsk1110", data = daf, target = colnames(daf)[1]),
  tsk500 = makeRegrTask(id = "tsk500", data = daf[, c(1,sample(ncol(daf)-1, 500)+1)], target = colnames(daf)[1]),
  tsk100 = makeRegrTask(id = "tsk100", data = daf[, c(1,sample(ncol(daf)-1, 100)+1)], target = colnames(daf)[1]),
  tsk50 = makeRegrTask(id = "tsk50", data = daf[, c(1,sample(ncol(daf)-1, 50)+1)], target = colnames(daf)[1]),
  tsk10 = makeRegrTask(id = "tsk10", data = daf[, c(1,sample(ncol(daf)-1, 10)+1)], target = colnames(daf)[1])
)

rdesc = makeResampleDesc("CV", iters = 10)
msrs = list(mse, rsq)
configureMlr(on.par.without.desc = "quiet")
bm3 = benchmark(learners = list(
    makeLearner("regr.cvglmnet", alpha = 0, lambda = c(0, exp(seq(-10, 10, length.out = 150)))),
    makeLearner("regr.glmnet", alpha = 0, lambda = c(0, exp(seq(-10, 10, length.out = 150))), s = 151)
    ), tasks = tsk, resamplings = rdesc, measures = msrs)

Results

getBMRAggrPerformances(bm3, as.df = TRUE)
#   task.id    learner.id mse.test.mean rsq.test.mean
#1    tsk10 regr.cvglmnet     1.0308055  -0.224534550
#2    tsk10   regr.glmnet     1.3685799  -0.669473387
#3   tsk100 regr.cvglmnet     0.7996823   0.031731316
#4   tsk100   regr.glmnet     1.3092522  -0.656879104
#5  tsk1110 regr.cvglmnet     0.8236786   0.009315037
#6  tsk1110   regr.glmnet     0.6866745   0.117540454
#7    tsk50 regr.cvglmnet     1.0348319  -0.188568886
#8    tsk50   regr.glmnet     2.5468091  -2.423461744
#9   tsk500 regr.cvglmnet     0.7210185   0.173851634
#10  tsk500   regr.glmnet     0.6171841   0.296530437

Both learners perform essentially the same across tasks.

So, what about the optimal lambdas?

sapply(lapply(getBMRModels(bm3, task.ids = "tsk1110")[[1]][[1]], "[[", 2), "[[", "lambda.min")
# [1] 4.539993e-05 4.539993e-05 2.442908e-01 1.398738e+00 4.539993e-05
# [6] 0.000000e+00 4.539993e-05 3.195187e-01 2.793841e-01 4.539993e-05

Notice the lambdas are already transformed. One fold even picked the minimal lambda, $\lambda = 0$.

I fiddled a bit more with glmnet and discovered that there, too, the minimal lambda is not picked. Check:

EDIT:

After comments by amoeba, it became clear the regularization path is an important step in the glmnet estimation, so the code now reflects it. This way, most discrepancies vanished.

library(glmnet); x = as.matrix(daf[, -1]); y = daf[, 1]  # assuming x, y come from daf loaded above
cvfit = cv.glmnet(x = x, y = y, alpha = 0, lambda = exp(seq(-10, 10, length.out = 150)))
plot(cvfit)

enter image description here

Conclusion

So, basically, $\lambda>0$ really improves the fit (edit: but not by much!).

How is it possible and what does it say about my dataset? Am I missing something obvious or is it indeed counter-intuitive?

We are likely nearer the true distribution of the data when setting $\lambda$ to a small value larger than zero. There's nothing counter-intuitive about it, though.

Edit: Keep in mind, though, that the ridge regularization path makes use of previous parameter estimates (warm starts) when we call glmnet, but this is beyond my expertise. If we set a really low lambda in isolation, it'll likely degrade performance.
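To make that concrete, here is a minimal sketch (assuming x and y as defined above) comparing a single tiny lambda fitted in isolation with the same lambda extracted from a full path; the two coefficient vectors can differ noticeably.

# Fit one tiny lambda on its own vs. along a full regularization path (warm starts)
fit_single = glmnet(x, y, alpha = 0, lambda = 1e-4)
fit_path   = glmnet(x, y, alpha = 0, lambda = exp(seq(10, -10, length.out = 150)))
b_single = as.vector(coef(fit_single, s = 1e-4))[-1]   # drop the intercept
b_path   = as.vector(coef(fit_path,   s = 1e-4))[-1]
max(abs(b_single - b_path))   # can be non-negligible at very small lambda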

EDIT: The lambda selection does say something more about your data. Since larger lambdas decrease performance, there must be preferential, i.e. larger, coefficients in your model, because large lambdas shrink all coefficients towards zero. Also, $\lambda\neq0$ means that the effective number of degrees of freedom in your model is smaller than the apparent number of degrees of freedom, $p$.
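For reference, a standard expression for the effective degrees of freedom of ridge regression in terms of the singular values $d_i$ of $X$ is $\mathrm{df}(\lambda)=\sum_i d_i^2/(d_i^2+\lambda)$ (see e.g. Hastie et al., The Elements of Statistical Learning). A small sketch, assuming x from above:

# Effective degrees of freedom of ridge: df(lambda) = sum_i d_i^2 / (d_i^2 + lambda)
d = svd(x)$d                                   # at most n = 80 nonzero singular values
df_ridge = function(lambda) sum(d^2 / (d^2 + lambda))
df_ridge(0)    # = rank(X), bounded by n even though p > 1000
df_ridge(10)   # strictly smaller for any lambda > 0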

How can there be any qualitative difference between p=100 and p=1000 given that both are larger than n?

$p=1000$ invariably contains at least as much information as $p=100$, and possibly more.


Comments

It seems you are getting a tiny minimum for some non-zero lambda (I am looking at your figure), but the curve is still really really flat to the left of it. So my main question remains as to why λ→0 does not noticeably overfit. I don't see an answer here yet. Do you expect this to be a general phenomenon? I.e. for any data with n≪p, lambda=0 will perform [almost] as good as optimal lambda? Or is it something special about these data? If you look above in the comments, you'll see that many people did not even believe me that it's possible.

I think you're conflating validation performance with test performance, and such comparison is not warranted.

Edit: notice, though, that when we set lambda to 0 after running the whole regularization path, performance doesn't degrade as such; therefore the regularization path is key to understanding what's going on!

Also, I don't quite understand your last line. Look at the cv.glmnet output for p=100. It will have very different shape. So what affects this shape (asymptote on the left vs. no asymptote) when p=100 or p=1000?

Let's compare the regularization paths for both:

fit1000 = glmnet(x, y, alpha = 0, lambda = exp(seq(-10,10, length.out = 1001)))
fit100 = glmnet(x[, sample(1000, 100)], y, alpha = 0, lambda = exp(seq(-10,10, length.out = 1001)))
plot(fit1000, "lambda")

enter image description here

x11()
plot(fit100, "lambda")

enter image description here

It becomes clear $p=1000$ affords larger coefficients at increasing $\lambda$, even though it has smaller coefficients for asymptotically-OLS ridge, at the left of both plots. So, basically, $p=100$ overfits at the left of the graph, and that probably explains the difference in behavior between them.

It's harder for $p=1000$ to overfit because, even though ridge shrinks coefficients towards zero, they never reach zero. This means that the predictive power of the model is shared among many more components, making it easier to predict around the mean instead of being carried away by noise.
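One way to see this 'sharing' numerically (a sketch, assuming x and y as defined above): compare the squared norm of the minimum-norm OLS solution on a random subset of 100 predictors with the one on the full design; the full design never gives a larger norm, because the subset solution padded with zeros is always an admissible solution for it.

# Minimum-norm OLS coefficients: adding columns can only decrease (or keep) the norm
library(MASS)                                      # for ginv
b_all = ginv(x) %*% y                              # all predictors
b_100 = ginv(x[, sample(ncol(x), 100)]) %*% y      # random subset of 100 predictors
c(sum(b_100^2), sum(b_all^2))                      # the full design has the smaller (or equal) norm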

Firebug
  • 15,262
  • 5
  • 60
  • 127
  • +1 Thanks for doing these experiments! It seems you are getting a tiny minimum for some non-zero lambda (I am looking at your figure), but the curve is still really really flat to the left of it. So my main question remains as to why $\lambda\to 0$ does not noticeably overfit. I don't see an answer here yet. Do you expect this to be a general phenomenon? I.e. for any data with $n\ll p$, lambda=0 will perform [almost] as good as optimal lambda? Or is it something special about these data? If you look above in the comments, you'll see that many people did not even believe me that it's possible. – amoeba Feb 15 '18 at 16:51
  • Also, I don't quite understand your last line. Look at the `cv.glmnet` output for p=100. It will have very different shape. So what affects this shape (asymptote on the left vs. no asymptote) when p=100 or p=1000? – amoeba Feb 15 '18 at 16:52
  • Do you know if `mlr` selects `lambda.min` or `lambda.1se` (in the `cv.glmnet` terminology)? – amoeba Feb 15 '18 at 18:34
  • @amoeba `lambda.min`. There's also a `regr.cvglmnet` learner, which probably allows one to select other rules. – Firebug Feb 15 '18 at 18:42
  • Thanks. To be honest I don't understand the output of your 1e-100 benchmark. E.g. for p=1100 it gives MSE=1.45. But here there is no hyperparameter tuning in the inner loop so basically one does not need inner CV loop at all. Meaning that the result should be the same as with non-nested CV at lambda=1e-100. But we see on the first figure that the MSE there is around 0.7. It does not make sense to me. – amoeba Feb 15 '18 at 20:05
  • @amoeba Hmm, I've run the `cv.glmnet` with different `lambda` sequences and it does make a difference in the result (that's likely the reason for the 0.7 figure). Now, we know `glmnet` is based on the regularization path problem, so perhaps it's due to numerical rounding at such extreme regularization parameters. – Firebug Feb 15 '18 at 20:25
  • @amoeba about the inner CV, you're correct, that's the reason there's none for that model, it runs the outer CV only. – Firebug Feb 15 '18 at 20:26
  • @amoeba Alright, I added some more edits, TLDR: I think the regularization path makes some difference in model coefficients (the code now reflects this). This is starting to creep outside my expertise, though, so perhaps you could pick up from here and add an improved, and perhaps definitive, answer? – Firebug Feb 15 '18 at 21:15
  • After the edits it mostly makes sense to me, but unfortunately I have to say that I still find the whole phenomenon as mysterious as before... In the end of the answer you show that lambda=0 with p=100 overfits (large erratic coefficients) but with p=1000 does not overfit (small coefficients that do not change over a large range of lambdas), but the question remains: WHY EXACTLY is this happening? As I said, the "common wisdom" is that p>n *always* yields bad overfitting... and this seems to be entirely wrong! – amoeba Feb 15 '18 at 22:09
5

How can (minimal norm) OLS fail to overfit?

In short:

Experimental parameters that correlate with the (unknown) parameters in the true model will be more likely to be estimated with high values in a minimal norm OLS fitting procedure. That is because they fit the 'model+noise' whereas the other parameters only fit the 'noise' (thus they explain a larger part of the model per unit of coefficient size and are therefore more likely to end up with high values in the minimal norm OLS).

This effect will reduce the amount of overfitting in a minimal norm OLS fitting procedure. The effect is more pronounced if more parameters are available since then it becomes more likely that a larger portion of the 'true model' is being incorporated in the estimate.

Longer part:
(I am not sure what to place here since the issue is not entirely clear to me, or I do not know to what precision an answer needs to address the question)

Below is an example that can be easily constructed and demonstrates the problem. The effect is not so strange and examples are easy to make.

  • I took $p=200$ sine functions (because they are orthogonal) as variables
  • created a random model with $n=50$ measurements.
    • The model is constructed with only $tm=10$ of the variables, so the other 190 of the 200 variables create the possibility of over-fitting.
    • model coefficients are randomly determined

In this example case we observe that there is some over-fitting but the coefficients of the parameters that belong to the true model have a higher value. Thus the R^2 may have some positive value.

The image below (and the code to generate it) demonstrates that the over-fitting is limited. The dots show the estimated coefficients of the model with 200 parameters. The red dots mark those parameters that are also present in the 'true model', and we see that they have higher values. Thus, there is some degree of approaching the real model and getting the R^2 above 0.

  • Note that I used a model with orthogonal variables (the sine functions). If parameters are correlated then they may occur in the model with relatively very high coefficients and become more penalized in the minimal norm OLS.
  • Note that the 'orthogonal variables' are not orthogonal when we consider the data. The inner product of $\sin(ax) \cdot \sin(bx)$ is only zero when we integrate over the entire space of $x$, and not when we only have a few samples $x$ (a small numerical check follows this list). The consequence is that even with zero noise the over-fitting will occur (and the R^2 value seems to depend on many factors, aside from noise. Of course there is the relation between $n$ and $p$, but also important is how many variables are in the true model and how many of them are in the fitting model).
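A small check of that point (a sketch using the same grid length and sample size as the code below): the sine columns are orthogonal over the full grid but not over a random subsample of it.

# Sine columns are orthogonal over the full grid, but not over a random subsample of 50 points
l <- 24000; tt <- 1:l
s1 <- sin(1 * tt / l * 2 * pi)
s2 <- sin(2 * tt / l * 2 * pi)
sum(s1 * s2)                   # ~ 0: orthogonal over the full grid
idx <- sample(tt, 50)
sum(s1[idx] * s2[idx])         # generally clearly away from 0 on a subsample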

example of over-fitting being reduced

library(MASS)

par(mar=c(5.1, 4.1, 9.1, 4.1), xpd=TRUE)

p <- 200       
l <- 24000
n <- 50
tm <- 10

# generate i sinus vectors as possible parameters
t <- c(1:l)
xm <- sapply(c(0:(p-1)), FUN = function(x) sin(x*t/l*2*pi))

# generate random model by selecting only tm parameters
sel <- sample(1:p, tm)
coef <- rnorm(tm, 2, 0.5)

# generate random data xv and yv with n samples
xv <- sample(t, n)
yv <- xm[xv, sel] %*% coef + rnorm(n, 0, 0.1)

# generate model
M <- ginv(t(xm[xv,]) %*% xm[xv,])

Bsol <- M %*% t(xm[xv,]) %*% yv
ysol <- xm[xv,] %*% Bsol

# plotting comparison of the model with the true model
plot(1:p, Bsol, ylim=c(min(Bsol,coef),max(Bsol,coef)))
points(sel, Bsol[sel], col=1, bg=2, pch=21)
points(sel,coef,pch=3,col=2)

title("comparing overfitted model (circles) with true model (crosses)",line=5)
legend(0,max(coef,Bsol)+0.55,c("all 200 estimated coefficients","the 10 estimated coefficients corresponding to true model","true coefficient values"),pch=c(21,21,3),pt.bg=c(0,2,0),col=c(1,1,2))
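To quantify the claim that the true-model parameters carry higher coefficients, one could add a one-line check after the code above (using its Bsol and sel):

# Fraction of the fitted squared coefficient norm carried by the 10 'true' variables
sum(Bsol[sel]^2) / sum(Bsol^2)   # typically well above the 10/200 expected under a uniform spread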

Truncated beta technique in relation to ridge regression

I have translated the Python code from amoeba into R and combined the two graphs. For each minimal norm OLS estimate with added noise variables I match a ridge regression estimate with the same (approximately) $l_2$-norm for the $\beta$ vector.

  • It seems like the truncated noise model does much the same (it only computes a bit more slowly, and is perhaps a bit more often slightly worse).
  • However, without the truncation the effect is much less strong.
  • This correspondence between adding parameters and the ridge penalty is not necessarily the strongest mechanism behind the absence of over-fitting. This can be seen especially in the $p=1000$ curve (in the image of the question) going to almost 0.3 while the other curves, with different $p$, don't reach this level, no matter what the ridge regression parameter is. The additional parameters, in that practical case, are not the same as a shift of the ridge parameter (and I guess that this is because the extra parameters will create a better, more complete, model).

  • The noise parameters reduce the norm on the one hand (just like ridge regression) but also introduce additional noise. Benoit Sanchez shows that in the limit of adding many, many noise parameters with smaller deviation, it eventually becomes the same as ridge regression (the growing number of noise parameters cancel each other out). But at the same time, it requires many more computations (if we increase the deviation of the noise, to allow us to use fewer parameters and speed up computation, the difference becomes larger).

Rho = 0.2 comparing truncated noise with ridge regression

Rho = 0.4 comparing truncated noise with ridge regression

Rho = 0.2 increasing the variance of the noise parameters to 2 comparing truncated noise with ridge regression

code example

# prepare the data
library(glmnet)   # for ridge regression
library(MASS)     # for ginv (minimum-norm OLS when p + extra predictors > n)
set.seed(42)
n = 80
p = 40
rho = .2
y = rnorm(n,0,1)
X = matrix(rep(y,p), ncol = p)*rho + rnorm(n*p,0,1)*(1-rho^2)

# range of variables to add
ps = c(0, 5, 10, 15, 20, 40, 45, 50, 55, 60, 70, 80, 100, 125, 150, 175, 200, 300, 400, 500, 1000)
#ps = c(0, 5, 10, 15, 20, 40, 60, 80, 100, 150, 200, 300) #,500,1000)

# variables to store output (the sse)
error   = matrix(0,nrow=n, ncol=length(ps))
error_t = matrix(0,nrow=n, ncol=length(ps))
error_s = matrix(0,nrow=n, ncol=length(ps))

# adding a progression bar
pb <- txtProgressBar(min = 0, max = n, style = 3)

# training sets formed by leaving out one measurement at a time, repeated n times
for (fold in 1:n) {
    indtrain = c(1:n)[-fold]

    # ridge regression
    beta_s <- glmnet(X[indtrain,],y[indtrain],alpha=0,lambda = 10^c(seq(-4,2,by=0.01)))$beta
    # calculate l2-norm to compare with adding variables
    l2_bs <- colSums(beta_s^2)

    for (pi in 1:length(ps)) {
        XX = cbind(X, matrix(rnorm(n*ps[pi],0,1), nrow=80))
        XXt = XX[indtrain,]

        if (p+ps[pi] < n) {
            beta = solve(t(XXt) %*% (XXt)) %*% t(XXt) %*% y[indtrain]
        }
        else {
            beta = ginv(t(XXt) %*% (XXt)) %*% t(XXt) %*% y[indtrain]
        }

        # pickout comparable ridge regression with the same l2 norm      
        l2_b <- sum(beta[1:p]^2)
        beta_shrink <- beta_s[,which.min((l2_b-l2_bs)^2)] 

        # compute errors
        error[fold, pi] = y[fold] - XX[fold,1:p] %*% beta[1:p]
        error_t[fold, pi] = y[fold] - XX[fold,] %*% beta[]
        error_s[fold, pi] = y[fold] - XX[fold,1:p] %*% beta_shrink[]
    }
    setTxtProgressBar(pb, fold) # update progression bar
}

# plotting
plot(ps,colSums(error^2)/sum(y^2) , 
     ylim = c(0,2),
     xlab ="Number of extra predictors",
     ylab ="relative sum of squared error")
lines(ps,colSums(error^2)/sum(y^2))
points(ps,colSums(error_t^2)/sum(y^2),col=2)
lines(ps,colSums(error_t^2)/sum(y^2),col=2)
points(ps,colSums(error_s^2)/sum(y^2),col=4)
lines(ps,colSums(error_s^2)/sum(y^2),col=4)

title('Extra pure noise predictors')

legend(200,2,c("complete model with p + extra predictors",
               "truncated model with p + extra predictors",
               "ridge regression with similar l2-norm",
               "idealized model uniform beta with 1/p/rho"),
       pch=c(1,1,1,NA), col=c(2,1,4,1),lt=c(1,1,1,2))

# idealized model (if we set all beta to 1/rho/p we should theoretically have a reasonably good model)
error_op <- rep(0,n)
for (fold in 1:n) {
  beta = rep(1/rho/p,p)
    error_op[fold] = y[fold] - X[fold,] %*% beta
}
id <- sum(error_op^2)/sum(y^2)
lines(range(ps),rep(id,2),lty=2)
Sextus Empiricus
  • 43,080
  • 1
  • 72
  • 161
1

If you're familiar with linear operators then you may like my answer as the most direct path to understanding the phenomenon: why doesn't least-norm regression fail outright? The reason is that your problem ($n\ll p$) is an ill-posed inverse problem, and the pseudo-inverse is one way of solving it. Regularization is an improvement, though.

This paper is probably the most compact and relevant explanation: Lorenzo Rosasco et al., Learning, Regularization and Ill-Posed Inverse Problems. They set up your regression problem as learning, see Eq. 3, where the number of parameters exceeds the number of observations: $$Ax=g_\delta,$$ where $A$ is a linear operator on a Hilbert space and $g_\delta$ is the noisy data.

Obviously, this is an ill-posed inverse problem. So, you can solve it with the SVD or the Moore-Penrose inverse, which indeed yields the least-norm solution. Thus it should not be surprising that your least-norm solution is not failing outright.

However, if you follow the paper you can see that ridge regression would be an improvement upon the above. The improvement is really a better behavior of the estimator, since the Moore-Penrose solution is not necessarily bounded.
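To spell out the connection in the notation of the question (a standard SVD argument, not taken from the paper): write $X = UDV^\top$ with singular values $d_i$ and columns $u_i$, $v_i$. Then $$\hat\beta_\text{OLS} = X^+ y = \sum_{d_i > 0} \frac{1}{d_i}\, v_i (u_i^\top y), \qquad \hat\beta_\text{ridge} = (X^\top X + \lambda I)^{-1} X^\top y = \sum_i \frac{d_i}{d_i^2 + \lambda}\, v_i (u_i^\top y).$$ The pseudo-inverse simply drops the zero singular values (so the solution cannot blow up along directions $X$ does not see), while ridge additionally damps the small nonzero $d_i$, replacing the unbounded factor $1/d_i$ by the bounded factor $d_i/(d_i^2+\lambda)$.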

UPDATE

I realized that I wasn't making it clear that ill-posed problems lead to overfitting. Here's the quote from the paper Gábor A, Banga JR. Robust and efficient parameter estimation in dynamic models of biological systems. BMC Systems Biology. 2015;9:74. doi:10.1186/s12918-015-0219-2:

The ill-conditioning of these problems typically arise from (i) models with large number of parameters (over-parametrization), (ii) experimental data scarcity and (iii) significant measurement errors [19, 40]. As a consequence, we often obtain overfitting of such kinetic models, i.e. calibrated models with reasonable fits to the available data but poor capability for generalization (low predictive value)

So, my argument can be stated as follows:

  • ill posed problems lead to overfitting
  • (n < p) case is an extremely ill-posed inverse problem
  • the Moore-Penrose pseudo-inverse (or other tools like the SVD), which you refer to in the question as $X^+$, solves an ill-posed problem
  • therefore, it takes care of overfitting at least to some extent, and it shouldn't be surprising that it doesn't completely fail the way a regular OLS would

Again, regularization is a more robust solution still.

Aksakal
  • 55,939
  • 5
  • 90
  • 176
  • 1
    (+1) Thanks, but I don't quite see how this paper is relevant. I will look at it tomorrow in more detail. Where exactly do they say that minimum norm OLS solution will not overfit or that minimum norm requirement can be seen as regularization? – amoeba Feb 21 '18 at 22:35
  • 1
    Let’s discuss when you read the paper. They don’t say the pseudo-inverse is regularization. What they say is that it is the solution to the ill-posed problem. What I’m saying is that overfitting is due to the ill-posedness of the problem, so by addressing the latter you take care of the former, albeit not as well as with regularization. – Aksakal Feb 21 '18 at 22:56
  • Re Update: the first three bullet points are completely clear to me and form the premise of my question. But I don't think that the fourth bullet point follows from them; I don't understand your "therefore". I can easily come up with *another* way of solving this ill-posed problem (e.g. find beta with the norm that is 10e+6 larger than the minimum-norm solution) that will *not* take care of overfitting. The question is why $X^+$ does. – amoeba Feb 22 '18 at 08:38
  • 1
    I think the puzzling thing is not that the minimum norm solution does not ameliorate over-fitting to some extent, but that adding more regularisation doesn't improve things further. Also why the minimum norm solution is more effective as the number of features grows larger. My intuition is that problems with more parameters need more regularisation (all things being otherwise equal) rather than less. This is a really interesting problem, and may help explan why e.g. even unregularised neural networks do not over-fit as much as you might expect. – Dikran Marsupial Feb 22 '18 at 11:50
  • @amoeba, so your question really is what causes overfitting? because you don't agree that ill posedness is the cause, you seem to state that it's just a necessary condition – Aksakal Feb 22 '18 at 12:28
  • Not sure this was said explicitly in this lengthy discussion: cross-validation does not guarantee a model is not overfit. Moreover, the mere design of cross-validation welcomes overfitting through multiple model selection. The model selection bias is always present, especially if one has so few observations compared to so many CV folds. You can show that in practice any model can be fitted well to the CV validation folds. – Alexey Burnakov Feb 22 '18 at 12:33
  • 1
    @Dikran Actually *other* forms or regularization can still improve the performance: e.g. I can improve the performance (compared to the minimum-norm OLS) with principal component regression or with elastic net. It's just that ridge regularization becomes useless. The analogy to neural networks is a fascinating thought that hasn't crossed my mind. What I *did* think about recently though, is that no wonder nobody understands why tricky deep learning things like batch normalization really work, given that even linear ridge regression from Statistics 101 can be so puzzling :-) – amoeba Feb 22 '18 at 12:53
  • 2
    This is not the main question at hand but I think that this otherwise excellent series of questions, answers, and comments got sidetracked by bringing in cross-validation. For this discussion it would be far simpler to compute the population linear predictor from the $\beta$s that were used to simulate the data, and to compute MSE of any estimator of the linear predictor. And I have seen a case where for $n \ll p$ I could not find an optimum corrected AIC for ridge regression (R `rms` package, `ols` function). But I need to re-run that using the true linear predictor as the gold standard. – Frank Harrell Feb 22 '18 at 13:54
  • @FrankHarrell This is a good point: in simulations it makes more sense to use ground truth than cross-validation. I will think how to change my answer accordingly. – amoeba Feb 22 '18 at 21:49