Why is least squares performing as well as ridge regression when there is multicollinearity?

Question

I am learning about ridge regression, so I am implementing it in MATLAB as practice. However, I am having trouble finding a structure of data where ridge regression performs better than an ordinary least squares.

Reading up I've found that data that is collinear is often better to be regularized. However when I implemented this in the below code least squares is performing just a well as ridge regression (the best lambda parameter is in the range e-10, almost non-existent!). MATLAB tells me that X is rank deficient (rank=2) when using the built in function for least squares, however it still performs well?

I was wondering if anyone knew why this was performing this way, is my data perhaps not collinear enough to show a real performance difference, or have I misunderstood something?

% Generate data;
clear;
Nt = 100;
X(:,1) = randn(Nt,1);
X(:,2) =  2*X(:,1) + 6;
X(:,3) = 12*X(:,2) + 16;
p=[0.74,3,4.5];
y = X*p' + randn(Nt,1);

% Least Squares;
pLS = X\y;
%pLS = pinv(X'*X)*(X'*y);
nmseN =  sum((X*pLS-y).^2)/length(y)/var(y);

% Tikhonov;
lspace     = logspace(-10,-1,1000);
bestNMSE   = inf;
bestLambda = -1;
I=eye(size(X, 2));
for lambda=1:length(lspace)
  prLS = pinv(X'*X + lspace(lambda)*(I'*I))*(X'*y);

  nmse = sum((X*prLS-y).^2)/length(y)/var(y);
  if nmse<bestNMSE
    bestNMSE=nmse;
    bestLambda=lspace(lambda);
  end
end
prLS = pinv(X'*X + bestLambda*(I'*I))*(X'*y);
nmseR =  sum((X*prLS-y).^2)/length(y)/var(y);

I haven't used MATLAB in years, so I'm not sure, but there may be some issues w/ your code. I did add `;` to the end of some lines. You also seem to have a single quote / apostrophe `'` as a transpose operator, but it seems to be interpreted as a quote instead. Please ensure that your code is / is still correct. — gung - Reinstate Monica, Jun 12 '14 at 13:51
The semi-colons were just missing so that I could see the outputs. On this page the single quote does seem to seen as a quote, however I've put it back into matlab and it is performing the transpose. I don't think there is a problem with the code as such, I think it might be the structure of the data is not appropriate for this situation. — Andrew, Jun 12 '14 at 14:32
**Hint**: Your code appears to be comparing the two using *in-sample* mean-squared error. Why wouldn't OLS be winning that contest?! — cardinal, Jun 12 '14 at 14:49
Ah, I see what you mean! I've now split the data into independent train/test set, and the test NMSE of the regularized solution is much much better than that the normal solution. — Andrew, Jun 12 '14 at 15:41
Disregard my last comment. I took your comment to mean that the error was due to testing on data used for training. After separating the samples, I believed the problem fixed, however I found out that had been due to a error I introduced. Having fixed that, I am still finding that the least squares training error is the same as the regularized least squares training error, and the least squares testing error is the same as the regularized least squares testing error. What I will do now is cross validation on the training data to find the best lambda instead of just a logspace. — Andrew, Jun 12 '14 at 16:07
look at sepal length and width for the Fischer Iris data. Fit the line of discrimination of classes. There are mis-labeled samples. Least squares is not robust, ridge is more robust. Look at cases where robust fitting is important. — EngrStudent, Jul 08 '14 at 18:09
Please do not use inverse on matrix !!!!, use [mldivide](http://www.mathworks.com/help/matlab/ref/mldivide.html) for numerical reasons! [See Don't invert that matrix](https://www.r-bloggers.com/dont-invert-that-matrix-why-and-how/) — Haitao Du, Sep 25 '16 at 04:02

score 2 · Answer 1 · answered Oct 29 '16 at 03:52

To echo Cardinal's comment, ols will always perform the best of any linear method at describing a given data set. Per se, that's the definition of ols. The point of regularized regression is to improve prediction accuracy. For an example of how regularization (and other techniques) can improve predictive accuracy, you could have a look at "Introduction to Statistical Learning with R". In the chapter(6) in which regularized regression is introduced, the use the 'hitters' data set to show various models that are better at prediction than ordinary regression. Chapter 6, labs 1 and 2 go through various methods.

score 1 · Answer 2 · answered Sep 25 '16 at 03:48

This is an old question, but I in case someone else has the same confusion I will try to answer as brief as I can.

The Ordinary Least Squares (OLS) solution simply finds, as an estimate of $P$, the $P'$ that minimizes $|| XP - y ||_2 $. (following your notation in the code)

So when you are only looking at nmse = sum((X*prLS-y).^2)/length(y)/var(y); it will be better than any other estimate, including ridge regression. (Because it is minimizing just that.)

One point to keep in mind is that any regularization added to OLS, including ridge regression, works best for a certain situation. For example if you add a norm 1 penalty $|| XP - y ||_2 + \lambda||P||_1$, it will be more accurate for sparse $P$. I believe it would be more reasonable to look at $||P-P'||_(some-norm)$ here.

Why is least squares performing as well as ridge regression when there is multicollinearity?

2 Answers2

Linked