
I am trying to predict a single response from twelve explanatory variables, which are strongly correlated with one another. The correlation matrix looks as follows:

[Figure: correlation matrix of the twelve explanatory variables]

and the data have a condition number of 8889.9336. I would therefore expect ordinary linear regression to yield suboptimal results. However, it appears to perform rather well:

In [121]: reg = sklearn.linear_model.LinearRegression(fit_intercept=True)

In [122]: reg.fit(x[::2, :], y[::2])
Out[122]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [123]: reg.score(x[1::2, :], y[1::2])
Out[123]: 0.99986449992297743

In [124]: print((sqrt((reg.predict(x[1::2, :]).squeeze() - y[1::2])**2).mean()))
0.104017556147
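
For reference, a condition number like the one above can be computed directly with NumPy; a minimal sketch, assuming x.dat is the explanatory matrix linked at the end of this question (the exact value depends on the norm used):

import numpy as np

# Condition number of the explanatory matrix: by default the 2-norm,
# i.e. the ratio of the largest to the smallest singular value.
x = np.loadtxt("x.dat")
print(np.linalg.cond(x))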

When I use PLS1 including all components, performance is essentially identical to that of linear regression:

In [130]: reg2 = sklearn.cross_decomposition.PLSRegression(n_components=12, scale=False)

In [131]: reg2.fit(x[::2, :], y[::2])
Out[131]: 
PLSRegression(copy=True, max_iter=500, n_components=12, scale=False,
       tol=1e-06)

In [132]: reg2.score(x[1::2, :], y[1::2])
Out[132]: 0.99986450223986301

In [133]: print((sqrt((reg2.predict(x[1::2, :]).squeeze() - y[1::2])**2).mean()))
0.104024883567

and when I use fewer components, performance becomes worse:

In [134]: reg3 = sklearn.cross_decomposition.PLSRegression(n_components=9, scale=False)

In [136]: reg3.fit(x[::2, :], y[::2])
Out[136]: PLSRegression(copy=True, max_iter=500, n_components=9, scale=False, tol=1e-06)

In [137]: reg3.score(x[1::2, :], y[1::2])
Out[137]: 0.99979978303748307

In [138]: print((sqrt((reg3.predict(x[1::2, :]).squeeze() - y[1::2])**2).mean()))
0.124467834695
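
Rather than fixing the number of components by hand, one could also pick it by cross-validation on the training half. A minimal sketch, assuming a scikit-learn version that provides sklearn.model_selection (older versions have the same function in sklearn.cross_validation) and the same x and y as above:

import sklearn.cross_decomposition
import sklearn.model_selection

# Scan the number of PLS components with 5-fold cross-validation on the
# training half only; the default score is R^2.
for n in range(1, 13):
    reg_n = sklearn.cross_decomposition.PLSRegression(n_components=n, scale=False)
    cv_scores = sklearn.model_selection.cross_val_score(reg_n, x[::2, :], y[::2], cv=5)
    print(n, cv_scores.mean())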

With such correlations (as shown by the figure and the condition number), multiple linear regression should be suboptimal, yet partial least squares gives equal or worse results. Why doesn't my PLS1 regression perform better than ordinary linear regression?

The aim of the model is predictive; I am not trying to infer anything from regression coefficients.

In case anybody wants to have a closer look at the data, I have uploaded the explanatory matrix (1760 × 12) to https://dl.dropboxusercontent.com/u/4650900/x.dat (516 kB) and the response variable to https://dl.dropboxusercontent.com/u/4650900/y.dat (43 kB), both in ASCII.

gerrit
  • Why would you expect OLS to "yield suboptimal results"? In what sense do you mean "suboptimal"? Since your focus is on prediction, surely your response to that will refer to some set of expectations about how the values to be predicted are related to the values in your training data. What expectations are those? – whuber Mar 16 '16 at 17:49
  • @whuber I thought that multicollinearity meant that linear regression should be expected to produce wrong results. Textbooks and online sources warn against using standard multiple linear regression in the presence of multicollinearity; isn't this why other methods such as principal component regression, ridge regression, and partial least squares exist? It would be nice if I could get my RMSE under 0.1 K. I also need to prevent overfitting: in the toy example my test data have the same statistics as my training data, but I cannot safely assume this for my application. – gerrit Mar 16 '16 at 18:02
  • It will not produce "wrong" results at all. The problem is that the standard errors of the coefficient estimates become large and those estimates become strongly correlated. For prediction this is irrelevant--as you have explicitly stated in your question. You only have to make sure--as is the case even with no collinearity--that the data for which you make predictions lie within the range of the data used to fit the model. "Within the range" can be measured in terms of low leverage or, [equivalently,](http://stats.stackexchange.com/questions/199686) short Mahalanobis distance. – whuber Mar 16 '16 at 19:14
  • @whuber That is useful information and clears up a confusion that I had. If you write it as an answer, I can accept it :) – gerrit Mar 16 '16 at 19:17
  • If I understood your code, you tested on the same data that you used to fit the model. In that case there is never a problem with collinearity. The problem comes when you apply the model to new data that doesn't have exactly the same correlation structure, because then the large coefficients induced by the collinear data lead to hugely varying predictions. So I would disagree with @whuber: collinearity is a big problem for prediction, but the standard solution is easy: ridge regression. – seanv507 Mar 17 '16 at 06:55
  • @seanv507 I did not test on the same data. I divided in training data (`x[::2, :]`) and testing data (`x[1::2, :]`), which means training data are indices `(0, 2, 4, ...)` and testing data indices `(1, 3, 5, ...)`. So they are different but drawn from the same overall dataset (12-channel satellite radiometer measurements). I have also tried using measurements taken in different weather conditions for training and testing, but the conclusion remains the same. – gerrit Mar 17 '16 at 12:07
  • How embarrassing (I use python/numpy all the time). So the question is whether your coefficients are large compared to the scale of x and y. – seanv507 Mar 17 '16 at 12:27
  • @seanv507 I believe my comments, *which were carefully qualified,* are unimpeachable. All other things being the same (that is, making the usual assumptions about the validity and scope of the underlying model), collinearity will be a problem only when the values of the regressors in the prediction targets have a large Mahalanobis distance from the regressors in the training data set. Where the problems begin is when one undertakes a model selection process in which some regressors are dropped because of collinearity. – whuber Mar 17 '16 at 14:48
  • @whuber I agree your full comment was unimpeachable, but I think your careful qualifications were lost on the OP, who interpreted it as "nothing to worry about" with collinearity. I would rather say that precisely because of collinearity you are likely to have a large Mahalanobis distance for new data, because of the tiny eigenvalues associated with collinearity. So rather than testing for leverage on new data (and refusing to make a prediction if the leverage is too high), it would be better to use ridge regression when fitting the model. – seanv507 Mar 17 '16 at 20:40

1 Answer


When the number of components in PLSR reaches the number of independent variables in MLR, the two become identical: if you have 12 predictor variables and use 12 PLSR components, the solutions are the same. The collinearity in MLR leads to overfitting, which does not affect the internal error (goodness of fit). The problems with collinearity in MLR appear when you try to use the trained model to make new predictions, as becomes clear through cross-validation or external validation.
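
A quick numerical check of this (a minimal sketch, assuming x and y are the arrays from the question and a scikit-learn version with sklearn.model_selection):

import numpy as np
import sklearn.linear_model
import sklearn.cross_decomposition
import sklearn.model_selection

# With as many components as predictors, PLS regression should reproduce the
# ordinary least-squares fit up to small numerical differences.
ols = sklearn.linear_model.LinearRegression()
pls = sklearn.cross_decomposition.PLSRegression(n_components=x.shape[1], scale=False)
ols.fit(x[::2, :], y[::2])
pls.fit(x[::2, :], y[::2])
print(np.abs(np.ravel(ols.coef_) - np.ravel(pls.coef_)).max())

# Any real difference between the two only shows up out of sample,
# e.g. under cross-validation on the full data set.
print(sklearn.model_selection.cross_val_score(ols, x, y, cv=5).mean())
print(sklearn.model_selection.cross_val_score(pls, x, y, cv=5).mean())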

woodfoot
  • Welcome to the site, and thanks for providing such a simple, direct and correct answer to a long-unanswered question. I hope to see many more contributions from you on this site. – EdM Dec 06 '20 at 19:13