
Suppose I want to regress from an N-dimensional space to a 1-dimensional variable. I know that we can calculate the regression coefficients with $\beta = (\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\mathbf{y}$, and an alternative is to optimize $\beta$ in the N-dimensional parameter space with a grid search or a gradient-descent method.
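
To make the setup concrete, here is a minimal numpy sketch of the closed-form computation I have in mind (the data is made up purely for illustration):

```python
import numpy as np

# Toy data, purely for illustration: n samples, N = 3 features
rng = np.random.default_rng(0)
n, N = 100, 3
X = rng.normal(size=(n, N))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

# Closed-form least squares: beta = (X^T X)^{-1} X^T y
# (solving the normal equations rather than forming the inverse explicitly)
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
```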

My question is: for the linear case, the least-squares method solves for the best $\beta$ analytically, so gradient descent cannot find a "better" solution, right?

P.S.: "Better" is defined with the same performance measure, i.e. the sum of squared errors.

Moreover, for non-linear regression (quadratic, polynomial, or in another kernel space such as a Gaussian one), we can always represent the data matrix $X$ with the corresponding features, so we can again compute a linear regression in that feature space, right?
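
For instance (a rough sketch with made-up data), a quadratic fit in one variable is just linear regression on expanded features:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)
y = 1.5 * x**2 - x + 2 + 0.2 * rng.normal(size=100)

# Map the 1-D input to polynomial features; the model stays linear
# in the coefficients, so ordinary least squares still applies.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [2, -1, 1.5]
```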

So given that we do not have a very big dataset (i.e. matrix inversion is not a problem in terms of computational cost), does gradient descent have any advantage over the least-squares solution in terms of accuracy?

One detail I can think of is that a solution with less error would indicate overfitting, but I don't care about that for now. So, even if it is an over-fitted one, can gradient descent find a "better" (see above) solution than least squares?

jeff
  • I think the key is choosing the right objective function. For a particular objective function, least squares is the optimal solution. Choose a different objective function and you might get coefficients that are more robust. Alternately, you could use the same objective, but put constraints on the coefficients, like an L1 norm. – John Sep 18 '15 at 18:04
  • http://stats.stackexchange.com/questions/160179/do-we-need-gradient-descent-to-find-the-coefficients-of-a-linear-regression-mode/164164#164164 – Mark L. Stone Sep 18 '15 at 18:05
  • @John thanks for your answer. Can you give some more detail about the constraints? Do you mean the L1 norm of $\beta$ should be lower than a threshold? And what about the objective functions? I cannot think of any objective function other than the sum of errors' squares. Mark; I'm also coming from Andrew Ng's course :) That thread answers my question for the linear case. But what about non-linear regression? I still cannot see how it can be better than LS when we represent the same problem with the features from non-linear spaces. – jeff Sep 18 '15 at 18:07
  • Pick an objective function, least squares, or something else. Add whatever constraints you want. Then solve it with high quality nonlinear optimization software, possibly, but need not be, specialized to nonlinear least squares. Don't try to write your own. Andrew Ng has done a tremendous disservice - his lectures are appalling. If your model is a linear combination of features from a nonlinear space, then that's a linear model. If at least one parameter to be estimated appears nonlinearly, that's a nonlinear model, so use nonlinear least squares, or more generally nonlinear optimization. – Mark L. Stone Sep 18 '15 at 18:26
  • You could consider least absolute error as a different error function which could require gradient descent as a solution. Least squares is the maximum likelihood estimator for the cost function "sum of squared errors" and is thus a closed form solution. – aplassard Sep 18 '15 at 18:30
  • Least absolute error most assuredly does NOT require gradient descent. Least absolute error of a linearly parameterized problem can be solved as a Linear Programming problem, using mature, efficient, reliable Linear Programming software/algorithms. – Mark L. Stone Sep 18 '15 at 18:37
  • @MarkL.Stone I still don't understand the part about linearity. So if one of my features is, let's say ${x_1}^3 * {x_2}^2$, but if I call this $z$ and learn a linear regression from $z$ to $y$, then would this be considered a linear model? – jeff Sep 18 '15 at 19:48
  • Yes, if that's an additive part of your model, i.e., b times that, and you are estimating (learning) b. It is only when at least one parameter to be estimated (learned) enters the model nonlinearly that you will not be able to use linear least squares. – Mark L. Stone Sep 18 '15 at 20:42
  • So IIUC, I can convert any nonlinear regression problem into a linear one? If so, is there a disadvantage of this approach? – jeff Sep 18 '15 at 21:17
  • You cannot convert any nonlinear regression problem into a linear one. Some nonlinear models can be converted to linear, such as by taking the log of both sides of y = a*exp(b*x), but if so, the error distribution is transformed, so they are not equivalent. Other nonlinear models cannot be transformed to linear. – Mark L. Stone Sep 18 '15 at 21:33

2 Answers


No.

These two methods both solve the same problem: minimizing the sum-of-squares error. One method is much faster than the other, but they both arrive at the same answer.

This would be akin to asking "which gives a better answer to 10/4: long division or a calculator?"
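
To make this concrete, here is a small sketch on made-up data (plain full-batch gradient descent, with an arbitrary step size and iteration count): both routes land on essentially the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + 0.05 * rng.normal(size=200)

# Direct least-squares solution
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain gradient descent on the same sum-of-squares objective
beta_gd = np.zeros(X.shape[1])
lr = 0.01
for _ in range(5000):
    grad = 2 * X.T @ (X @ beta_gd - y) / len(y)
    beta_gd -= lr * grad

print(np.allclose(beta_ls, beta_gd, atol=1e-4))  # True: same minimizer
```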

Cliff AB
  • Those are two of the worst and most unreliable methods to solve a least squares problem. They may both get it wrong, especially for ill-conditioned problems. – Mark L. Stone Sep 18 '15 at 18:34
  • As for forming the Normal equations - hey look Ma, I squared the condition number. (BTW, squaring the condition number is a bad thing to do.) – Mark L. Stone Sep 18 '15 at 18:40
  • I agree that gradient descent is a bad algorithm to use. I'm not sure that I agree that $\hat \beta = (X^T X)^{-1} X Y$ is a bad solution. But I believe you are implying that I computed $(X^T X)^{-1}X Y$ in an unstable way. I (intentionally) never specified *how* to approximate this solution. If that value was computed *exactly* (of course assuming $(X^T X)^{-1}$ exists...), this would be the solution. – Cliff AB Sep 18 '15 at 18:43
  • Please advise me which of the zero available computers having infinite precision floating point arithmetic you will be using. – Mark L. Stone Sep 18 '15 at 18:45
  • You're exactly right: I would use an approximation of $(X^T X)^{-1} X^T Y$. Exactly as I said above (yes, I left out a transpose). – Cliff AB Sep 18 '15 at 18:48
  • Yes, and your approximation may have a large error, especially if X is ill-conditioned, as is VERY common in least squares. – Mark L. Stone Sep 18 '15 at 18:55
  • My point is that if $X^T X$ is full rank, the OLS solution is $(X^T X)^{-1} X^T Y$. Any approximation of the OLS solution is then an approximation of $(X^T X)^{-1} X^T Y$. Of course, some approximations are better than others, but at no point have I said what the best approximation to use is. – Cliff AB Sep 18 '15 at 19:01
  • $\mathbf{X}^{\rm T}\mathbf{X}$ needs to be numerically of full rank, not full rank in infinite precision. – Mark L. Stone Sep 18 '15 at 19:11

OLS is the BLUE (best linear unbiased estimator) only when all of the Gauss-Markov assumptions are met: a linear model, exogeneity, homoscedasticity, and independent, identically distributed errors. Without linearity, we can still find a local minimum using gradient descent, preferably stochastic gradient descent with momentum. As for finding a better solution than OLS, the real question is whether OLS is what you want in the first place: if its assumptions are not met, you might instead need weighted least squares, GLS, lasso regression, or ridge regression. Which model you choose depends on which assumptions you violate and how.
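
As a rough sketch of the iterative route mentioned above (stochastic gradient descent with momentum on a plain least-squares objective; the data, learning rate, momentum, and batch size are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=500)

beta = np.zeros(5)
velocity = np.zeros(5)
lr, momentum, batch = 0.01, 0.9, 32

for epoch in range(200):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch):
        idx = order[start:start + batch]
        # Gradient of the mean squared error on this mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ beta - y[idx]) / len(idx)
        velocity = momentum * velocity - lr * grad
        beta += velocity

# With well-behaved data like this, SGD lands close to the OLS solution
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta - beta_ols, 3))
```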

rigo