
On the Coursera stuff, it is all good, but the lack of tutors means some questions are not answered, and in the short time only so much can be covered. My university stats days are long behind me.

My understanding is that for determining the weights / coefficients of an N-feature / variable linear regression model, gradient descent is always used, not gradient ascent. This is because minimizing the cost is equivalent to minimizing the sum of squared errors (RSS). Or have I missed the point?

Moreover, it may fail if there are local minima.

thebluephantom

1 Answer


My understanding is that for determining the weights / coefficients of an N-feature / variable linear regression model, gradient descent is always used, not gradient ascent. This is because minimizing the cost is equivalent to minimizing the sum of squared errors (RSS). Or have I missed the point?

It depends. You get to either minimize this:

$$ \sum_i \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right)^2 $$

for which you could use gradient descent, or you could maximize this:

$$ - \sum_i \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right)^2 $$

for which you could use gradient ascent. Both problems are completely equivalent: the gradient of one is the negative of the gradient of the other.
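
As a quick illustration (a toy NumPy sketch with illustrative variable names, not anything from a production library), running gradient descent on the RSS and gradient ascent on its negative produces exactly the same sequence of iterates, since the two gradients differ only in sign:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)

def rss_gradient(beta):
    # gradient of sum_i (y_i - x_i . beta)^2 with respect to beta
    return -2 * X.T @ (y - X @ beta)

beta_desc = np.zeros(3)   # descend on the RSS
beta_asc = np.zeros(3)    # ascend on minus the RSS
step = 1e-3
for _ in range(5000):
    beta_desc = beta_desc - step * rss_gradient(beta_desc)
    beta_asc = beta_asc + step * (-rss_gradient(beta_asc))

print(np.allclose(beta_desc, beta_asc))  # True: the two updates are identical
```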

In practice, though, production software does neither of these things unless the model is fit at a massive scale. Most of your day-to-day regressions are fit by solving a system of linear equations that you get by setting the gradient to zero directly. Classical linear equation solvers, which work using matrix factorizations, are used for this.
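
Here's a rough sketch of that direct route (again toy data, NumPy only): setting the gradient of the RSS to zero gives the normal equations $X^\top X \beta = X^\top y$, which you can solve with a factorization rather than by iterating:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# QR route: factor X = QR and solve the triangular system R beta = Q'y,
# which avoids forming X'X explicitly.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# np.linalg.lstsq does the same job via an SVD-based LAPACK driver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_qr, beta_lstsq))  # True
```

Forming $X^\top X$ explicitly roughly squares the condition number of the problem, which is why the QR and SVD routes are preferred numerically.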

Moreover, it may fail if there are local minima.

One of the nice things about linear regression is that there is only one local extremum (which is, consequently, global). The only exceptional case is if your design matrix $X$ is not of full rank, in which case the solutions form an affine subspace.
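
Here's a toy illustration of that exceptional case (assuming nothing beyond NumPy): duplicating a column of $X$ puts a direction in its null space, so a whole line of coefficient vectors attains the same minimal RSS:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x, x])            # third column duplicates the second
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)

def rss(b):
    return np.sum((y - X @ b) ** 2)

beta_min, *_ = np.linalg.lstsq(X, y, rcond=None)    # one particular minimizer (minimum norm)
shift = np.array([0.0, 1.0, -1.0])                  # X @ shift == 0, a null-space direction

# Moving any distance along the null-space direction leaves the fit unchanged.
print(np.isclose(rss(beta_min), rss(beta_min + 3.7 * shift)))  # True
```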

Matthew Drury
  • So why does Stanford's Andrew Ng state that there is no guarantee that the gradient descent solution found is always correct? OK, things are clearer except for this. – thebluephantom Mar 20 '18 at 21:53
  • That I don't know. It's possible that gradient descent will not converge if your step size is too large and you end up stepping "over" the optimum repeatedly. – Matthew Drury Mar 20 '18 at 21:58
  • "One of the nice things about linear regression is that there is only one local extremum (which is, consequently, global)." Only if the optimization is strictly convex (or concave), which is not the case if the design matrix is not full rank. – Sycorax Mar 20 '18 at 22:24
  • True, good point. I'll add that to my answer. – Matthew Drury Mar 20 '18 at 23:21
  • Convergence and that much understood, but this guy Andrew Ng is a pretty smart cookie, to say the least. So I think I get it, but I cannot explain it. I think Ng was talking about a polynomial linear regression, which is neither concave nor convex. @MatthewDrury – thebluephantom Mar 21 '18 at 07:17
  • Found this; it is also a good question with answers: https://stats.stackexchange.com/questions/172900/can-gradient-descent-be-applied-to-non-convex-functions – thebluephantom Mar 21 '18 at 07:32
  • @thebluephantom "I think Ng was talking about a polynomial linear regression, which is neither concave nor convex." This is incorrect... The **loss function** of a full-rank polynomial regression design is strictly convex in the parameters being estimated! The **model** that you are estimating may be neither convex nor concave, but that is irrelevant to the estimation of the parameters. – Sycorax Mar 22 '18 at 18:28