
I read this statement on an old true/false exam:

We can get multiple local optimum solutions if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent.

Solution: False

My question is: which part of this statement is wrong? Why is it false?

whuber
Anjela Minoeu

3 Answers


This question is interesting insofar as it exposes some connections among optimization theory, optimization methods, and statistical methods that any capable user of statistics needs to understand. Although these connections are simple and easily learned, they are subtle and often overlooked.

To summarize some ideas from the comments on other replies, I would like to point out that there are at least two ways "linear regression" can produce non-unique solutions--not just theoretically, but in practice.

Lack of identifiability

The first is when the model is not identifiable. This creates a convex but not strictly convex objective function which has multiple solutions.

Consider, for instance, regressing $z$ against $x$ and $y$ (with an intercept) for the $(x,y,z)$ data $(1,-1,0),(2,-2,-1),(3,-3,-2)$. One solution is $\hat z = 1 + y$. Another is $\hat z = 1-x$. To see that there must be multiple solutions, parameterize the model with three real parameters $(\lambda,\mu,\nu)$ and an error term $\varepsilon$ in the form

$$z = 1+\mu + (\lambda + \nu - 1)x + (\lambda -\nu)y + \varepsilon.$$

The sum of squares of residuals simplifies to

$$\operatorname{SSR} = 3\mu^2 + 24 \mu\nu + 56 \nu^2.$$

(This is a limiting case of objective functions that arise in practice, such as the one discussed at "Can the empirical Hessian of an M-estimator be indefinite?", where you can read detailed analyses and view plots of the function.)
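The simplification can be verified numerically. A short Python sketch (an illustration I am adding, not part of the original derivation) plugs the data into the parameterized model and compares the result with $3\mu^2 + 24\mu\nu + 56\nu^2$:

```python
import numpy as np

# Data from the example: (x, y, z) = (1,-1,0), (2,-2,-1), (3,-3,-2)
x = np.array([1.0, 2.0, 3.0])
y = -x
z = 1.0 - x

def ssr(mu, nu, lam):
    # Model: z = 1 + mu + (lam + nu - 1) x + (lam - nu) y + error
    fitted = 1 + mu + (lam + nu - 1) * x + (lam - nu) * y
    return np.sum((z - fitted) ** 2)

mu, nu = 0.7, -0.3
print(ssr(mu, nu, 0.0))                       # agrees with the closed form below
print(3 * mu**2 + 24 * mu * nu + 56 * nu**2)
print(ssr(mu, nu, 5.0))                       # identical: lambda drops out
```

Note that changing $\lambda$ leaves the sum of squares untouched, which is the non-identifiability in action.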

Because the coefficients of the squares ($3$ and $56$) are positive and the determinant $3\times 56 - (24/2)^2 = 24$ is positive, this is a positive-semidefinite quadratic form in $(\mu,\nu,\lambda)$. It is minimized when $\mu=\nu=0$, but $\lambda$ can have any value whatsoever. Since the objective function $\operatorname{SSR}$ does not depend on $\lambda$, neither does its gradient (or any other derivatives). Therefore, any gradient descent algorithm--if it does not make some arbitrary changes of direction--will set the solution's value of $\lambda$ to whatever the starting value was.
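This behavior is easy to demonstrate. In the Python sketch below (the step size and iteration count are arbitrary choices of mine), plain gradient descent on $\operatorname{SSR}$ drives $\mu$ and $\nu$ to zero while $\lambda$ never moves from its starting value, because $\partial \operatorname{SSR}/\partial\lambda \equiv 0$:

```python
import numpy as np

# Gradient of SSR = 3*mu^2 + 24*mu*nu + 56*nu^2 with respect to (mu, nu, lam).
# The partial derivative with respect to lam is identically zero.
def grad(theta):
    mu, nu, lam = theta
    return np.array([6 * mu + 24 * nu, 24 * mu + 112 * nu, 0.0])

def gradient_descent(start, lr=0.01, steps=5000):
    theta = np.array(start, dtype=float)
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# mu and nu converge to 0; lam stays exactly at its starting value,
# so different starts yield different "solutions".
print(gradient_descent([1.0, 1.0, 7.0]))
print(gradient_descent([1.0, 1.0, -3.0]))
```

Two runs that differ only in the starting value of $\lambda$ report two different minimizers with the same (minimal) sum of squares.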

Even when gradient descent is not used, the solution can vary. In R, for instance, there are two easy, equivalent ways to specify this model: as z ~ x + y or z ~ y + x. The first yields $\hat z = 1 - x$ but the second gives $\hat z = 1 + y$.

> x <- 1:3
> y <- -x
> z <- y+1

> lm(z ~ x + y)
Coefficients:
(Intercept)            x            y  
          1           -1           NA  


> lm(z ~ y + x)
Coefficients:
(Intercept)            y            x  
          1            1           NA 

(The NA values should be interpreted as zeros, but with a warning that multiple solutions exist. The warning was possible because of preliminary analyses performed in R that are independent of its solution method. A gradient descent method would likely not detect the possibility of multiple solutions, although a good one would warn you of some uncertainty that it had arrived at the optimum.)

Parameter constraints

Strict convexity guarantees a unique global optimum, provided the domain of the parameters is convex. Parameter restrictions can create non-convex domains, leading to multiple global solutions.

A very simple example is afforded by the problem of estimating a "mean" $\mu$ for the data $-1, 1$ subject to the restriction $|\mu| \ge 1/2$. This models a situation that is kind of the opposite of regularization methods like Ridge Regression, the Lasso, or the Elastic Net: it is insisting that a model parameter not become too small. (Various questions have appeared on this site asking how to solve regression problems with such parameter constraints, showing that they do arise in practice.)

There are two least-squares solutions to this example, both equally good. They are found by minimizing $(1-\mu)^2 + (-1-\mu)^2$ subject to the constraint $|\mu| \ge 1/2$. The two solutions are $\mu=\pm 1/2$. More than one solution can arise because the parameter restriction makes the domain $\mu \in (-\infty, -1/2]\cup [1/2, \infty)$ nonconvex:

[Plot of the sum of squares against $\mu$]

The parabola is the graph of a (strictly) convex function. The thick red part is the portion restricted to the domain of $\mu$: it has two lowest points at $\mu=\pm 1/2$, where the sum of squares is $5/2$. The rest of the parabola (shown dotted) is removed by the constraint, thereby eliminating its unique minimum from consideration.

A gradient descent method, unless it were willing to take large jumps, would likely find the "unique" solution $\mu=1/2$ when started with a positive value, and the "unique" solution $\mu=-1/2$ when started with a negative value.
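A minimal sketch of that behavior (my own illustration, assuming a simple projected-gradient step that pushes iterates out of the forbidden band onto the nearest feasible endpoint; the step size is arbitrary):

```python
# Projected gradient descent for f(mu) = (1 - mu)^2 + (-1 - mu)^2
# subject to |mu| >= 1/2. The feasible set is nonconvex, so the
# solution found depends on the sign of the starting value.

def project(mu):
    # Push points inside the forbidden band (-1/2, 1/2) to the nearest endpoint.
    if -0.5 < mu < 0.5:
        return 0.5 if mu >= 0 else -0.5
    return mu

def solve(start, lr=0.1, steps=200):
    mu = start
    for _ in range(steps):
        mu = project(mu - lr * 4 * mu)  # f'(mu) = 4*mu
    return mu

print(solve(2.0))    # 0.5
print(solve(-2.0))   # -0.5
```

Both endpoints give the same sum of squares, $5/2$, so neither run can be said to have found "the" answer.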

The same situation can occur with larger datasets and in higher dimensions (that is, with more regression parameters to fit).

whuber

  • A very simple example of a convex function that is not strictly convex and has infinitely many minima is $f(x,y) = (x-y)^2$: any point on the line $y=x$ is a minimum point. – kjetil b halvorsen Jun 13 '17 at 15:45
  • @Kjetil Thank you, that's true. The trick here is to show how such functions actually arise in regression situations. Your function is precisely the inspiration for the first example I offered. – whuber Jun 13 '17 at 15:50
  • A visual example: https://stats.stackexchange.com/a/151351/171583 – ayorgo Aug 02 '19 at 11:14

I'm afraid there is no binary answer to your question. If the linear regression objective is strictly convex (no constraints on the coefficients, no regularizer, etc.), then gradient descent has a unique solution, and it is the global optimum. Gradient descent can and will return multiple solutions if the problem is non-convex.

Although OP asks about linear regression, the example below shows that least-squares minimization, although nonlinear here (versus the linear regression OP asks about), can have multiple solutions, and gradient descent can return different ones.

I can show empirically, using a simple example, that

  1. the sum of squared errors can sometimes be non-convex and therefore have multiple solutions, and
  2. a gradient descent method can return multiple solutions.

Consider the example where you are trying to minimize the least-squares objective

$$\underset{w}{\text{minimize}}\ \sum_{i \ne j}\left(a_{ij}-\frac{w_i}{w_j}\right)^2$$

for the weights $w$. This function, although differentiable, is non-convex and can have multiple solutions. Substituting the actual values

$$a_{12}=9,\quad a_{13}=\tfrac{1}{9},\quad a_{21}=\tfrac{1}{9},\quad a_{23}=9,\quad a_{31}=9,\quad a_{32}=\tfrac{1}{9}$$

gives

$$\underset{w}{\text{minimize}}\ \left(9-\frac{w_1}{w_2}\right)^2+\left(\frac{1}{9}-\frac{w_1}{w_3}\right)^2+\left(\frac{1}{9}-\frac{w_2}{w_1}\right)^2+\left(9-\frac{w_2}{w_3}\right)^2+\left(9-\frac{w_3}{w_1}\right)^2+\left(\frac{1}{9}-\frac{w_3}{w_2}\right)^2$$

The above problem has three different solutions, all equally good:

$w = (0.670,0.242,0.080),obj = 165.2$

$w = (0.080,0.242,0.670),obj = 165.2$

$w = (0.242,0.670,0.080),obj = 165.2$

As shown above, this least-squares problem is nonconvex and has multiple solutions. It can be solved with a gradient descent method (for example, Microsoft Excel's Solver), and every time we run it we can end up with a different solution. Since gradient descent is a local optimizer and can get stuck in a local solution, we need to try different starting values to find the true global optimum: the result of a problem like this depends on the starting values.
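One way to see why several equally good minimizers must exist here (an observation I am adding, not from the original answer): the objective is unchanged when $(w_1,w_2,w_3)$ is cyclically permuted, so any minimizer yields two more by shifting its coordinates. A quick Python check:

```python
# The objective from the answer. Cyclically permuting (w1, w2, w3)
# maps each of the six terms onto another one, so f is invariant --
# which is why three distinct minimizers share one objective value.
def f(w):
    w1, w2, w3 = w
    return ((9 - w1/w2)**2 + (1/9 - w1/w3)**2 +
            (1/9 - w2/w1)**2 + (9 - w2/w3)**2 +
            (9 - w3/w1)**2 + (1/9 - w3/w2)**2)

w = (0.670, 0.242, 0.080)
print(f(w))                      # same value for every cyclic shift
print(f((w[1], w[2], w[0])))
print(f((w[2], w[0], w[1])))
```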

forecaster

  • I don't think this answers OP's question because OP asks specifically about *linear regression*, not optimization in general. – Sycorax Mar 30 '15 at 19:36
  • No it does not, but I am just trying to make a point about problems with optimizers; will update with caveats. – forecaster Mar 30 '15 at 19:37
  • @user777 you are right. This is a very valid question from an old MIT exam. I'm sure the answer is False, with thanks to forecaster. – Anjela Minoeu Mar 30 '15 at 19:38
  • So are you sure that I am right? – Anjela Minoeu Mar 30 '15 at 20:02
  • @AnjelaMinoeu, I have updated my response. – forecaster Mar 30 '15 at 20:32
  • You mean that for a linear regression problem minimizing the sum of squared errors using gradient descent, with convexity there is one global optimum and otherwise multiple? Am I right? – Anjela Minoeu Mar 30 '15 at 20:53
  • Yes. The key thing missing in your question is "strictly convex". If "strictly convex" is present, then it is straightforward: the answer is False. Since it is not, there is no binary response. – forecaster Mar 30 '15 at 21:11

This is because the objective function you are minimizing is convex: it has only one minimum, so the local optimum is also the global optimum. Gradient descent will eventually find that solution.

Why is this objective function convex? That is the beauty of using the squared error for minimization: taking the derivative and setting it equal to zero shows nicely why this is the case. It is a standard textbook problem and is covered almost everywhere.
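For reference, the textbook argument can be sketched in the usual matrix notation (which this answer alludes to but does not write out):

$$\operatorname{SSR}(\beta)=\|y-X\beta\|^2,\qquad \nabla\operatorname{SSR}(\beta)=-2X^\top(y-X\beta),\qquad \nabla^2\operatorname{SSR}(\beta)=2X^\top X.$$

Because $X^\top X$ is always positive semidefinite, $\operatorname{SSR}$ is convex; it is *strictly* convex (with a unique minimizer) only when $X$ has full column rank. Setting the gradient to zero yields the normal equations $X^\top X\,\beta = X^\top y$.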

Vladislavs Dovgalecs

  • Convexity does not imply a unique minimum. Typically you need to appeal to *strict* convexity of an objective function *defined on a convex domain.* Also an issue here are the termination criteria for gradient descent using floating point arithmetic: even when the objective function is strictly convex, the algorithm is *likely* to find different solutions (depending on starting values) when the function is nearly flat near its minimum. – whuber Mar 30 '15 at 16:09
  • @whuber would you please make it simpler and clear for me? – Anjela Minoeu Mar 30 '15 at 16:18
  • @whuber I think the first issue is the use of terminology. Second, convexity does imply a unique minimum. I can't see a differentiable concave function which does not have a single minimum/maximum. See proof here: http://planetmath.org/localminimumofconvexfunctionisnecessarilyglobal – Vladislavs Dovgalecs Mar 30 '15 at 16:18
  • 3
    I haven't bothered to read the proof, because it *must* invoke strict convexity to be correct. A least-squares problem with unidentifiable coefficients will be convex but not strictly convex, and thereby will have (infinitely) many solutions. But that's not completely relevant to gradient descent, which has its own problems--some of which are clearly discussed in the [Wikipedia article](http://en.wikipedia.org/wiki/Gradient_descent#Limitations). Thus, in both theoretical and practical senses, the correct answer to the question is *true*: gradient descent can--and will--give multiple solutions. – whuber Mar 30 '15 at 16:41
  • @whuber Yes, the proof appeals to the strict convexity. – Vladislavs Dovgalecs Mar 30 '15 at 17:20
  • @Whuber is correct. The answer to this question is True. Xeon, you raise the question `"Why this objective function is convex? This is the beauty of using the squared error for minimization"`. There are instances where you can have a squared error that is non-convex AND gradient descent will find multiple solutions if you run it multiple times. – forecaster Mar 30 '15 at 17:32
  • @forecaster Not trying to save face or just be annoying, but can you please provide one example where a squared error is non-convex? – Vladislavs Dovgalecs Mar 30 '15 at 17:59
  • @xeon Others say this statement is true? You say it is false and agree with me? Am I right? – Anjela Minoeu Mar 30 '15 at 19:18
  • I agree with you about that, xeon: a sum of squared errors will always be a convex function (but not necessarily strictly convex). A subtler issue concerns the domain of definition: as soon as any restrictions are made on the parameters--such as requiring them to be non-negative, or bounded by a given value, etc.--then the convexity (or lack thereof) of the domain plays an important role, too. Arguably one is still doing a "linear regression problem" in such a circumstance--although I doubt this broader meaning was the one intended in that exam. – whuber Mar 30 '15 at 19:52
  • @AnjelaMinoeu I would've answered to that exam question with False. – Vladislavs Dovgalecs Mar 30 '15 at 20:47
  • @whuber I agree when it also concerns the domain of definition: it must be convex too. All this makes me revise the basics, which is good if I learn something better. – Vladislavs Dovgalecs Mar 30 '15 at 20:49
  • You mean that for a linear regression problem minimizing the sum of squared errors using gradient descent, with convexity there is one global optimum and otherwise multiple? Am I right? – Anjela Minoeu Mar 30 '15 at 20:54
  • *In theory,* if gradient descent could be carried out infinitely far (which is obviously impossible) using exact calculations with infinite precision (which is impossible), then it would converge to a unique value in any unconstrained linear regression problem with identifiable parameters. Because gradient descent is a *practical* method, however, none of this matters. In practice, its answer will depend on the starting value used for the search and it will not be exactly the correct value, nor will it be as close to the correct value as some other methods can produce. – whuber Mar 30 '15 at 21:02
  • @whuber Even if gradient descent is carried out (for an infinite number of steps) in infinite precision on a strictly convex objective function subject to convex constraints, it may fail to converge to anything. – Mark L. Stone Jul 02 '17 at 23:52
  • @Mark I see you are correct: I had implicitly assumed that all gradients are nonzero (and, indeed, sufficiently large) except at a unique global optimum. I don't think that vitiates the point I was making, though. – whuber Jul 03 '17 at 01:04