
My understanding is that the cost function is not really part of the calculation of coefficients in OLS, which can be derived in closed form. However, it comes into play when regularization is introduced.

The method is to differentiate the cost function with respect to the estimated coefficients and set the derivative equal to zero.

The cost function would be generally expressed as:

$$J(\hat \beta)= (y - {\bf X}\hat \beta)^T(y- {\bf X} \hat \beta)= \displaystyle \sum_{i=1}^n (y_i - x_i^T\hat \beta)^2= \sum_{i=1}^n(y_i - \hat y_i)^2$$
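As a sanity check, here is a small numerical sketch (numpy, with made-up data; the variable names are mine) confirming that the matrix form and the sum-of-squares form of $J(\hat \beta)$ agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # made-up design matrix
beta_hat = rng.normal(size=3)       # an arbitrary coefficient vector
y = rng.normal(size=50)             # made-up response

resid = y - X @ beta_hat
J_matrix = resid @ resid                    # (y - X beta)^T (y - X beta)
J_sums = np.sum((y - X @ beta_hat) ** 2)    # sum_i (y_i - x_i^T beta)^2
print(np.isclose(J_matrix, J_sums))         # True
```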

Expanding the quadratic in matrix notation:

$$J(\hat \beta)= (y - {\bf X}\hat \beta)^T(y- {{\bf X} \hat \beta})= y^Ty + \color{blue}{\hat \beta^T\,X^TX\,\hat \beta} - 2y^TX\hat \beta$$
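To spell out a step the expansion skips: the two cross terms merge because $y^TX\hat \beta$ is a scalar and therefore equals its own transpose, $\hat \beta^TX^Ty$:

$$-y^TX\hat \beta - \hat \beta^TX^Ty = -2y^TX\hat \beta$$

Setting the gradient of $J$ with respect to $\hat \beta$ to zero then gives the normal equations, $X^TX\,\hat \beta = X^Ty$, which is where the closed-form solution $\hat \beta = (X^TX)^{-1}X^Ty$ comes from.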

The term in blue is the only non-scalar term remaining, and I presume that, if setting the derivative equal to zero is to yield coefficients that minimize the cost function, $\color{blue}{\hat \beta^T\,X^TX\,\hat \beta}$ must be positive definite. I know that $\color{blue}{X^TX}$ is positive semidefinite. But if all the above statements are correct, how can we prove that $\color{blue}{\hat \beta^T\,X^TX\,\hat \beta}$ is positive definite?

Antoni Parellada
  • Several comments: $\beta^TX^TX\beta$ is a scalar! It's also a positive semi-definite 1x1 matrix. One way to see that is $\beta^TX^TX\beta = (X\beta)^T(X\beta)$, and a matrix (in this case $X\beta$) times its transpose is positive semi-definite. I really don't understand your comment "My understanding is that the cost function is not really part of the calculation of coefficients in OLS, which can be derived in closed form. However, it comes into play when regularization is introduced". – Matthew Gunn Aug 19 '16 at 14:32
  • @MatthewGunn Yes, it is tentative, and I would appreciate your clarification. The thought is that you don't need to optimize a cost function with gradient descent (for example) when looking for the coefficients in OLS, it's simply $\hat \beta = (X^TX)^{-1}X^Ty$. Isn't that true? – Antoni Parellada Aug 19 '16 at 14:37
  • (1) As you know, $J$ is merely a compact way to write "sum of squares." The sum of squares of numbers is zero if and only if all the numbers are zero. Note, though, that $J$ is not a quadratic form in $\hat\beta$, so the usual meaning of "positive definite" does not apply. (2) The blue expression is a *number*. Strictly speaking, then, (a) it has to be viewed as a $1\times 1$ matrix for "positive definite" to make any sense and (b) it will be positive definite if and only if that number is positive. That's not always the case: the value could be zero. What are you actually asking, then? – whuber Aug 19 '16 at 14:38
  • @whuber I think the question is more than solved after your comments. Frankly, I didn't realize that the expression in blue resolved itself into a scalar. – Antoni Parellada Aug 19 '16 at 14:41
  • @AntoniParellada The solution to unconstrained minimization of the sum of squares can be expressed as the solution to the linear system $(X^TX) \hat{\beta} = (X^Ty)$. You're right that there's generally no need for iterative, numerical optimization routines (but there is a need for a solver of linear systems). The cost function still matters though because that's how $\hat{\beta} = (X^TX)^{-1}X^Ty$ was derived! (eg. see http://stats.stackexchange.com/a/186289/97925). – Matthew Gunn Aug 19 '16 at 14:49
  • @MatthewGunn Do you mean that as a historical fact? Because [it can be derived geometrically](http://rinterested.github.io/statistics/OLS_linear_algebra.html) since $\left(\mathbf{X^TX}\right)^{-1} \, \mathbf{X}^T$ is the projection matrix of $y$ on the column space of $X$. – Antoni Parellada Aug 19 '16 at 14:54
  • @AntoniParellada Yes, that's an equivalent way to look at it when you define the inner product as $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y}$. The inner product of the residual vector with itself (i.e. sum of squares) is minimized when the residual vector is orthogonal to the column space of $X$, which is equivalent to $\hat{y}$ being the projection of $y$ onto the column space of $X$. The projection way of thinking about ordinary least squares regression is probably the most useful. – Matthew Gunn Aug 19 '16 at 15:21
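Pulling the comment thread together, the sketch below (numpy, made-up data; the variable names are mine) checks numerically that solving the normal equations, the projection formula $(X^TX)^{-1}X^Ty$, and a generic least-squares solver all return the same $\hat{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # made-up full-rank design matrix
y = rng.normal(size=100)                 # made-up response

# 1) Solve the normal equations (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Closed-form / projection view: beta = (X^T X)^{-1} X^T y
beta_proj = np.linalg.inv(X.T @ X) @ X.T @ y

# 3) Generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_proj),
      np.allclose(beta_normal, beta_lstsq))   # True True
```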

1 Answer


Since $X\,\hat{\beta}$ is a vector, $\hat \beta^T\,X^TX\,\hat \beta$ is just $||X\,\hat{\beta}||_2^2$, a scalar, for which 'positive definite' means 'positive'. And it is positive iff $X\,\hat{\beta}\neq \mathbf{0}$. But nothing prevents $X\,\hat{\beta}$ from being $\mathbf{0}$ -- it's possible that $\hat{\beta}=\mathbf{0}$.
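For what it's worth, a quick numerical illustration of this (numpy, made-up data; the variable names are mine): the blue term equals $||X\,\hat{\beta}||_2^2$, is never negative, and is exactly zero when $\hat{\beta}=\mathbf{0}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))             # made-up design matrix
beta_hat = rng.normal(size=3)            # an arbitrary coefficient vector

quad = beta_hat @ X.T @ X @ beta_hat     # beta^T X^T X beta (a scalar)
norm_sq = np.sum((X @ beta_hat) ** 2)    # ||X beta||_2^2

print(np.isclose(quad, norm_sq))         # True: the blue term is a squared norm
print(quad >= 0)                         # True: hence non-negative
print(np.zeros(3) @ X.T @ X @ np.zeros(3))  # 0.0: it can be exactly zero
```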

Juho Kokkala