
I was reading Wikipedia's article on linear regression and I realized that I don't quite get the same result for the normal equation:

$$\beta = (X^{T}X)^{-1}X^{T}{\bf{y}} = (\frac{1}{n}\sum {\bf{x}}_{i}{\bf{x}}_{i}^{T})^{-1}(\frac{1}{n}\sum {\bf{x}}_{i}{{y}}_{i})$$

where:

$${\bf{y}}=\begin{pmatrix}y_{1}\\y_{2}\\ \vdots\\ y_{n}\end{pmatrix} \qquad {\bf{X}}=\begin{pmatrix}{\bf{x}}_{1}^{T}\\{\bf{x}}_{2}^{T}\\ \vdots\\ {\bf{x}}_{n}^{T}\end{pmatrix} \qquad {\bf{\beta}}=\begin{pmatrix}\beta_{1}\\\beta_{2}\\ \vdots\\ \beta_{p}\end{pmatrix}$$
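
(As a quick sanity check on how I read the formula, here is a minimal NumPy sketch with made-up random data; the variable names are only for illustration. It just confirms numerically that the matrix form and the averaged sum form above agree.)

```python
import numpy as np

# Made-up data just to check the algebra: n observations, p predictors.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))   # row i is x_i^T
y = rng.normal(size=n)

# Matrix form: beta = (X^T X)^{-1} X^T y
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Averaged sum form: (1/n sum x_i x_i^T)^{-1} (1/n sum x_i y_i)
S = sum(np.outer(x, x) for x in X) / n
m = sum(x * y_i for x, y_i in zip(X, y)) / n
beta_sums = np.linalg.solve(S, m)

print(np.allclose(beta_matrix, beta_sums))  # True
```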

Suppose we want to minimize an error function $E_{D}$ over the parameters $\beta$:

$$E_{D} = \frac{1}{2} \sum_{n=1}^{N} \{y_{n}-\beta^{T}x_{n}\}^{2}$$

so we use the gradient on $\beta$ in order to obtain:

$$\nabla E_{D} = \sum_{n=1}^{N} \{y_{n}-\beta^{T}x_{n}\}x_{n}^{T}=0$$

and solving for $\beta$

$$\beta^{T}=\frac{\sum_{n=1}^{N}y_{n}x_{n}^{T}}{\sum_{n=1}^{N}x_{n}x_{n}^{T}}$$

Taking $\bf{X}$ as described before, we find the correct denominator because

$${\bf{X}}^{T}{\bf{X}}=( {\bf{x}}_{1} \thinspace {\bf{x}}_{2} \cdots {\bf{x}}_{N} ) \begin{pmatrix}{\bf{x}}_{1}^{T}\\{\bf{x}}_{2}^{T}\\ \vdots\\ {\bf{x}}_{N}^{T}\end{pmatrix}=\sum_{n=1}^{N}{\bf{x}}_{n}{\bf{x}}_{n}^{T}$$

However, in the numerator my calculation produces a row vector instead of a column vector. What did I get wrong?

2 Answers


Note: In an attempt to add more details to my question, I ended up with what may be a correct answer. Further comments would be highly appreciated.

Starting from:

$$\nabla E_{D} = \sum_{n=1}^{N} \{y_{n}-\beta^{T}x_{n}\}x_{n}^{T}=0$$

Simplifying a bit:

$$\sum_{n=1}^{N}y_{n}x_{n}^{T}=\beta^{T}\left(\sum_{n=1}^{N}x_{n}x_{n}^{T}\right)$$

The term inside the parentheses is a sum of matrices, so, assuming its inverse exists:

$$\beta^{T}=\left(\sum_{n=1}^{N}y_{n}x_{n}^{T}\right)\left(\sum_{n=1}^{N}x_{n}x_{n}^{T}\right)^{-1}$$

Taking the transpose of both sides (and using the fact that $\sum_{n=1}^{N}x_{n}x_{n}^{T}$ is symmetric, so its inverse is symmetric as well):

$$\beta=\left(\sum_{n=1}^{N}x_{n}x_{n}^{T}\right)^{-1}\sum_{n=1}^{N}x_{n}y_{n}$$
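
(A quick numerical check of this last line, a minimal NumPy sketch with made-up random data, nothing specific to the problem above: the sum form agrees with what a standard least-squares solver returns.)

```python
import numpy as np

# Random made-up data: N observations, p predictors.
rng = np.random.default_rng(1)
N, p = 50, 4
X = rng.normal(size=(N, p))   # row n is x_n^T
y = rng.normal(size=N)

# beta = (sum_n x_n x_n^T)^{-1} (sum_n x_n y_n)
A = sum(np.outer(x, x) for x in X)
b = sum(x * y_n for x, y_n in zip(X, y))
beta_sum_form = np.linalg.solve(A, b)

# Compare against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_sum_form, beta_lstsq))  # True
```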

Taking $\bf{X}$ as described before, we find the correct denominator because

$${\bf{X}}^{T}{\bf{X}}=( {\bf{x}}_{1} \thinspace {\bf{x}}_{2} \cdots {\bf{x}}_{N} ) \begin{pmatrix}{\bf{x}}_{1}^{T}\\{\bf{x}}_{2}^{T}\\ \vdots\\ {\bf{x}}_{N}^{T}\end{pmatrix}=\sum_{n=1}^{N}{\bf{x}}_{n}{\bf{x}}_{n}^{T}$$

and

$${\bf{X}}^{T}{\bf{y}}=( {\bf{x}}_{1} \thinspace {\bf{x}}_{2} \cdots {\bf{x}}_{N} ) \begin{pmatrix}y_{1}\\y_{2}\\ \vdots\\ y_{N}\end{pmatrix}={\bf{x}}_{1}y_{1}+{\bf{x}}_{2}y_{2}+\cdots+{\bf{x}}_{N}y_{N}=\sum_{n=1}^{N}{\bf{x}}_{n}y_{n}$$

which are exactly the two pieces of the equation:

$$\beta = (X^{T}X)^{-1}X^{T}{\bf{y}}$$
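
(The two building-block identities can also be checked numerically; a small NumPy sketch with made-up data:)

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 3
X = rng.normal(size=(N, p))   # row n is x_n^T
y = rng.normal(size=N)

XtX_as_sum = sum(np.outer(x, x) for x in X)        # sum_n x_n x_n^T
Xty_as_sum = sum(x * y_n for x, y_n in zip(X, y))  # sum_n x_n y_n

print(np.allclose(X.T @ X, XtX_as_sum))  # True
print(np.allclose(X.T @ y, Xty_as_sum))  # True
```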

Looks like I got the correct result. Can you spot a mistake? I still find these calculations somewhat uncomfortable to deal with, particularly getting the derivatives right.


I think I found the problem: when you went from the gradient to solving for $\beta$, you didn't reverse the order of the terms.

The inversion should have been

$$(x_nx^T_n)^{-1}=\frac{1}{x^T_nx_n},$$

so that

$$\beta^T = \frac{\sum y_nx_n^T}{\sum x^T_nx_n}.$$

See *The Matrix Cookbook* (PDF), page 5.
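
(For what it's worth, here is a minimal NumPy sketch with made-up data for the single-predictor, no-intercept case, where $x_{n}$ is a scalar, so $x_{n}x_{n}^{T}=x_{n}^{T}x_{n}=x_{n}^{2}$ and the division is well defined; it confirms that the resulting formula matches the matrix form.)

```python
import numpy as np

# Made-up data: one scalar predictor, regression through the origin.
rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.1, size=30)

# beta = sum_n y_n x_n / sum_n x_n^2
beta_scalar = np.sum(y * x) / np.sum(x * x)

# Same thing via the matrix formula with X as an (n, 1) design matrix.
X = x.reshape(-1, 1)
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)[0]

print(np.allclose(beta_scalar, beta_matrix))  # True
```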

  • Thanks for your answer. I don't think that's the problem but you're right in pointing out this issue. I will rewrite my attempt of proof more carefully. – Robert Smith Apr 02 '13 at 03:01
  • This makes no sense unless $x_n$ is a *scalar,* in which case the distinction between $x_n$ and $x_n^\prime$ is irrelevant. – whuber Jul 27 '20 at 20:17