I was reading Wikipedia's article on linear regression and I realized that I don't quite get the same result for the normal equation:
$$\beta = (X^{T}X)^{-1}X^{T}{\bf{y}} = (\frac{1}{n}\sum {\bf{x}}_{i}{\bf{x}}_{i}^{T})^{-1}(\frac{1}{n}\sum {\bf{x}}_{i}{{y}}_{i})$$
where:
$${\bf{y}}=\begin{pmatrix}y_{1}\\y_{2}\\ \vdots\\ y_{n}\end{pmatrix}$$ $${\bf{X}}=\begin{pmatrix}{\bf{x}}_{1}^{T}\\{\bf{x}}_{2}^{T}\\ \vdots\\ {\bf{x}}_{n}^{T}\end{pmatrix}$$ $$\beta=\begin{pmatrix}\beta_{1}\\\beta_{2}\\ \vdots\\ \beta_{p}\end{pmatrix}$$
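(As a sanity check on Wikipedia's formula itself, the matrix form and the summation form do agree numerically. Here is a quick sketch with synthetic data; the sizes and variable names are my own:)

```python
import numpy as np

# Synthetic data just to check that the two forms agree (sizes are made up).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))   # row i of X is x_i^T
y = rng.normal(size=n)

# Matrix form: (X^T X)^{-1} X^T y
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Summation form: (1/n sum_i x_i x_i^T)^{-1} (1/n sum_i x_i y_i)
A = sum(np.outer(x, x) for x in X) / n
b = sum(x * yi for x, yi in zip(X, y)) / n
beta_sum = np.linalg.solve(A, b)

print(np.allclose(beta_matrix, beta_sum))   # True
```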
Suppose we want to minimize an error function $E_{D}$ with respect to the parameters $\beta$:
$$E_{D} = \frac{1}{2} \sum_{n=1}^{N} \{y_{n}-\beta^{T}x_{n}\}^{2}$$
so we set the gradient with respect to $\beta$ to zero, obtaining:
$$\nabla E_{D} = \sum_{n=1}^{N} \{y_{n}-\beta^{T}x_{n}\}x_{n}^{T}=0$$
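(To spell out the intermediate step: differentiating componentwise,
$$\frac{\partial E_{D}}{\partial \beta_{j}} = -\sum_{n=1}^{N}\{y_{n}-\beta^{T}x_{n}\}x_{nj}, \qquad j=1,\dots,p,$$
and stacking the $p$ components, with the overall sign dropped since the gradient is set to zero, gives the expression above.)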
and solving for $\beta$ (the "fraction" here is shorthand for multiplying on the right by the inverse of the denominator matrix):
$$\beta^{T}=\frac{\sum_{n=1}^{N}y_{n}x_{n}^{T}}{\sum_{n=1}^{N}x_{n}x_{n}^{T}}$$
Taking $\bf{X}$ as described before, we recover the correct denominator, because
$${\bf{X}}^{T}{\bf{X}}=\begin{pmatrix}{\bf{x}}_{1} & {\bf{x}}_{2} & \dots & {\bf{x}}_{N}\end{pmatrix}\begin{pmatrix}{\bf{x}}_{1}^{T}\\{\bf{x}}_{2}^{T}\\ \vdots\\ {\bf{x}}_{N}^{T}\end{pmatrix}=\sum_{n=1}^{N}x_{n}x_{n}^{T}$$
However, in the numerator my calculation produces a row vector, $\sum_{n} y_{n}x_{n}^{T}$, where Wikipedia's formula has the column vector $\sum_{i} {\bf{x}}_{i}y_{i}$. What did I get wrong?
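For concreteness, here is a quick NumPy shape check of the mismatch I mean (synthetic data again; the names are mine):

```python
import numpy as np

# Made-up data, just to show the shapes.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))   # row n of X is x_n^T
y = rng.normal(size=(N, 1))   # column vector of targets

# The denominator sum_n x_n x_n^T is X^T X: a p x p matrix, as above.
print((X.T @ X).shape)        # (3, 3)

# My numerator sum_n y_n x_n^T stacks into y^T X: a 1 x p row vector...
print((y.T @ X).shape)        # (1, 3)

# ...while Wikipedia's sum_i x_i y_i is X^T y: a p x 1 column vector.
print((X.T @ y).shape)        # (3, 1)
```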