Help in understanding the ridge regression solution break down?

Question

I tried to follow Jann Goschenhofer's answer here, but I don't understand two steps:

1. How did $x_i^T$ in $Criterion_{Ridge} = \sum_{i=1}^{n}(y_i-x_i^T\beta)^2 + \lambda \sum_{j=1}^p\beta_j^2$ become just $X$, without a transpose, in $Criterion_{Ridge} = (y-X\beta)^T(y-X\beta) + \lambda\beta^T\beta$?
2. How did he replace $y^TX\beta$ with $\beta^TX^Ty$ in the expansion of $Criterion_{Ridge}$? He wrote that $y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta + \lambda\beta^T\beta$ is equal to $y^Ty - \beta^TX^Ty - \beta^TX^Ty + \beta^TX^TX\beta + \beta^T\lambda I\beta$. If he just used the fact that $(AB)^T=B^TA^T$, then he should have written $(\beta^TX^Ty)^T$ and not just $\beta^TX^Ty$.

For *numbers* (considered as $1\times 1$ matrices) $x$, it is obvious that $x^\prime=x.$ This relation is exploited repeatedly in the algebra. — whuber, Jun 18 '18 at 18:44
@whuber, can you please explain in more detail? Also, what about my second question? — theateist, Jun 18 '18 at 20:29

score 1 · Answer 1 · answered Jun 19 '18 at 07:41

How did he just replace $y^T X \beta$ with $\beta^TX^Ty$

As @whuber points out, the trick is to see that the two terms you are referring to are scalars (not vectors), so transposing them has no effect: $x^T = x$ for any scalar $x$.

You can see that these are scalars from their dimensions:

Let $m$ be the number of observations and $n$ be the number of features (the question calls these $n$ and $p$, respectively):

$y: m \times 1$
$\beta: n \times 1$
$X: m \times n$
$y^T X \beta: (1 \times m) \ (m \times n) \ (n \times 1) = 1 \times 1$
$\beta^T X^T y: (1 \times n) \ (n \times m) \ (m \times 1) = 1 \times 1 $
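The dimension bookkeeping above can also be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation; the sizes `m = 5`, `n = 3` and the random data are arbitrary choices made only for illustration.

```python
import numpy as np

m, n = 5, 3  # m observations, n features (arbitrary illustrative sizes)
rng = np.random.default_rng(0)

y = rng.normal(size=(m, 1))     # m x 1 response vector
beta = rng.normal(size=(n, 1))  # n x 1 coefficient vector
X = rng.normal(size=(m, n))     # m x n design matrix

term_a = y.T @ X @ beta    # (1 x m)(m x n)(n x 1)
term_b = beta.T @ X.T @ y  # (1 x n)(n x m)(m x 1)

# Both products collapse to a single number stored as a 1 x 1 array.
print(term_a.shape, term_b.shape)
```

Keeping `y` and `beta` as explicit column arrays (rather than 1-D arrays) makes the shapes line up with the $m \times 1$ and $n \times 1$ notation used here.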

Since a scalar equals its own transpose, apply the transpose rule $(AB)^T = B^TA^T$ to see that the two terms are the same:

$y^T X \beta = (y^T X \beta)^T = \beta^T X^T (y^T)^T = \beta^T X^T y$
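As a sanity check, the identity can be verified on random data. The sketch below is an illustration under assumptions rather than anything from the linked answer: the sizes, the random inputs and the penalty value `lam = 0.5` are all hypothetical. It also compares the sum form of the criterion with the matrix form, under the usual convention that row $i$ of $X$ is $x_i^T$, which is why no transpose appears on $X$ in the matrix form.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4  # arbitrary illustrative sizes
lam = 0.5    # hypothetical ridge penalty, chosen only for this check

X = rng.normal(size=(m, n))
y = rng.normal(size=(m, 1))
beta = rng.normal(size=(n, 1))

# The two cross terms are the same scalar.
lhs = (y.T @ X @ beta).item()
rhs = (beta.T @ X.T @ y).item()
same_scalar = np.isclose(lhs, rhs)

# Sum form of the criterion: row i of X plays the role of x_i^T.
sum_form = sum(
    (y[i, 0] - X[i, :] @ beta[:, 0]) ** 2 for i in range(m)
) + lam * np.sum(beta ** 2)

# Matrix form of the same criterion.
r = y - X @ beta
matrix_form = (r.T @ r + lam * beta.T @ beta).item()
forms_agree = np.isclose(sum_form, matrix_form)

print(same_scalar, forms_agree)
```

`np.isclose` is used instead of `==` because the two sides are computed in a different order and may differ by floating-point rounding.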
