
I tried to follow Jann Goschenhofer's answer here, but I don't understand two steps:

  1. How did $x_i^T$ in $Criterion_{Ridge} = \sum_{i=1}^{n}(y_i-x_i^T\beta)^2 + \lambda \sum_{j=1}^p\beta_j^2$ become just $X$, without a transpose, in $Criterion_{Ridge} = (y-X\beta)^T(y-X\beta) + \lambda\beta^T\beta$?
  2. How did he replace $y^TX\beta$ with $\beta^TX^Ty$ when expanding $Criterion_{Ridge}$? He wrote that $y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta + \lambda\beta^T\beta$ is equal to $y^Ty - \beta^TX^Ty - \beta^TX^Ty + \beta^TX^TX\beta + \beta^T\lambda I\beta$. If he only used the fact that $(AB)^T=B^TA^T$, then he should have written $(\beta^TX^Ty)^T$ and not just $\beta^TX^Ty$.
theateist
  • For *numbers* (considered as $1\times 1$ matrices) $x$, it is obvious that $x^\prime=x.$ This relation is exploited repeatedly in the algebra. – whuber Jun 18 '18 at 18:44
  • @whuber, can you please explain in more detail? Also, what about my second question? – theateist Jun 18 '18 at 20:29

1 Answer


How did he just replace $y^T X \beta$ with $\beta^TX^Ty$

As @whuber points out, the trick is to see that the two terms you are referring to are actually scalars (not vectors), so transposing them has no effect: $x^T = x$ for a scalar $x$.

You can see that these are scalars from their dimensions.

Let $m$ be the number of observations and $n$ be the number of features:

  • $y: m \times 1$
  • $\beta: n \times 1$
  • $X: m \times n$
  • $y^T X \beta: (1 \times m) \ (m \times n) \ (n \times 1) = 1 \times 1$
  • $\beta^T X^T y: (1 \times n) \ (n \times m) \ (m \times 1) = 1 \times 1 $

A scalar equals its own transpose, so applying the transpose property $(AB)^T = B^TA^T$ shows that the two terms are the same:

  • $y^T X \beta = (y^T X \beta)^T = \beta^T X^T (y^T)^T = \beta^T X^T y$
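
If it helps to see this numerically, here is a minimal NumPy sketch of the same argument. The data, the sizes `m` and `n`, the seed and the value of `lam` are all made up for illustration; only the shapes matter. It also checks the sum form of the ridge criterion against the matrix form, taking $x_i^T$ to be the $i$-th row of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for illustration only
m, n = 5, 3                      # m observations, n features (made-up sizes)
X = rng.normal(size=(m, n))      # X: m x n
y = rng.normal(size=(m, 1))      # y: m x 1
beta = rng.normal(size=(n, 1))   # beta: n x 1

a = y.T @ X @ beta               # (1 x m)(m x n)(n x 1) -> 1 x 1
b = beta.T @ X.T @ y             # (1 x n)(n x m)(m x 1) -> 1 x 1
print(a.shape, b.shape)

# Both products are 1 x 1, and transposing a 1 x 1 matrix changes nothing.
same_scalar = np.allclose(a, b)
transpose_is_noop = np.allclose(a.T, a)
print(same_scalar, transpose_is_noop)

# Sum form vs. matrix form of the ridge criterion (lam is an arbitrary value).
lam = 0.5
row_form = sum((y[i, 0] - X[i] @ beta[:, 0]) ** 2 for i in range(m)) \
    + lam * np.sum(beta ** 2)
resid = y - X @ beta
matrix_form = (resid.T @ resid + lam * beta.T @ beta).item()
print(np.isclose(row_form, matrix_form))
```

Here `X[i]` plays the role of $x_i^T$: stacking those rows gives $X$, so $X\beta$ collects every $x_i^T\beta$ at once and no transpose on $X$ is needed.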
Xavier Bourret Sicotte