
Let's say I have $$\mathbf{y} = (y_1, \ldots, y_N)^T \text{ and } \mathbf{X} \in \mathbb{R}^{N\times (M+1)},$$ so that $\mathbf{y} = \mathbf{X}\mathbf{w}$ makes dimensional sense with $\mathbf{w} = (w_0, \ldots, w_M)^T$. I'm representing the SSE as $$||\mathbf{Xw}-\mathbf{d}||_2^2 = (\mathbf{Xw}-\mathbf{d})^T(\mathbf{Xw}-\mathbf{d}).$$

My first (minor) question is this: what does it mean when people bluntly write $$\text{Note: } \nabla \mathbf{w}^T\mathbf{Aw} = (\mathbf{A}+\mathbf{A}^T\mathbf{w})?$$ I don't quite understand what $\mathbf{A}$ generally stands for in the domain of curve fitting.

Secondly, the following statement was made: $$\nabla E(\mathbf{w}) = 0$$ $$\mathbf{X}^T(\mathbf{X}\mathbf{w}-\mathbf{d})=0$$ If we have $N \geq M + 1$ distinct $x_i$, the solution is unique: $\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{d}$. Otherwise, there is an infinite number of solutions.
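To make the closed-form statement concrete, here is a small sketch with made-up data (the degree-2 polynomial, the inputs, and the targets are all my own illustrative choices): it solves the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{w}^* = \mathbf{X}^T\mathbf{d}$ and cross-checks against numpy's least-squares solver.

```python
import numpy as np

# Hypothetical example: fit a degree-2 polynomial (M = 2) to N = 5 points,
# so X has shape (N, M+1) and N >= M + 1 with distinct x_i.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # distinct inputs
d = np.array([1.0, 2.2, 4.9, 9.8, 17.1])  # targets (made up)
X = np.vander(x, N=3, increasing=True)    # columns: 1, x, x^2

# Closed-form solution w* = (X^T X)^{-1} X^T d, computed by solving
# the normal equations rather than forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ d)

# Cross-check with the library least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, d, rcond=None)
assert np.allclose(w_star, w_lstsq)

# The gradient condition X^T (X w* - d) = 0 holds at the minimizer.
assert np.allclose(X.T @ (X @ w_star - d), np.zeros(3), atol=1e-8)
```

In practice `np.linalg.solve` (or `lstsq` directly) is preferred over computing $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly, for numerical stability.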

Why exactly does that condition imply you have finite/infinite solutions and what does the asterisk on $\mathbf{w}$ represent?

My attempt at understanding the conditional statement is the following: if there are more training samples than weights, the samples pin down the weights uniquely. However, if there are more weights than samples, then multiple weight vectors can produce exactly the same fitted values (this follows from the rank properties of matrices).
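The non-unique case in that reasoning can be demonstrated directly. A minimal sketch, assuming a made-up underdetermined problem ($N = 2$ samples, $M + 1 = 3$ unknowns): $\mathbf{X}^T\mathbf{X}$ is singular, and adding any null-space direction of $\mathbf{X}$ to a particular solution fits the data equally well.

```python
import numpy as np

# Hypothetical underdetermined case: N = 2 samples but M + 1 = 3 unknowns,
# so X^T X (3x3) has rank at most 2, is singular, and w is not unique.
x = np.array([0.0, 1.0])                    # two distinct inputs
d = np.array([1.0, 3.0])
X = np.vander(x, N=3, increasing=True)      # shape (2, 3): columns 1, x, x^2
assert np.linalg.matrix_rank(X.T @ X) == 2  # rank < 3 -> not invertible

# lstsq still returns one particular (minimum-norm) solution ...
w_p, *_ = np.linalg.lstsq(X, d, rcond=None)
# ... but adding any multiple of a null-space vector of X fits equally well.
n = np.array([0.0, 1.0, -1.0])              # X @ n == 0 for this X
for t in (0.0, 1.0, -2.5):
    assert np.allclose(X @ (w_p + t * n), d)
```

The family $\mathbf{w}_p + t\,\mathbf{n}$ is exactly the "infinite number of solutions" mentioned in the quoted statement.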

Christian
    (1) The "blunt" formula must have a typographical error, because it makes no sense in terms of matrix multiplication. Most likely you forgot a factor of $w^\prime.$ If so, it's important to state the derivative is taken with respect to $w,$ not with respect to $A.$ (2) $A$ plays the role of $X^\prime X$ in your representation of the MSE. (3) The star on $w$ indicates it minimizes the error. (4) Your question is answered [elsewhere on this site.](https://stats.stackexchange.com/search?q=matrix+rank+inverse+more) – whuber Oct 04 '19 at 20:57
  • So would it be more accurate to write $\nabla_w \mathbf{w}^T\mathbf{A}\mathbf{w}$ – Christian Oct 04 '19 at 21:01
  • That would be clearer. BTW, we have many threads on [matrix calculus](https://stats.stackexchange.com/search?q=matrix+calculus). See especially https://stats.stackexchange.com/questions/236411, https://stats.stackexchange.com/questions/206332, and https://stats.stackexchange.com/questions/246738 (for a rigorous account with modern notation). – whuber Oct 04 '19 at 21:03
  • Can you please elaborate on your 2nd point? – Christian Oct 04 '19 at 21:03
  • $(Xw-d)^\prime(Xw-d) = w^\prime(X^\prime X)w - d^\prime Xw - w^\prime X^\prime d + d^\prime d.$ The leading (quadratic) term takes the form $w^\prime A w$ with $A=X^\prime X.$ – whuber Oct 04 '19 at 21:05
  • Ahh I see now thank you – Christian Oct 04 '19 at 21:21
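The two facts whuber points out in the comments can be checked numerically. A quick sketch with random made-up data (all the array names here are illustrative): it verifies the expansion $(\mathbf{Xw}-\mathbf{d})^T(\mathbf{Xw}-\mathbf{d}) = \mathbf{w}^T(\mathbf{X}^T\mathbf{X})\mathbf{w} - \mathbf{d}^T\mathbf{Xw} - \mathbf{w}^T\mathbf{X}^T\mathbf{d} + \mathbf{d}^T\mathbf{d}$, and the corrected identity $\nabla_{\mathbf{w}}\, \mathbf{w}^T\mathbf{Aw} = (\mathbf{A}+\mathbf{A}^T)\mathbf{w}$ via finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
d = rng.normal(size=5)
w = rng.normal(size=3)
A = X.T @ X  # A plays the role of X'X, as noted in the comments

# Term-by-term expansion of the SSE.
sse = (X @ w - d) @ (X @ w - d)
expanded = w @ A @ w - d @ X @ w - w @ X.T @ d + d @ d
assert np.isclose(sse, expanded)

# Central-difference gradient of f(w) = w^T A w, compared to (A + A^T) w.
eps = 1e-6
grad_fd = np.array([
    ((w + eps * e) @ A @ (w + eps * e)
     - (w - eps * e) @ A @ (w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad_fd, (A + A.T) @ w, atol=1e-5)
```

Since $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ is symmetric here, $(\mathbf{A}+\mathbf{A}^T)\mathbf{w}$ reduces to $2\mathbf{A}\mathbf{w}$, which is where the factor of 2 in the normal-equation derivation comes from.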

0 Answers