Let's say I have a target vector $$\mathbf{d} = (d_1, ..., d_N)^T \text{ and a design matrix } \mathbf{X} \in \mathbb{R}^{N\times (M+1)}.$$ Likewise, I have the model $\mathbf{y} = \mathbf{X}\mathbf{w}$ where $\mathbf{w} = (w_0, ..., w_M)^T$. I'm representing the SSE as $$E(\mathbf{w}) = ||\mathbf{Xw}-\mathbf{d}||_2^2 = (\mathbf{Xw}-\mathbf{d})^T(\mathbf{Xw}-\mathbf{d})$$
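For reference, expanding that quadratic gives $$E(\mathbf{w}) = \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\,\mathbf{d}^T\mathbf{X}\mathbf{w} + \mathbf{d}^T\mathbf{d},$$ which is the form the note below gets applied to.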
My first (minor) question is the following. What does it mean when people bluntly write $$\text{Note: } \nabla_{\mathbf{w}}\, \mathbf{w}^T\mathbf{A}\mathbf{w} = (\mathbf{A}+\mathbf{A}^T)\mathbf{w}$$ I don't quite understand what $\mathbf{A}$ generally stands for in the domain of curve fitting.
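I can at least verify the identity itself componentwise: $$\frac{\partial}{\partial w_k}\,\mathbf{w}^T\mathbf{A}\mathbf{w} = \frac{\partial}{\partial w_k}\sum_{i,j} A_{ij}\, w_i w_j = \sum_j A_{kj} w_j + \sum_i A_{ik} w_i = \big[(\mathbf{A}+\mathbf{A}^T)\mathbf{w}\big]_k,$$ but that derivation still doesn't tell me what role $\mathbf{A}$ plays in the fitting problem.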
Secondly, the following statement was made: $$\nabla E(\mathbf{w}) = 0$$ $$\mathbf{X}^T(\mathbf{X}\mathbf{w}-\mathbf{d})=0$$ If we have $N \geq M + 1$ distinct $x_i$, the solution is unique: $\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{d}$. Otherwise, there are infinitely many solutions.
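To make that closed form concrete for myself, here is a minimal NumPy sketch (the degree-2 polynomial and the data are made up), cross-checked against `np.linalg.lstsq`:

```python
import numpy as np

# Made-up example: N = 6 distinct x_i, degree-2 polynomial, so M + 1 = 3
# and the design matrix X is N x (M + 1).
rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
d = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(x.size)

X = np.vander(x, 3, increasing=True)  # columns: 1, x, x^2

# Closed form w* = (X^T X)^{-1} X^T d, solved without forming the inverse
w_star = np.linalg.solve(X.T @ X, X.T @ d)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.allclose(w_star, w_lstsq))  # True
```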
Why exactly does that condition determine whether the solution is unique or there are infinitely many, and what does the asterisk on $\mathbf{w}$ represent?
My attempt at understanding the conditional statement is the following: if there are at least as many training samples as unknown weights, the samples give enough independent equations to pin the weights down uniquely. However, if there are fewer samples than weights, the system is underdetermined: multiple combinations of weights can map to the same predictions, because $\mathbf{X}$ then has a nontrivial null space (this is based on the properties of matrices).
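To check this intuition numerically, here is a sketch with made-up dimensions ($N = 3$ samples but $M + 1 = 5$ weights) showing that two different weight vectors can give exactly the same fit:

```python
import numpy as np

# Made-up underdetermined case: N = 3 samples but M + 1 = 5 weights.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))
d = rng.standard_normal(3)

# The right singular vectors for the zero singular values span the
# null space of X; with rank 3 < 5 the last rows of Vt qualify.
_, _, Vt = np.linalg.svd(X)
n = Vt[-1]
print(np.allclose(X @ n, 0.0))  # True: X maps n to zero

# One particular solution (minimum-norm), then shift it along the null space:
w = np.linalg.pinv(X) @ d
print(np.allclose(X @ w, X @ (w + n)))  # True: same predictions, different weights
```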