In my machine learning course we have been taught that given a new axis $\mathbf{u}_j$ and a datapoint $\mathbf{x}_n$, the projection is $z_j = \mathbf{u}_j^T\mathbf{x}_n$. The variance of $z_j$ can then be shown to be $\mathbf{u}_j^T\mathbf{S}\mathbf{u}_j=\lambda_j$. So far so good.
But then my professor said that "the variance of the projected data" is:
$$ \sum_{j=1}^M\mathbf{u}_j^T\mathbf{S}\mathbf{u}_j=\sum_{j=1}^M\lambda_j $$
I fail to understand why the variance of the projected data is the sum of the variances along each new axis. Shouldn't the variance be a matrix, like $\operatorname{diag}(\lambda_1,\dots,\lambda_M)$?
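For what it's worth, here is a small numerical sketch I put together myself (a toy example in numpy, not from the lecture; all variable names are my own). The covariance of the projected data does come out as $\operatorname{diag}(\lambda_1,\dots,\lambda_M)$, and its trace matches $\sum_{j=1}^M\lambda_j$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples in D dimensions, projected onto the top M principal axes.
N, D, M = 500, 5, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated features
Xc = X - X.mean(axis=0)                                 # centre the data

S = Xc.T @ Xc / N                                       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)                    # eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # sort descending

U = eigvecs[:, :M]                                      # top-M axes u_1, ..., u_M
Z = Xc @ U                                              # projections z_j = u_j^T x_n

# Covariance of the projected data: an M x M matrix, numerically diag(lambda_1..lambda_M)
S_z = Z.T @ Z / N
print(np.round(S_z, 4))

# What my professor calls "the variance of the projected data": the trace of S_z,
# i.e. the sum of the per-axis variances, which equals the sum of the top-M eigenvalues.
print(np.trace(S_z), eigvals[:M].sum())
```

So the numbers agree; what I don't see is why the trace (the sum of the per-axis variances) is the right scalar to call "the variance of the projected data".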
In max-variance PCA, why is the variance of the projected data equal to $\sum_{j=1}^M\mathbf{u}^T_j\mathbf{S}\mathbf{u}_j$?