I am trying to derive the first principal component direction from the definition and need help finding which step goes wrong. Here's my attempt:
Let $\mathbf{X} \in \mathbb{R}^{N \times p}$ be the centered data matrix. Finding the first principal component direction means finding a vector $\mathbf{v}$ such that, when the data in $\mathbf{X}$ are projected onto $\mathbf{v}$, the variance of the projected data is maximized.
The variance of the projected data is given by
$$ \begin{aligned} \hat{\sigma}^2 &= \frac{1}{N} \frac{\mathbf{v}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{v}}{(\mathbf{v}^\intercal\mathbf{v})^2} \\ &= \frac{\mathbf{v}^\intercal \mathbf{S}\mathbf{v}}{(\mathbf{v}^\intercal\mathbf{v})^2} \end{aligned} $$ where $\mathbf{S}$ is the sample covariance matrix of the original data.
The $\mathbf{v}$ that maximizes $\hat{\sigma}^2$ should satisfy $\frac{d }{d \mathbf{v}} \hat{\sigma}^2 = 0$. Taking the differential,
$$ \begin{aligned} d (\hat{\sigma}^2) &= d \Big(\frac{\mathbf{v}^\intercal \mathbf{S}\mathbf{v}}{(\mathbf{v}^\intercal\mathbf{v})^2}\Big) \\ &= \frac{d \big(\mathbf{v}^\intercal \mathbf{S}\mathbf{v}\big)}{(\mathbf{v}^\intercal\mathbf{v})^2} + (\mathbf{v}^\intercal \mathbf{S}\mathbf{v} ) d\big( (\mathbf{v}^\intercal\mathbf{v})^{-2}\big) \\ &= \frac{2 \mathbf{v}^\intercal \mathbf{S}}{(\mathbf{v}^\intercal\mathbf{v})^2} d \mathbf{v} + (\mathbf{v}^\intercal \mathbf{S}\mathbf{v} ) (-2) \frac{2 \mathbf{v}^\intercal}{(\mathbf{v}^\intercal\mathbf{v})^3} d \mathbf{v} \end{aligned} $$
$$ \frac{d}{d \mathbf{v}} \hat{\sigma}^2 = \frac{2}{(\mathbf{v}^\intercal\mathbf{v})^2} \Big(\mathbf{v}^\intercal \mathbf{S} - \frac{2 (\mathbf{v}^\intercal \mathbf{S}\mathbf{v}) \mathbf{v}^\intercal}{\mathbf{v}^\intercal\mathbf{v}} \Big) $$
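As a sanity check on the differentiation step only, here is a minimal finite-difference sketch in Python (NumPy). It compares the derivative formula above with a numerical gradient of the objective exactly as I wrote it. The matrix `S` is a random symmetric positive semidefinite stand-in rather than real data, and the names `objective` and `gradient` are my own for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5

# Assumption: a random symmetric PSD matrix standing in for the covariance S.
A = rng.normal(size=(p, p))
S = A @ A.T / p

def objective(v):
    # The objective as written above: v^T S v / (v^T v)^2
    return (v @ S @ v) / (v @ v) ** 2

def gradient(v):
    # The derivative formula above, transposed into a column vector.
    vv = v @ v
    return 2.0 / vv**2 * (S @ v - 2.0 * (v @ S @ v) / vv * v)

v = rng.normal(size=p)
eps = 1e-6

# Central finite differences along each coordinate direction.
numeric = np.array([
    (objective(v + eps * e) - objective(v - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

max_abs_diff = np.max(np.abs(numeric - gradient(v)))
print("max |numeric - analytic| =", max_abs_diff)
```

If the two gradients agree to within finite-difference error, the calculus is consistent with the objective as stated. This says nothing about whether the objective itself is the right expression for the projected variance.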
Setting the derivative to zero (and taking the transpose) gives
$$ \begin{aligned} \mathbf{S}\mathbf{v} &= 2 \frac{\mathbf{v}^\intercal\mathbf{S}\mathbf{v}}{\mathbf{v}^\intercal\mathbf{v}} \mathbf{v} \\ &=2 \frac{\mathbf{v}^\intercal\mathbf{S}\mathbf{v}}{(\mathbf{v}^\intercal\mathbf{v})^2} (\mathbf{v}^\intercal\mathbf{v})\mathbf{v} \\ &= 2 \hat{\sigma}^2 (\mathbf{v}^\intercal\mathbf{v}) \mathbf{v} \end{aligned} $$
From the above, I can see that $\mathbf{v}$ has to be an eigenvector of $\mathbf{S}$. To make the first principal component direction unique, I require $\mathbf{v}$ to be a unit vector, which gives $$ \mathbf{S}\mathbf{v} = 2 \hat{\sigma}^2 \mathbf{v} $$
Now, to maximize the variance, $\mathbf{v}$ has to be the principal eigenvector of $\mathbf{S}$ (because the eigenvalue is proportional to the variance).
However, something has to be wrong here, because I know that the variance is not just proportional to, but equal to, the largest eigenvalue of $\mathbf{S}$. Where did I go wrong?
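For reference, the fact I am relying on can be checked numerically. Below is a minimal sketch in Python (NumPy), assuming synthetic Gaussian data and the $\frac{1}{N}$ convention for $\mathbf{S}$ used above. The variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 4

# Assumption: synthetic correlated data, centered column-wise.
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

# Sample covariance with the 1/N convention.
S = X.T @ X / N

# eigh returns eigenvalues in ascending order for symmetric matrices,
# with unit-norm eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(S)
lam_max = eigvals[-1]
v = eigvecs[:, -1]

# Scores along the unit vector v; their mean is zero because X is centered,
# so the 1/N variance is just the mean of the squared scores.
z = X @ v
proj_var = np.mean(z**2)

print("largest eigenvalue :", lam_max)
print("projected variance :", proj_var)
```

The two printed numbers should agree up to floating-point error, which is the equality (not mere proportionality) that my derivation fails to reproduce.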