They are not necessarily confined to the diagonal. The underlying assumption behind CCA is that $X$ and $Y$ share some low-dimensional latent factor, so that $X \approx Az_x + Cz$ and $Y \approx Bz_y + Dz$ with $z_x, z_y, z$ all independent. CCA approximates the shared latent factor $z$ from both ends by trying to find projections $w$ and $v$ that invert $C$ and $D$ (and map the column spaces of $A$ and $B$ to 0). But it's not guaranteed to succeed perfectly, nor are the data guaranteed to conform to that generative model.
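To make the setup concrete, here is a minimal NumPy sketch of that generative model; the dimensions (3 observed, 2 latent), the loading matrices, and the noise scale are all my own illustrative choices, not anything canonical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # illustrative sample size

# Independent latent factors: z is shared, z_x and z_y are view-specific
z   = rng.standard_normal((n, 2))
z_x = rng.standard_normal((n, 2))
z_y = rng.standard_normal((n, 2))

# Illustrative loading matrices (3 observed dimensions, 2 latent dimensions)
A, C = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
B, D = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))

# X ~= A z_x + C z and Y ~= B z_y + D z, with "~=" read as additive iid noise
X = z_x @ A.T + z @ C.T + 0.1 * rng.standard_normal((n, 3))
Y = z_y @ B.T + z @ D.T + 0.1 * rng.standard_normal((n, 3))
```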
Example
If you believe the model, then the distance between the samples and the diagonal would measure both the contamination by noise (implied by the $\approx$) and the variation due to $z_x$ or $z_y$. For example, suppose $A = B = 0$ and $C = D = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 1 & 0 \end{bmatrix}$, and take the $\approx$ to mean that iid Gaussian noise gets added to the entries after everything else is finished. Then CCA might use the following projections (caveat: I haven't done the actual math to confirm these are correct, and they also are not normalized; see the numerical check after the list).
- $[0, 1, 1 ]$ (reconstructs the first latent coordinate)
- $[2, -1, -1]$ (reconstructs the second latent coordinate)
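As a quick sanity check of those projections in the noiseless case ($A = 0$ and no additive noise), a short NumPy verification; it suggests each projection recovers one latent coordinate, scaled by a factor of 2:

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.array([[1, 1],
              [1, 0],
              [1, 0]])              # the C (= D) from the example
z = rng.standard_normal((1000, 2))  # shared latent factor
X = z @ C.T                         # noiseless, A = 0, so X is exactly C z

w1 = np.array([0, 1, 1])            # first projection above
w2 = np.array([2, -1, -1])          # second projection above

# Each projection recovers one latent coordinate, up to a factor of 2
print(np.allclose(X @ w1, 2 * z[:, 0]))  # True
print(np.allclose(X @ w2, 2 * z[:, 1]))  # True
```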
But even though I've given CCA the oracular truth about the projections it should be using, the noise means that those projections cannot exactly reconstruct the latent variates as they occurred during the data generation. Making $A$ and $B$ nonzero would distort the results further.
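Continuing the snippet above with the noise switched on (the noise scale is an arbitrary choice of mine), the reconstruction stays good but is no longer exact:

```python
# Same X, w1, z as in the check above, now with iid Gaussian noise added
X_noisy = X + 0.5 * rng.standard_normal(X.shape)
print(np.corrcoef(X_noisy @ w1, z[:, 0])[0, 1])  # high (roughly 0.94), but not 1
```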
Worst-case scenario
In the situation that $A = B = C = D$, the projections are not even reconstructing $z_i$ for case $i$; at best they recover $z_{x,i} + z_i$ and $z_{y,i} + z_i$. If $A = 500C$, this worsens to $500 z_{x,i} + z_i$. This is no longer a problem with estimation error so much as a fundamental identifiability issue: even infinite data won't help. If $A$ and $B$ are merely similar to $C$ and $D$, that will also make things really difficult short of an infinite amount of data.
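A small simulation of this failure mode, using scikit-learn's CCA with scalar latents and $A = C$, $B = D$ (all specifics are my own illustrative choices): the first canonical variate tracks $z_x + z$ almost perfectly, but correlates with the shared factor $z$ alone only at about $1/\sqrt{2}$.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
n = 5000
z, z_x, z_y = (rng.standard_normal(n) for _ in range(3))

c = np.array([1.0, 1.0, 1.0])   # shared loading; A = C and B = D collapse to this
X = np.outer(z_x + z, c) + 0.1 * rng.standard_normal((n, 3))
Y = np.outer(z_y + z, c) + 0.1 * rng.standard_normal((n, 3))

u, v = CCA(n_components=1).fit_transform(X, Y)

# The canonical variate recovers z_x + z, not the shared factor z alone
print(abs(np.corrcoef(u[:, 0], z_x + z)[0, 1]))  # close to 1
print(abs(np.corrcoef(u[:, 0], z)[0, 1]))        # close to 1/sqrt(2) ~ 0.71
```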