5

In the picture below, why is u the direction of greatest variance? Aren't the data points further away from v than from u? Variance is the sum over the squared distances of each data point from the mean, right?

It would just seem to me that if I measured and summed the distances of the data points to v, it would be greater than the sum of the distances of each data point to u.

enter image description here

gung - Reinstate Monica
user3813234
  • 2
    -1 You need to post figures to your question, not leave links for someone to have to follow. You should also use standard formats that people will be able to open without difficulty (eg, png, jpeg, etc). – gung - Reinstate Monica Jun 21 '17 at 18:00
  • 2
    Because $\mathbf v$ is a vector, not a location, "the distances of the data points to v" is not meaningful. If you were to tilt your head enough to make $\mathbf u$ point horizontally and $\mathbf v$ point up, then you could interpret the variance in the $\mathbf u$ direction as the variance of the horizontal coordinates (and the variance in the $\mathbf v$ direction as the variance of the vertical coordinates). Which coordinates have the greater spread? – whuber Jun 21 '17 at 18:51
  • 3
    It should be obvious from the picture that if you look at the center of the ellipse the points spread out much more widely in the u direction than in the v direction. So that is the direction that explains most of the variation in the data. – Michael R. Chernick Jun 21 '17 at 18:51
  • 4
    +1. I think the Q is good because the confusion is subtle. Projections on $u$ have the same length as distances to $v$, hence maximizing the variance of the projections on $u$ is the same as maximizing the sum of squared distances to $v$. Does the animation in my answer here https://stats.stackexchange.com/a/140579/28666 (and surrounding text) help? (Thanks @gung for editing.) – amoeba Jun 22 '17 at 16:06

2 Answers

2

Your confusion here comes from a misunderstanding of how Cartesian coordinates work. Remember: the orthogonal distances of the points from the axis labeled $\mathbf{v}$ are (the absolute values of) the $u$ coordinates. That is, they measure how far each point lies from the origin in the direction parallel to the vector $\mathbf{u}$. You are absolutely correct that the variability of the distances away from the axis $\mathbf{v}$ is greater than the corresponding variability of the distances away from the axis $\mathbf{u}$, but the distances from $\mathbf{v}$ are the $u$ coordinates, not the $v$ coordinates!
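If it helps to see this numerically, here is a minimal sketch in plain Python. The data are synthetic and the 30-degree orientation of $\mathbf{u}$ is an assumption made for illustration; nothing here is taken from the figure in the question. It checks that each point's perpendicular distance from the $\mathbf{v}$ axis equals the absolute value of its $u$ coordinate, so summing squared distances to $\mathbf{v}$ is the same as summing squared $u$ coordinates.

```python
import math
import random

random.seed(0)

# Hypothetical unit vectors: u points at 30 degrees, v is perpendicular to u.
theta = math.radians(30)
u = (math.cos(theta), math.sin(theta))
v = (-math.sin(theta), math.cos(theta))

# Synthetic data (an assumption, not the asker's data): wide along u, narrow along v.
raw = []
for _ in range(1000):
    a = random.gauss(0, 3)  # coordinate along u
    b = random.gauss(0, 1)  # coordinate along v
    raw.append((a * u[0] + b * v[0], a * u[1] + b * v[1]))

# Center the cloud so that squared distances from the origin relate to variance.
mx = sum(p[0] for p in raw) / len(raw)
my = sum(p[1] for p in raw) / len(raw)
points = [(x - mx, y - my) for x, y in raw]

def dot(p, q):
    return p[0] * q[0] + p[1] * q[1]

def dist_to_axis(p, d):
    """Perpendicular distance from p to the line through the origin along unit vector d."""
    t = dot(p, d)
    return math.hypot(p[0] - t * d[0], p[1] - t * d[1])

u_coords = [dot(p, u) for p in points]
v_coords = [dot(p, v) for p in points]

# The distance from the v axis is the absolute value of the u coordinate.
max_gap = max(abs(dist_to_axis(p, v) - abs(c)) for p, c in zip(points, u_coords))

ss_dist_to_v = sum(dist_to_axis(p, v) ** 2 for p in points)
ss_dist_to_u = sum(dist_to_axis(p, u) ** 2 for p in points)
ss_u_coords = sum(c * c for c in u_coords)
ss_v_coords = sum(c * c for c in v_coords)

print("largest |distance to v axis - |u coordinate||:", max_gap)
print("sum of squared distances to v axis:", ss_dist_to_v)
print("sum of squared u coordinates:      ", ss_u_coords)
print("sum of squared distances to u axis:", ss_dist_to_u)
print("sum of squared v coordinates:      ", ss_v_coords)
```

Dividing either sum of squares by the number of points gives the variance of the corresponding coordinate, since the cloud has been centered. The large sum of squared distances to $\mathbf{v}$ is therefore exactly what makes $\mathbf{u}$ the high-variance direction.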

Ben
0

The direction of greatest variance is the direction along which the data points vary the most: if you project the points onto it, the projected values (with their minimum, maximum, and average) are more spread out than along any other direction, so their variance, and usually their range, is highest. In our case, that is the direction 'u'. In the direction 'v', the data points do not vary as much as they do in the direction 'u'.
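One way to make "the direction in which the points vary the most" concrete is to sweep over candidate directions and compute the variance of the projections onto each. The sketch below does this in plain Python on synthetic data; the cloud, its 30-degree long axis, and the helper name `variance_along` are all assumptions for illustration, not something taken from the question's figure.

```python
import math
import random
import statistics

random.seed(1)

# Synthetic cloud (an assumption): long axis at 30 degrees, short axis at 120 degrees.
long_axis = math.radians(30)
u = (math.cos(long_axis), math.sin(long_axis))
v = (-math.sin(long_axis), math.cos(long_axis))

points = []
for _ in range(2000):
    a = random.gauss(0, 3)  # spread along the long axis
    b = random.gauss(0, 1)  # spread along the short axis
    points.append((a * u[0] + b * v[0], a * u[1] + b * v[1]))

def variance_along(pts, angle_deg):
    """Variance of the points' projections onto the unit vector at angle_deg."""
    t = math.radians(angle_deg)
    d = (math.cos(t), math.sin(t))
    projections = [x * d[0] + y * d[1] for x, y in pts]
    return statistics.pvariance(projections)

# Directions at angle a and a + 180 give the same variance, so 0..179 covers everything.
variances = {angle: variance_along(points, angle) for angle in range(180)}
best = max(variances, key=variances.get)
worst = min(variances, key=variances.get)

print("direction of greatest variance (degrees):", best)
print("direction of least variance (degrees):   ", worst)
```

On data like this, the best direction should land close to the long axis of the cloud and the worst direction roughly perpendicular to it, which is the relationship between 'u' and 'v' described above.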