I've generated 100 random vectors (data points) in $n \in \{1, \dots, 50\}$ dimensions. For each $n$, I computed the distance between every pair of vectors and took the mean value, once using Euclidean distance and once using cosine similarity.

Also worth noting: all vector components are in the range $[0, 1)$.
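
For reference, here is a minimal sketch of this setup in Python (my reconstruction, assuming `numpy` and `scipy`; the original code isn't shown in the question):

```python
import numpy as np
from scipy.spatial.distance import pdist

np.random.seed(0)  # seed added for reproducibility; not part of the original setup

mean_euclidean = []
mean_cosine_sim = []
for n in range(1, 51):                      # dimensions n = 1, ..., 50
    X = np.random.random((100, n))          # 100 vectors with components in [0, 1)
    mean_euclidean.append(pdist(X, "euclidean").mean())
    # scipy's "cosine" metric is the cosine *distance* 1 - cos(theta),
    # so convert back to similarity before averaging
    mean_cosine_sim.append((1.0 - pdist(X, "cosine")).mean())
```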

This is the graph I get:

[Plot: mean pairwise value vs. number of dimensions, one curve for Euclidean distance and one for cosine similarity]

From what I've read on the topic, there shouldn't be a significant difference between these two measures, and both of them are affected by the curse of dimensionality.

AFAIK cosine similarity lies in the interval $[-1, 1]$, and in my case (all vector components are positive) it lies in $[0, 1]$. Is that why the cosine similarity is "much" smaller than the Euclidean distance? And how come the cosine similarity curve isn't rising with the number of dimensions?

Note: this is a question from my lab exercise: which of these measures would you use in a high-dimensional space?

In my opinion there isn't a clear answer to this.

Jamess11
  • By generating vectors of arbitrary length, you can make the average Euclidean distance become as large or small as you wish. Everything depends, then, on *how* you generate these random numbers. But that begs the question: because the magnitude of the Euclidean distance is perfectly arbitrary, it makes no sense in general to compare it to the cosine distance. As far as the lab question goes, perhaps the proper response ought to be "for what purpose?" Can you provide some context for that? – whuber Dec 16 '21 at 20:00
  • The purpose would be classification of data. Or more specific, which metric would I choose to get better results I suppose. – Jamess11 Dec 16 '21 at 20:05
  • Shouldn't the answer depend on the nature of the data and the costs of mis-classification? – whuber Dec 16 '21 at 20:13
  • I assume so, but this is all the context I was given. – Jamess11 Dec 16 '21 at 20:16
  • Thank you for accepting my answer, but Cross Validated etiquette says that you should wait longer than a few minutes after posting a question to accept an answer. People tend not to reply to questions with accepted answers, meaning that an even better answer than mine is unlikely. – Dave Dec 16 '21 at 20:18
  • Do not forget that cosine is based on vectors of normalized, unit length. $CS = 1 - d^2/2$, where $d$ is the chord distance (a particular case of Euclidean distance); see the numeric check after these comments. – ttnphns Dec 16 '21 at 20:19
  • Oh, I wasn't aware of that. Your answer makes sense to me so I accepted it instantly. I will undo it and wait a little longer to see if someone else pitches in. – Jamess11 Dec 16 '21 at 20:20
  • My comment contained a hint. Recall that as you add dimensions, the bulk of the points crawls more and more toward the corners of the hypercube. But the hypersphere, which is what the chord/cosine measure is based on, keeps points in the central region. – ttnphns Dec 16 '21 at 20:30
  • @ttnphns That's interesting, I haven't thought about it like that at all. Thanks for the clarification. – Jamess11 Dec 16 '21 at 20:32
  • A full explanation of the behavior you are seeing probably requires you to say explicitly how you generated the data. – guy Dec 16 '21 at 20:33
  • @guy I just used Python's numpy.random.random(n) in a loop with 100 iterations, where n is the number of dimensions. – Jamess11 Dec 16 '21 at 20:34
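
A quick numeric check of the identity from ttnphns's comment (a sketch of my own, not code from the thread): draw two random vectors, normalize them to unit length, and compare the cosine similarity with $1 - d^2/2$, where $d$ is the chord (Euclidean) distance.

```python
import numpy as np

rng = np.random.default_rng(1)              # hypothetical seed, for reproducibility
u = rng.random(50)
v = rng.random(50)
u /= np.linalg.norm(u)                      # project onto the unit hypersphere
v /= np.linalg.norm(v)

cos_sim = u @ v                             # for unit vectors, dot product = cos(theta)
chord = np.linalg.norm(u - v)               # chord distance d
print(np.isclose(cos_sim, 1 - chord**2 / 2))   # True
```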

1 Answer

This makes sense to me. When you add more dimensions, you are, in some sense, giving the points more ways to be far apart. On the number line, $x_0=0$ and $x_1 = 1$ are only $1$ unit apart, but if we add a second dimension, $(x_0, y_0) = (0, 0)$ and $(x_1, y_1) = (1, 100)$ are much more than one unit apart. The angular measurement, however, is always restricted to the plane spanned by your two points, no matter how many dimensions surround it.
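
To make that concrete, here is a small illustration of my own (not part of the original answer): pad two fixed vectors with extra random components and watch the Euclidean distance grow without bound while the cosine similarity stays inside $[-1, 1]$.

```python
import numpy as np

rng = np.random.default_rng(2)              # hypothetical seed, for reproducibility
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
for extra in (0, 10, 1000):
    uu = np.concatenate([u, rng.random(extra)])   # append `extra` random components
    vv = np.concatenate([v, rng.random(extra)])
    eucl = np.linalg.norm(uu - vv)
    cos_sim = uu @ vv / (np.linalg.norm(uu) * np.linalg.norm(vv))
    print(extra, round(eucl, 2), round(cos_sim, 3))
# the Euclidean distance keeps growing with `extra`;
# the cosine similarity cannot leave [-1, 1]
```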

We have two questions on here that might interest you.

Why is Euclidean distance not a good metric in high dimensions?

Square loss for "big data"

EDIT

You can decide if this makes you like or dislike cosine distance, but consider the points $(0, 1)\in\mathbb R^2$ and $(1, 0)\in\mathbb R^2$. They have the same cosine distance as $(0, 1)$ and $(2, 0)$, but the Euclidean distances are different.
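
A quick check of this example (a sketch; `cos_dist` is a helper I'm defining here, not a library function):

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance 1 - cos(theta) between vectors a and b."""
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.0, 1.0])
b = np.array([1.0, 0.0])
c = np.array([2.0, 0.0])
print(cos_dist(a, b), cos_dist(a, c))                # 1.0 1.0  -- identical
print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # ~1.414 vs ~2.236 -- different
```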

Dave
  • I don't follow the logic. If you want information useful for *distinguishing* data, why wouldn't the Euclidean distance perforce be preferable to a bounded metric like the cosine distance? ;-) – whuber Dec 16 '21 at 20:14
  • Regarding which you should use in high dimensions, without more context, the question is too open-ended, though my guess is that the full-credit answer to your homework involves arguing in favor of cosine similarity. – Dave Dec 16 '21 at 20:14
  • This makes intuitive sense to me, but [this](https://stats.stackexchange.com/a/341577/308364) answer made me think otherwise so I wasn't sure. – Jamess11 Dec 16 '21 at 20:15
  • @whuber I do not see where I argue in favor of cosine distance. Could you please point me to where you see it? I would like to edit the post to clarify. – Dave Dec 16 '21 at 20:16
  • Could you clarify, then, what you are referring to by "this makes sense to me"? – whuber Dec 16 '21 at 21:29
  • You have more room to be far apart. The Euclidean distance is going to increase unless the points have exactly the same values in the extra vector components. – Dave Dec 16 '21 at 21:34
  • Okay--but it would be nice to see that clarification in your post. On that basis, why wouldn't you recommend using, say, a hyperbolic distance? – whuber Dec 17 '21 at 16:21