I am aware of cosine similarity, which measures the cosine of the angle between two vectors. A prototype for cosine similarity would look something like this:

float cosine_similarity(vector<float> a, vector<float> b);
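
For reference, a minimal sketch of this two-vector version (my own illustration, assuming both inputs have the same non-zero length and neither is all zeros) could look like:

```cpp
#include <cmath>
#include <vector>
using std::vector;

// Sketch: cosine similarity as the dot product of the two unit vectors.
float cosine_similarity(const vector<float>& a, const vector<float>& b) {
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```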

Are there any similarity measures that quantify the similarity among $n$ vectors? The prototype of such a function would look something like this:

float cosine_similarity(vector<vector<float>> vectors);

PS: I am NOT looking for a clustering/classification sort of thing.

I would appreciate it if someone could point me in the right direction.

Thanks!

silent_dev
  • Would you like a single number to represent the similarity among all the samples in a group? What should this value signify in your case? The purity of the group? Or perhaps an average of all pairwise cosine similarities? – turdus-merula Oct 07 '16 at 22:36
  • If you start to look at all pairwise cosine similarities, then this falls under a [standard class](http://stats.stackexchange.com/a/22520/127790) of "linear dependence" metrics. This suggests a natural possibility could be linear [dimension reduction](http://stats.stackexchange.com/a/134283/127790) techniques. – GeoMatt22 Oct 08 '16 at 02:54
  • @GeoMatt22 could you please elaborate? – silent_dev Oct 08 '16 at 07:37
  • @user115202 yeah I'd like for it to be a single number but definitely more sophisticated than an average of cosine similarities. – silent_dev Oct 08 '16 at 07:38
  • @user3667569 there are many variants, and the best choice will vary depending on your data and goals. However, the basic idea is to do an [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Low-rank_matrix_approximation) on a [data matrix](https://en.wikipedia.org/wiki/Design_matrix) of your vectors. A simple example would be to use the ratio of the first singular value and the matrix norm, which is similar to an $R^2$, giving the "fraction of variance explained", a number between 0 and 1. ([PageRank](https://en.wikipedia.org/wiki/Google_matrix) is similar.) – GeoMatt22 Oct 08 '16 at 08:30
  • @GeoMatt22 possible to give an example of what you said as an answer? – silent_dev Oct 08 '16 at 22:15
  • @user3667569 I put an answer, as you requested. – GeoMatt22 Oct 09 '16 at 05:13
  • @GeoMatt22: Thanks, I will wait for some more time for the answer before accepting it as the final answer. Thanks again, gave me some important things to think about. – silent_dev Oct 09 '16 at 13:48

1 Answer

The cosine similarity between two column vectors $x_1$ and $x_2$ is simply the dot product between their unit vectors $$\mathrm{CosSim}[x_1,x_2]=\frac{x_1}{\|x_1\|}\bullet\frac{x_2}{\|x_2\|}$$ and varies from -1 to +1, similar to a correlation coefficient $R$ (to which it is closely related).
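
As a quick numeric check (a worked example added for illustration), take $x_1=(1,0)^T$ and $x_2=(1,1)^T$: $$\mathrm{CosSim}[x_1,x_2]=\frac{(1)(1)+(0)(1)}{1\cdot\sqrt{2}}=\frac{1}{\sqrt{2}}\approx 0.707,$$ the cosine of the $45^\circ$ angle between them.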

There are many ways to generalize this idea to a set of $n$ vectors, but as requested in the comments, here I will expand on one possibility, related to the idea of (linear) dimensionality reduction.

First, in the $n=2$ case, we can think of $R^2\in[0,1]$ as the "fraction of the variance explained" by a linear regression. For the squared cosine similarity, a similar interpretation is possible, except the associated regression has no "intercept term" (i.e. the vectors are scaled, but not centered).

In the general case of $n$ vectors, we can concatenate them into a matrix $$X=\begin{bmatrix}x_1 \, \ldots \, x_n\end{bmatrix}$$ Analogous to the cosine similarity case, we can consider dot products based on the scaled matrix $$Y=\begin{bmatrix}y_1 \, \ldots \, y_n\end{bmatrix}$$ assembled from the corresponding unit vectors $$y_i=\frac{x_i}{\|x_i\|}$$ In general we could then conduct various analyses on the "correlation matrix" $Y^TY$, including dimension reduction via PCA.
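
As an illustration of this construction (my own sketch, not part of the original answer; it assumes the Eigen library and the question's std::vector<std::vector<float>> input, one inner vector per column):

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch: assemble the scaled matrix Y (one unit vector per column) and its
// "correlation matrix" Y^T Y, whose (i,j) entry is CosSim[x_i, x_j].
Eigen::MatrixXf scaled_matrix(const std::vector<std::vector<float>>& vectors) {
    const int m = static_cast<int>(vectors[0].size());   // dimension of each vector
    const int n = static_cast<int>(vectors.size());      // number of vectors
    Eigen::MatrixXf Y(m, n);
    for (int j = 0; j < n; ++j) {
        Eigen::Map<const Eigen::VectorXf> x(vectors[j].data(), m);
        Y.col(j) = x / x.norm();                          // y_i = x_i / ||x_i||
    }
    return Y;
}

Eigen::MatrixXf correlation_matrix(const Eigen::MatrixXf& Y) {
    return Y.transpose() * Y;                             // pairwise cosine similarities
}
```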

The general idea here is to create a low-rank approximation to the matrix $Y$ by truncating its singular value decomposition (SVD). Mathematically, we have $$Y=USV^T=\sum_{k=1}^n\sigma_ku_kv_k^T$$ where the singular values $\sigma_1,\ldots,\sigma_n$ are non-negative and arranged in decreasing order. (For simplicity I am assuming that for $x_i\in\mathbb{R}^m$ the number of vectors satisfies $n\leq m$; otherwise the sum will have $m$ terms.) If we truncate the sum at $K<n$, then we get a "rank $K$ approximation" $\hat{Y}_K$ to the matrix $Y=\hat{Y}_n$. The truncated SVD is optimal in the sense that it minimizes the squared Frobenius norm of the reconstruction error, $\|\hat{Y}_K-Y\|_F^2$.
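
To make the truncation concrete, here is a brief sketch (my own illustration, again assuming Eigen and the scaled matrix $Y$ from above) of forming the rank-$K$ approximation from the thin SVD:

```cpp
#include <Eigen/Dense>

// Sketch: rank-K truncation of the SVD, Y_K = sum_{k<=K} sigma_k u_k v_k^T.
// The squared Frobenius error ||Y_K - Y||_F^2 equals the sum of the discarded sigma_k^2.
Eigen::MatrixXf rank_k_approximation(const Eigen::MatrixXf& Y, int K) {
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(Y, Eigen::ComputeThinU | Eigen::ComputeThinV);
    const Eigen::VectorXf s = svd.singularValues();       // sigma_1 >= sigma_2 >= ... >= 0
    return svd.matrixU().leftCols(K)
         * s.head(K).asDiagonal()
         * svd.matrixV().leftCols(K).transpose();
}
```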

The squared Frobenius norm is just the sum of the squares of all the entries of a matrix. In particular, for the scaled matrix $Y$ we will always have $$\|Y\|_F^2=n$$ For any matrix, the squared Frobenius norm is also equal to the sum of the squares of the singular values. In particular, for the rank-$K$ approximation $\hat{Y}_K$ we have $$\|\hat{Y}_K\|_F^2=\sum_{k=1}^K\sigma_k^2$$ So we can define an "order $K$ coefficient of determination" as $$R_K^2\equiv\frac{\|\hat{Y}_K\|_F^2}{\|Y\|_F^2}\in[0,1]$$
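
In code, $R_K^2$ is then just a prefix sum over the squared singular values divided by $n$ (a sketch under the same Eigen assumption, with $Y$ having unit-norm columns):

```cpp
#include <Eigen/Dense>

// Sketch: "order K coefficient of determination" R_K^2 = (sum_{k<=K} sigma_k^2) / n,
// where Y has n unit-norm columns, so ||Y||_F^2 = n.
float r_squared_k(const Eigen::MatrixXf& Y, int K) {
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(Y);              // singular values only
    const Eigen::VectorXf s = svd.singularValues();
    return s.head(K).squaredNorm() / static_cast<float>(Y.cols());
}
```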

A simple option would be to just use the first singular value, i.e. $$R_1^2\equiv\frac{\sigma_1^2}{n}$$ The maximum similarity would be for $n$ parallel vectors (i.e. scalar multiples of a single vector), giving $R_1^2=1$. The minimum similarity would be for $n$ orthogonal vectors, giving $R_1^2=\frac{1}{n}$.
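
As a quick check of these two extremes (a worked example added for illustration): if all $n$ columns equal the same unit vector $u$, then $Y=u\mathbf{1}^T$ has rank one, so $\sigma_1^2=\|Y\|_F^2=n$ and $R_1^2=1$; if the $n$ columns are mutually orthogonal unit vectors, then $Y^TY=I$, every $\sigma_k=1$, and $R_1^2=1/n$.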

A few final notes:

  • The first singular value $\sigma_1$ can be computed via an iterative method, so no detailed SVD need be computed. (Similar to PageRank; e.g. svds(Y,1) in Matlab.)
  • To get a similarity that can range over the entire [0,1] interval, you could simply normalize, i.e. $$\hat{R}_1^2\equiv\frac{\sigma_1^2-1}{n-1}$$
  • So in Matlab the similarity function could be defined via CosSimN = @(Y) (svds(Y,1)^2-1)/(size(Y,2)-1), if we assume that $Y$ is already available as input. (A C++ sketch along these lines appears after these notes.)
  • For $n=2$ the above will reduce to the absolute value of the standard cosine similarity.
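
Putting the pieces together, here is a C++ sketch of the $n$-vector similarity with the prototype requested in the question (my own illustration; it assumes the Eigen library, and a power-iteration routine in the spirit of svds(Y,1) could replace the full SVD):

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch: normalize each input vector, stack them as the columns of Y,
// take the leading singular value sigma_1, and return the normalized
// similarity (sigma_1^2 - 1) / (n - 1), which ranges over [0, 1].
float cosine_similarity(const std::vector<std::vector<float>>& vectors) {
    const int m = static_cast<int>(vectors[0].size());   // dimension of each vector
    const int n = static_cast<int>(vectors.size());      // number of vectors, n >= 2
    Eigen::MatrixXf Y(m, n);
    for (int j = 0; j < n; ++j) {
        Eigen::Map<const Eigen::VectorXf> x(vectors[j].data(), m);
        Y.col(j) = x / x.norm();                          // requires non-zero vectors
    }
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(Y);             // singular values only
    const float sigma1 = svd.singularValues()(0);
    return (sigma1 * sigma1 - 1.0f) / static_cast<float>(n - 1);
}
```

For example, calling this on three copies of the same vector returns 1, while three mutually orthogonal vectors return 0.
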
GeoMatt22
  • thank you for your elaborate answer. Does the equation CosSimN = @(Y) (svds(Y,1)^2-1)/(size(Y,2)-1) also hold for n > 2 and possibly negative data? If not, is there a way to support negative data? – D. Phi Dec 10 '20 at 13:12
  • @D.Phi sorry for the confusion. My proposed $n$-vector similarity measure $\hat{R}_1^2$ can be applied to any collection of vectors. (The only limitation is that no vector is entirely zeros, which also applies to cosine similarity.) For the special case of $n=2$, the result will just be the absolute value of the cosine similarity. So $n > 2$ and negative data are no problem. – GeoMatt22 Dec 10 '20 at 17:44
  • Thanks, I appreciate your answer. I implemented your code and found an error. You divide by the column count of Y instead of n (which is row count * column count). Here is the code which works for me: CosSimN = @(Y) (svds(Y,1)^2-1) / (prod(size(Y)) - 1) – D. Phi Dec 14 '20 at 19:11
  • I've noticed that your proposed formula does not work well on vectors with many zeros. E.g. if Y = [[1,0,0],[1,0,0],[1,0,0]], then CosSimN(Y) = 0.25, even though it should be 1. Is there a benefit to this approach in comparison to the average of pairwise cosine similarities? – D. Phi Dec 14 '20 at 19:53
  • @D.Phi my comment "no vector is entirely zeros" was referring to instances like your second case. The Cosine Similarity is similarly undefined for all-zero vectors (i.e. it requires unit vectors, or normalizable ones at least). – GeoMatt22 Dec 14 '20 at 20:13
  • @D.Phi as for the size: As noted in my answer, the sum of squares of singular values by definition equals the sum of squares of the matrix entries (squared Frobenius norm). If you input a $Y$ with normalized (unit-vector) columns, then the sum of squares of $Y$'s entries will be the number of columns. This is the max you can get for the square of $Y$'s first singular value. (If you do not normalize the columns first, this would not hold.) As for "What is the benefit of this approach?", as I mentioned in my initial comment to OP, this is simply one plausible approach to generalizing CosSim. – GeoMatt22 Dec 14 '20 at 20:19