
I have a dataset of feature vectors generated from movie subtitles, something like:

          Comedy     Disaster  Romance...
Movie1  0.037283    0.28866    0.36253
Movie2  ...................

I want to use cosine similarity. Before computing it, I tried scaling the vectors by row and, separately, normalising the data by column, and the similarity results are different. I don't understand where the difference comes from. Is there a paper I can have a look at?

I used 'scale()' to scale the vectors by row, and tried 'preprocessing.MinMaxScaler()' to normalise the data by each column; I am not sure if that is correct. Could someone explain the difference and which method is better for my case?
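
To make the comparison concrete, here is a minimal sketch of what the two preprocessing steps appear to do. It assumes that `scale()` means scikit-learn's `preprocessing.scale` applied with `axis=1` (standardizing each row to mean 0 and standard deviation 1) and that `MinMaxScaler()` is used with its defaults (rescaling each column to [0, 1]). Both are re-implemented in plain NumPy as stand-ins, and every row except Movie1 is made up for illustration.

```python
import numpy as np

# Rows are movies, columns are features (Comedy, Disaster, Romance).
X = np.array([
    [0.037283, 0.28866, 0.36253],  # Movie1, values from the question
    [0.20, 0.10, 0.45],            # hypothetical Movie2
    [0.50, 0.05, 0.10],            # hypothetical Movie3
])

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Row-wise standardization: each row gets mean 0 and std 1
# (assumed equivalent of preprocessing.scale(X, axis=1)).
row_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Column-wise min-max scaling: each column is mapped onto [0, 1]
# (assumed equivalent of MinMaxScaler().fit_transform(X)).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_minmax = (X - col_min) / (col_max - col_min)

cos_raw = cosine(X[0], X[1])
cos_row_std = cosine(row_std[0], row_std[1])
cos_col_minmax = cosine(col_minmax[0], col_minmax[1])

print("raw data:           ", cos_raw)
print("rows standardized:  ", cos_row_std)
print("columns min-max:    ", cos_col_minmax)
```

Under these assumptions the three numbers differ. Row-wise standardization subtracts each row's mean, so the cosine of the transformed rows is the Pearson correlation between the original rows. Column-wise min-max scaling shifts and stretches each feature separately, which changes the direction of every row vector.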

Many thanks.

ttnphns
Cecilia
  • The two vectors between which you are computing the similarity must be normalized. Not standardized, only normalized to unit sum of squares. – ttnphns Jul 20 '19 at 21:02
  • @ttnphns isn't the cosine similarity already calculated with the normalized versions of them? – gunes Jul 20 '19 at 21:04
  • @gunes, yes, it is the sum of cross-products of such vectors. Actually, that is what I meant. – ttnphns Jul 20 '19 at 22:03
  • @ttnphns I calculated cosine similarity between two rows, so I need to normalize them by row, right? But should I use scale() or preprocessing.MinMaxScaler()? – Cecilia Jul 21 '19 at 09:40
  • @ttnphns and why must they be normalized, not standardized? According to the cosine similarity formula, it should be between -1 and 1, and all the similarities I got are already between 0 and 1. Does that mean the numbers are already normalised? Is it necessary to normalise by row again? – Cecilia Jul 21 '19 at 10:15
  • Normalization of _sum-of-squares_ (SS) to 1 (i.e. L2 normalization). Compute SS in a vector (row) and multiply all its values by 1/sqrt(SS). After both vectors are normalized this way, compute the sum of their cross-products; that is the cosine. – ttnphns Jul 21 '19 at 12:07
  • Or you can compute the cosine directly bypassing the normalization (the normalization is "hidden" in the formula of cosine similarity) https://stats.stackexchange.com/a/22520/3277 – ttnphns Jul 21 '19 at 12:08
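
The recipe in the last two comments can be checked with a short sketch. The helper names `l2_normalize` and `cosine_direct` are illustrative, not from any library, and the second row is made up; only Movie1 comes from the question.

```python
import numpy as np

def l2_normalize(v):
    """Rescale a vector so that its sum of squares (SS) equals 1."""
    v = np.asarray(v, dtype=float)
    ss = np.sum(v ** 2)        # sum of squares of the row
    return v / np.sqrt(ss)     # multiply every value by 1/sqrt(SS)

def cosine_direct(a, b):
    """Cosine similarity from the usual formula, no prior normalization."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

movie1 = [0.037283, 0.28866, 0.36253]  # row from the question
movie2 = [0.20, 0.10, 0.45]            # hypothetical second row

# Route 1: L2-normalize both rows, then take the sum of cross-products.
cos_via_norm = float(np.dot(l2_normalize(movie1), l2_normalize(movie2)))

# Route 2: apply the cosine formula to the raw rows.
cos_formula = float(cosine_direct(movie1, movie2))

print(cos_via_norm, cos_formula)  # the two routes agree up to rounding
```

Both routes give the same value, so an explicit row-wise L2 normalization before calling a cosine function is harmless but redundant. Since all feature values here are non-negative, the cosine falls between 0 and 1; that range is a consequence of the sign of the data and does not by itself show that the rows were already normalized.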

0 Answers