I'm trying to understand, in layman's terms, what the Pearson correlation coefficient actually measures and how it's used as a measure of similarity between two vectors. I don't have much of a statistics background, but from what I've gathered so far, what this metric does is...

  1. pair together the components at each index to produce a set of 2D points.

    vector A = (1, 6, 5, 9)
    vector B = (3, 8, 4, 8)
                |  |  |  |
                |  |  |  '-- T=(9, 8)
                |  |  '----- R=(5, 4)
                |  '-------- E=(6, 8)
                '----------- W=(1, 3)
    
  2. fit a straight line to those points.

    9|                .'
    8|           E  .' T 
    7|            .'   
    6|          .'      
    5|        .'        
    4|      .' R           
    3| W  .'            
    2|  .'              
    1|.'                
     +-------------------
       1 2 3 4 5 6 7 8 9
    
  3. quantify the proximity of those points to the fitted line, where the proximity of larger points contributes more than that of smaller points. The quantity ranges from 0.0 (loose proximity) to 1.0 (tight proximity), and is negated if the slope of the fitted line is negative.

    vector A = (0, 1, 1, 0, 5,  2, 1, 6)
    vector B = (1, 0, 2, 0, 10, 0, 2, 14)
    
            r = ~0.94
    
    15|               /    
      |             ●/  
    13|             /    
      |            /    
    11|           /
      |          /●
     9|         /
      |        /
     7|       /
      |      /
     5|     /
      |    /
     3|   /
      |  /●              
     1|●/   ●             
      |●  ●
      +----------------------------------
          1   3   5   7   9   11  13  15
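
The three steps above can be sketched in plain Python. This is only a rough illustration, assuming an ordinary least-squares line for step 2 and the usual textbook formula for Pearson's r in step 3 (co-variation of the two vectors divided by the product of their spreads). The helper names `fit_line` and `pearson_r` are made up for this sketch, and both assume neither vector is constant.

```python
import math


def fit_line(a, b):
    """Least-squares slope and intercept of the line b = slope * a + intercept."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    slope = sxy / sxx
    return slope, mean_b - slope * mean_a


def pearson_r(a, b):
    """Pearson's r: co-variation of a and b, normalised by each one's spread."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    syy = sum((y - mean_b) ** 2 for y in b)
    return sxy / math.sqrt(sxx * syy)


vec_a = (1, 6, 5, 9)
vec_b = (3, 8, 4, 8)

points = list(zip(vec_a, vec_b))           # step 1: W, E, R, T as 2D points
slope, intercept = fit_line(vec_a, vec_b)  # step 2: straight line through them
r = pearson_r(vec_a, vec_b)                # step 3: a single number in [-1, 1]

print(points)            # [(1, 3), (6, 8), (5, 4), (9, 8)]
print(slope, intercept)  # roughly 0.68 and 2.18
print(r)                 # roughly 0.85
```

In this formulation r and the slope share the same numerator, so r is negative exactly when the fitted slope is negative. For the eight-component vectors in step 3, `pearson_r` comes out at roughly 0.94.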
    

Assuming I'm somewhat correct, the Pearson correlation coefficient will ignore jitter in small components while checking whether big components scale linearly.

   vector A = (0, 1, 1, 0, 5,  2, 1, 6)
   vector B = (1, 0, 2, 0, 10, 0, 2, 14)

                                  :
                                  :
                            :     :
                            :     :
        .     :             :     :
        :     :             :     :
  . .   : : . :     .   : . :   : :
0 1 2 3 4 5 6 7     0 1 2 3 4 5 6 7
vecA components     vecB components
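
One way to probe this claim is to look at how much each index contributes to the numerator of r. The sketch below assumes the textbook formula, in which index i contributes (a_i - mean of A) * (b_i - mean of B); the variable names are made up for illustration.

```python
vec_a = (0, 1, 1, 0, 5, 2, 1, 6)
vec_b = (1, 0, 2, 0, 10, 0, 2, 14)

mean_a = sum(vec_a) / len(vec_a)   # 2.0
mean_b = sum(vec_b) / len(vec_b)   # 3.625

# Each index contributes (a_i - mean_a) * (b_i - mean_b) to the numerator of r.
contributions = [(x - mean_a) * (y - mean_b) for x, y in zip(vec_a, vec_b)]
total = sum(contributions)

for i, c in enumerate(contributions):
    print(f"index {i}: a={vec_a[i]:>2} b={vec_b[i]:>2} contribution={c}")

# Share of the numerator that comes from the two big components.
big_share = (contributions[4] + contributions[7]) / total
print(f"indices 4 and 7 together: {big_share:.1%}")   # about 76%
```

The two large components (indices 4 and 7) account for roughly three quarters of the numerator, which fits the intuition. Note, though, that "big" here means far from each vector's mean rather than large in absolute value: index 3, where both components are 0, contributes more than index 2, where they are 1 and 2.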

In the example above, the larger components scale roughly 2x from vecA to vecB, so the Pearson correlation coefficient (PCC) gives back a result close to 1.0. The "similarity" being measured here is how linearly the large components scale between the two vectors. Is this correct?
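
A small experiment can test the "scales roughly 2x" reading. The sketch below assumes the textbook formula for Pearson's r (the helper `pearson_r` and the derived vectors are made up for illustration) and compares vecA against exact linear transformations of itself, then rescales vecB.

```python
import math


def pearson_r(a, b):
    """Textbook Pearson's r; assumes neither sequence is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    syy = sum((y - mean_b) ** 2 for y in b)
    return sxy / math.sqrt(sxx * syy)


vec_a = (0, 1, 1, 0, 5, 2, 1, 6)
vec_b = (1, 0, 2, 0, 10, 0, 2, 14)

doubled = [2 * x for x in vec_a]       # every component scaled exactly 2x
shifted = [2 * x + 3 for x in vec_a]   # scaled 2x, then offset by a constant
flipped = [-2 * x for x in vec_a]      # scaled by a negative factor
rescaled_b = [10 * y for y in vec_b]   # vecB with a different overall scale

r_doubled = pearson_r(vec_a, doubled)        # 1.0 up to rounding
r_shifted = pearson_r(vec_a, shifted)        # 1.0 up to rounding
r_flipped = pearson_r(vec_a, flipped)        # -1.0 up to rounding
r_original = pearson_r(vec_a, vec_b)         # roughly 0.94
r_rescaled = pearson_r(vec_a, rescaled_b)    # same as r_original up to rounding

print(r_doubled, r_shifted, r_flipped, r_original, r_rescaled)
```

Under this formula the particular factor (2x, 10x) and any constant offset do not change r; only how closely the points follow some straight line, and the sign of its slope, does.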

kjetil b halvorsen
offbynull

0 Answers