4

I need to identify a relationship between two variables. I have two sets of measures performed by two competing systems, and I would like to compare how close the two systems are to each other.

Here is an example of how the data look (dummy values):

╔════════╦═══════════╦═══════════╗  
║ object ║ measure-1 ║ measure-2 ║  
╠════════╬═══════════╬═══════════╣  
║ obj1   ║         0 ║       120 ║  
║ obj2   ║         1 ║        60 ║
║ obj3   ║         0 ║        20 ║    
║ obj4   ║         4 ║       240 ║  
╚════════╩═══════════╩═══════════╝  

There is no information available on how the measures have been done, and the scale they use, but it is very likely that one is a linear data series and the second is built on a logarithmic scale. I noticed this by plotting the series in a scatterplot. Using linear scale for both X and Y, the relationship is not linear, as the curve grows quickly then flattens while X increases. If I tick the checkbox 'logarithmic scale' on chart axis Y though, the dots are quite aligned.

My first question is whether this tends to show that series X is using a logarithmic scale?

My second question is about finding the appropriate way to check how far series are similar to each other.

My guess is that I have to compute the correlation coefficient between both series (i.e. get a normalized measure of how close the two sets of measures are), but I think it is first needed to process one of the data series to re-scale it appropriately, before using spreadsheet's CORREL function. I have tried applying log() to one of the series and computing the correlation, but it does not look to be providing relevant results (coefficient is near 0.5, despite the strong alignment of the dots).

I have read about Pearson's rank correlation, but I am not sure of how to calculate it in Excel.

mdewey
  • 16,541
  • 22
  • 30
  • 57
user27326
  • 43
  • 1
  • 5
  • 3
    Normally one does not set out in a data analysis to compute a correlation coefficient: one has specific analytical objectives, such as to identify a relationship among variables, or make a prediction, or estimate something. Computing a correlation coefficient is just a technique used to get you there. Whether it should be computed at all, and if so how it should be computed, are questions best decided within that larger context. Would you mind, then, editing your question to share some information about your objectives and what you hope to achieve with this calculation? – whuber Jun 26 '13 at 13:38
  • @whuber thanks I have just edited the post to clarify my goal. – user27326 Jun 26 '13 at 14:05
  • 1
    If you get a scatter on a scatter plot that is approximately straight, Pearson's correlation can be computed. If you have marked curvature, it is likely that a rank correlation is a better descriptor, but the usual rank correlation measures are named for Spearman and Kendall (_not_ Pearson). Either way, calculating a correlation is at best a step towards modelling. – Nick Cox Jun 26 '13 at 14:43
  • It will help you to think in a quantitative way about what you really mean by "close" or "similar." For instance, a Spearman correlation coefficient will measure (to some extent) how *consistent* the two measurements are, but will not really tell you whether they are sometimes far apart and sometimes close together (even after some initial transformation). Articulating the *purpose* of your comparison may help with this thinking. Exactly how do you intend to use your result once you get it? Calibration? Choosing among the two systems? Something else? – whuber Jun 26 '13 at 14:44
  • the purpose is to know whether we can use the second system of measures instead of first one, and get similar results. We kind of need to compare the quality of the second measure system based on the first one. – user27326 Jun 26 '13 at 14:53
  • 1
    This problem is called *calibration.* It often is solved using "inverse regression." This procedure obtains a formula to predict the new measurements in terms of the old, and then inverts it to estimate what the old measurement might have been, given a new measurement. (This obviously is more useful than a mere correlation coefficient.) Some posts on the topic can be found by [searching our site](http://stats.stackexchange.com/search?tab=relevance&q=calibration%20measurement%20regression). A recent discussion occurred at http://stats.stackexchange.com/questions/61989. – whuber Jun 26 '13 at 16:33

1 Answers1

6

Why don't you use the non-parametric Spearman's rank correlation coefficient? This would neatly circumvent the need to worry about data transformation. In essence it is almost the same as normal (Pearson's) correlation, but calculated on the ranks of the values, and not the values themselves. Since log is a monotonics function, the ranks will remain the same no matter whether or not you take logarithms of the values.

Since you are using a spreadsheet, look here.

As for your first question, the problem is I think more complex, as you would have to check whether a log-linear fit is significantly better than a linear fit. I have no idea whether this can be done simply in a spreadsheet. You could, of course, take logarithms of one of the variables and check whether the Pearson's correlation coefficent is better, but this is not really always a good idea.

Nick Cox
  • 48,377
  • 8
  • 110
  • 156
January
  • 6,999
  • 1
  • 32
  • 55
  • Thanks. As my dataset has duplicates, I don't think the method explained in your link will works, based on [Wikipedia's article on the subject](http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient). However if it makes sense I will use themore complex formula to calculate the coefficient. – user27326 Jun 26 '13 at 15:34
  • Then calculate the ranks and then the Pearson's correlation coefficient of the ranks. This is the same as rho. – January Jun 26 '13 at 16:35