I have four datasets: A1, A2, B1, B2. Every dataset has between 100-300 items.

Every item in every dataset has two values: x, y

The goal:

  1. Find which datasets have similar x values.
  2. If two datasets have similar x values, are their correlations between x and y also similar? And vice versa.

Using t-tests on the x values, I found that A1 and A2 are not too different (their means are not significantly different). The same holds for B1 and B2. But each of the A datasets is significantly different from each of the B datasets. As a list:

  • A1.x and A2.x - similar
  • B1.x and B2.x - similar
  • A1.x and (B1.x or B2.x) - different
  • A2.x and (B1.x or B2.x) - different

Now I would like to know whether the correlation between x and y is the same for A1 and A2, and different from the correlation for B1 and B2 (which should again match each other). I calculated these correlations and got:

  • correlation of A1.x and A1.y = 0.487
  • correlation of A2.x and A2.y = 0.460
  • correlation of B1.x and B1.y = 0.598
  • correlation of B2.x and B2.y = 0.610

Main question: What test should I use to measure how significant this similarity / difference in the correlations is? It could still be just a coincidence.

Other question: Is the t-test a good way to assess whether two datasets come from the same process? Should I also do it for the y values in this case?

I hope it is clear what I need. If not, please comment on what is unclear and I will do my best to explain.

matousc

1 Answer

Diedenhofen & Musch (2015, PLoS ONE) discuss various tests for significant differences between measured correlations, with pointers to literature. They also discuss confidence intervals. Unfortunately, the companion cocor package for R was removed from CRAN - apparently it failed automated checks during an R upgrade, and the authors did not address these issues in a timely manner.
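
For correlations measured in independent groups, one standard test is Fisher's z-test, which compares the Fisher-transformed correlations. A minimal sketch in Python follows, using only the standard library. The function name `fisher_z_test` and the sample size `n_assumed = 200` are assumptions for illustration: the question only says that each dataset has between 100 and 300 items, so substitute the real sizes. The test also assumes that the datasets are independent of each other and that the data are roughly bivariate normal.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided Fisher z-test for equality of two independent Pearson correlations.

    r1, r2: sample correlations; n1, n2: sample sizes (each must exceed 3).
    Returns the z statistic and the two-sided p-value.
    """
    # Fisher transformation: atanh(r) is approximately normal
    # with variance 1 / (n - 3).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Correlations reported in the question.
correlations = {"A1": 0.487, "A2": 0.460, "B1": 0.598, "B2": 0.610}

# ASSUMPTION: placeholder sample size, not taken from the question.
n_assumed = 200

for first, second in [("A1", "A2"), ("B1", "B2"), ("A1", "B1"), ("A2", "B2")]:
    z, p = fisher_z_test(correlations[first], n_assumed,
                         correlations[second], n_assumed)
    print(f"{first} vs {second}: z = {z:.3f}, p = {p:.3f}")
```

Note that a non-significant result here only means the data are compatible with equal correlations; it does not show that the correlations are equal.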

Regarding your other question, it depends on what you are interested in. If you are only interested in whether the $x$ distributions have the same mean, a t test is appropriate. (Assuming equal or different variances, as the case may be.) You could also test whether variances are equal, e.g., using an F test. Alternatively, you could use a two-sample Kolmogorov-Smirnov test to assess whether the two samples come from the same underlying distribution.
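
The three checks mentioned above can be sketched with SciPy. This is an illustration under assumptions, not a recipe: the helper name `compare_samples` and the synthetic arrays are made up, and the real `x` vectors would take their place. It uses `scipy.stats.ttest_ind` with `equal_var=False` (Welch's t-test, which does not assume equal variances), a two-sided variance-ratio F test built from `scipy.stats.f`, and `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy import stats


def compare_samples(a, b):
    """Return p-values for three two-sample comparisons of a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    # Welch's t-test: equal means, without assuming equal variances.
    t_p = stats.ttest_ind(a, b, equal_var=False).pvalue

    # Variance-ratio F test (two-sided); sensitive to non-normality.
    f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
    dfa, dfb = len(a) - 1, len(b) - 1
    f_p = 2.0 * min(stats.f.cdf(f_stat, dfa, dfb), stats.f.sf(f_stat, dfa, dfb))
    f_p = min(f_p, 1.0)

    # Two-sample Kolmogorov-Smirnov test: same underlying distribution.
    ks_p = stats.ks_2samp(a, b).pvalue

    return {"welch_t": t_p, "f_variance": f_p, "ks": ks_p}


# ASSUMPTION: synthetic stand-ins for two x vectors, not the asker's data.
rng = np.random.default_rng(0)
x_first = rng.normal(loc=0.0, scale=1.0, size=200)
x_second = rng.normal(loc=1.0, scale=1.0, size=200)

for name, p_value in compare_samples(x_first, x_second).items():
    print(f"{name}: p = {p_value:.4g}")
```

The same helper could be applied to the `y` values if their distributions are also of interest.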

Stephan Kolassa
  • Ok, and what about the meaning of the correlations? Is it enough "to prove" that datasets are probably from the same "origin", or do I need some other test ("to prove" something about correlations)? – matousc May 25 '16 at 12:12
  • You are asking a question that goes to the heart of null hypothesis significance testing. NHST can never *prove* anything. It will always only check whether your data are consistent with a default "null hypothesis" - in your case, that the two $x$ vectors come from the same population, respectively that the population correlations are equal. Yes, this is a problem. You may want to browse through [questions tagged "significance-testing"](http://stats.stackexchange.com/questions/tagged/statistical-significance?sort=votes&pageSize=50), or consider Bayesian approaches. – Stephan Kolassa May 25 '16 at 16:02