
Let's say I have two variables, A and B, and I want to compute the correlation between them. I have measures of each variable for the same 25 subjects. On top of that, for variable B I have data for an extra 10 subjects (so for variable A, n = 25, and for variable B, n = 35). When I run the correlation analysis, I'll include only the 25 subjects the two variables have in common. Before running the correlation, I want to test the normality assumption for each variable. My question is: when I test the assumption for variable B, do I include the full 35 subjects or only the 25 I plan to include in the correlation analysis?

Thanks, FBH

Jeremy Miles
  • Why do you want to test your data for normality? – Stephan Kolassa Sep 07 '21 at 16:48
  • [Normality testing is less helpful than one might hope.](https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless) – Dave Sep 07 '21 at 16:50
  • I was told by my teacher that I need to do so... so even if it's useless, I still have to do it... – FastBallooningHead Sep 07 '21 at 16:51
  • "Teacher, what is the purpose if a high p-value is not evidence of normality (it isn't) and a low p-value does not indicate practical significance (it doesn't)?" That doesn't get you out of having to do your homework question, but that would be worth asking. (I am curious to know your teacher's response.) – Dave Sep 07 '21 at 16:55
  • Makes sense... but let's say that we replace correlation with some test that actually requires the normality assumption to be met. What would the answer to my question be? – FastBallooningHead Sep 07 '21 at 17:01
  • How close to normal do your distributions have to be? For instance, is it an issue that the computer on which you perform the calculations cannot handle irrational numbers and, therefore, cannot deal with a perfectly normal distribution? // I think it is valid to ask this question about which points to include in the kind of graphical analysis described in my link. – Dave Sep 07 '21 at 17:04
  • I would simply like to run Shapiro-Wilk, and if the result isn't significant, I'll say it's normal. – FastBallooningHead Sep 07 '21 at 17:06
  • Then you are committing a (very common) statistical error in interpreting a large p-value. That this mistake is common does not change the fact that it is a mistake. – Dave Sep 07 '21 at 17:08
  • That's fine, it's just an assignment. So do I run Shapiro-Wilk on the group of 35 or the group of 25 for variable B? – FastBallooningHead Sep 07 '21 at 17:22
  • Perhaps you can edit the original post to include your arguments for using the restricted set of $25$ and the full set of $35$. – Dave Sep 07 '21 at 17:24
  • I don't have much logic for either. I've just not encountered this situation before, and I'm wondering what to do. I guess my inclination would be to use the set of 25 since that's what's ultimately going into the correlation analysis, which requires the normality assumption be met (*or so we're accepting for this case*) – FastBallooningHead Sep 07 '21 at 17:29

1 Answer


I can see arguments for each.

The case for $25$ is the easy one to argue. You are calculating the correlation using only those $25$ points. Who cares what wacky behavior the other $10$ points have? They are not part of the analysis.
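As a rough illustration of the $25$-point option, here is a minimal Python sketch. The data are simulated and purely hypothetical (nothing here comes from the original post), and the names `a`, `b_all`, and `b_common` are made up for the example. It assumes the first 25 entries of B belong to the subjects who also have a value for A.

```python
import numpy as np
from scipy import stats

# Hypothetical simulated data; the original post gives no actual values.
rng = np.random.default_rng(0)
n_common, n_extra = 25, 10

# Variable B for all 35 subjects; the first 25 are assumed to be the shared ones.
b_all = rng.normal(loc=50, scale=10, size=n_common + n_extra)
# Variable A exists only for the 25 shared subjects.
a = 0.5 * b_all[:n_common] + rng.normal(scale=8, size=n_common)

# Restrict B to the subjects that actually enter the correlation.
b_common = b_all[:n_common]

# Shapiro-Wilk on exactly the data used in the analysis.
w_a, p_a = stats.shapiro(a)
w_b, p_b = stats.shapiro(b_common)

# Pearson correlation on the same 25 paired observations.
r, p_r = stats.pearsonr(a, b_common)

print(f"Shapiro-Wilk A (n={len(a)}): W={w_a:.3f}, p={p_a:.3f}")
print(f"Shapiro-Wilk B (n={len(b_common)}): W={w_b:.3f}, p={p_b:.3f}")
print(f"Pearson r={r:.3f}, p={p_r:.3f}")
```

With real data, the slice `b_all[:n_common]` would be replaced by whatever matching on subject ID identifies the shared subjects.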

Using all $35$ can also be defended. If you believe all $35$ points come from the same population (likely based on your knowledge of the experiment), you can tighten up your analysis by using the larger sample size. If you run a formal hypothesis test, you will have greater power to reject a false null hypothesis.
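The power argument can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it assumes every observation comes from the same non-normal population (a chi-square distribution with 4 degrees of freedom, chosen arbitrarily), and `rejection_rate` is a hypothetical helper written for this example. It estimates how often Shapiro-Wilk rejects normality at the 5% level for samples of size 25 and 35.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rejection_rate(n, n_sims=2000, alpha=0.05):
    """Fraction of simulated samples of size n where Shapiro-Wilk rejects normality."""
    rejections = 0
    for _ in range(n_sims):
        # Assumed non-normal population: chi-square with 4 degrees of freedom.
        sample = rng.chisquare(df=4, size=n)
        _, p = stats.shapiro(sample)
        rejections += int(p < alpha)
    return rejections / n_sims

power_25 = rejection_rate(25)
power_35 = rejection_rate(35)

print(f"Estimated power at n=25: {power_25:.3f}")
print(f"Estimated power at n=35: {power_35:.3f}")
```

The gain from the extra 10 points depends on how far the population is from normal, and it only applies if the extra subjects really do come from the same population as the other 25.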

Dave