
A lot of resources say that if you have over a certain number of cases, you can go ahead and use parametric tests. However, I have been struggling for a while to understand this, and when to test it. For example, I have student achievement data, which has a long tail on the right.

Am I correct in thinking that when the data is skewed to a certain degree, it is not normal and no parametric tests can be used? OR do the assumptions required to conduct parametric tests have nothing to do with the distribution of that particular variable?

Ferdi
Raya M
    Search this site for much written on the subject. The CLT has no protection for type II error. – Frank Harrell Nov 20 '17 at 22:02
    The answer is, unfortunately, "that depends on both the test and the data". For example, with respect to your student achievement data, if your sample size is large enough, sample means of say male / female students will be approximately Normally distributed and a t-test / z-test of difference between means will work well. But "large enough" is situational. – jbowman Nov 20 '17 at 22:06
    This series of questions has substantial overlap with a number of questions already on site; I answered it because I think it touched on just enough issues that it didn't quite constitute a duplicate of the questions I located but it may well close as a duplicate of a pre-existing question if I get time to look harder. – Glen_b Nov 20 '17 at 22:18

1 Answer


A lot of resources say that if you have over a certain number of cases, you can go ahead and use parametric tests.

It depends on exactly what they say, but if they offer a specific number, they're probably wrong.

However, I have been struggling for a while to understand this, and when to test it. For example, I have student achievement data, which has a long tail on the right. Am I correct in thinking that when the data is skewed to a certain degree, it is not normal

Substantial skewness in a sufficiently large random sample would certainly rule out normality.

and no parametric tests can be used?

This common tendency in books (particularly in certain application areas) to equate "parametric tests" only with those that assume normality is flat out wrong.

The term "parametric" essentially means "defined by a fixed number of parameters"; all manner of parametric tests make a parametric distributional assumption that is not an assumption of normality.

If I assume a collection of (right skewed) survival times are drawn from an exponential distribution, that's a parametric assumption and I can use that to obtain a parametric test that has nothing to do with assuming normality. People who use GLMs make non-normal parametric assumptions. When people in finance assume t-distributed errors in GARCH models for log-returns in the stock market, they make a non-normal parametric assumption.
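A minimal sketch of the exponential case, assuming two independent samples of survival times that are each exponentially distributed. Under the null hypothesis of a common mean, the ratio of the two sample means follows an F distribution with (2n1, 2n2) degrees of freedom, so the test is fully parametric without any appeal to normality. The function name `exp_mean_ratio_test` and the simulated data are illustrative assumptions, not anything from the question.

```python
import numpy as np
from scipy import stats

def exp_mean_ratio_test(x, y):
    """Two-sided test of equal means for two exponential samples.

    Assumes both samples are exponential. Under H0 (common mean) the
    ratio of sample means follows F with (2*len(x), 2*len(y)) df.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ratio = x.mean() / y.mean()
    null_dist = stats.f(2 * x.size, 2 * y.size)
    # Two-sided p-value: double the smaller tail area, capped at 1.
    p = 2 * min(null_dist.cdf(ratio), null_dist.sf(ratio))
    return ratio, min(p, 1.0)

# Simulated (hypothetical) survival times with different true means.
rng = np.random.default_rng(0)
a = rng.exponential(scale=1.0, size=200)
b = rng.exponential(scale=2.0, size=200)

ratio, p_value = exp_mean_ratio_test(a, b)
print(f"ratio of means = {ratio:.3f}, p = {p_value:.4g}")
```

The validity of this test rests entirely on the exponential assumption, in the same way a t-test rests on its own assumptions.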

Let's now focus on a better-framed version of the issue:

If my distribution is not normal, does it preclude me using a test that assumes normality?

The answer to that is "not necessarily". It depends on both the test and on the underlying population distribution, as well as how much impact on significance level and power you're prepared to tolerate.

For example: in very large samples, a t-test for a difference in means may have very close to the right significance level, even with reasonably strong population skewness. It may, however, lose power relative to a test based on a more appropriate distributional assumption (which may or may not be of concern to you, depending on the situation).
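One way to check the significance-level claim for a particular situation is by simulation. The sketch below assumes a lognormal population as a stand-in for right-skewed data, and the sample sizes (300 and 500) are arbitrary illustrative choices; it draws both groups from the same population, so every rejection is a type I error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps, n1, n2, alpha = 2000, 300, 500, 0.05

# Both groups come from the same right-skewed (lognormal) population,
# so the null hypothesis of equal means is true in every replicate.
x = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n1))
y = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n2))

# Equal-variance two-sample t-test, one test per simulated data set (row).
p_values = stats.ttest_ind(x, y, axis=1).pvalue

# Fraction of replicates in which the true null is rejected.
rejection_rate = float(np.mean(p_values < alpha))
print(f"empirical type I error at n=({n1}, {n2}): {rejection_rate:.3f}")
```

A rate near 0.05 says the level is about right for this population and these sample sizes; it says nothing about power, and a heavier tail or smaller samples could give a different picture.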

[Of course you may still be able to test means without using that specific distributional assumption if that's what you are interested in; you can even have a nonparametric test for a difference in means if you wish.]
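As a rough illustration of a nonparametric comparison of means, here is a sketch of a two-sample permutation test that uses the difference in sample means as its statistic. Strictly, its null hypothesis is that the two groups share the same distribution (so group labels are exchangeable). The function name `permutation_mean_test`, the number of permutations, and the simulated scores are all illustrative assumptions.

```python
import numpy as np

def permutation_mean_test(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test using the difference in means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        # Randomly reassign group labels and recompute the statistic.
        perm = rng.permutation(pooled)
        diff = perm[:x.size].mean() - perm[x.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Simulated (hypothetical) right-skewed scores with different means.
rng = np.random.default_rng(2)
scores_a = rng.exponential(scale=1.0, size=150)
scores_b = rng.exponential(scale=3.0, size=150)

diff, p_value = permutation_mean_test(scores_a, scores_b)
print(f"difference in means = {diff:.3f}, p = {p_value:.4g}")
```

No distributional family is assumed here; the reference distribution is built from the data by relabelling.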

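On the other hand, with exactly the same population distribution, a variance-ratio F-test might be unsuitable at any sample size, because its sensitivity to non-normality (heavy tails in particular) does not go away as the sample grows.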
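The contrast with the variance-ratio F-test can also be explored by simulation. The sketch below assumes an exponential population as the skewed example and uses a hypothetical helper, `f_test_rejection_rate`; both groups have the same variance, so the rejection rate estimates the type I error at each sample size.

```python
import numpy as np
from scipy import stats

def f_test_rejection_rate(n, reps=1000, alpha=0.05, seed=3):
    """Type I error of the two-sided variance-ratio F-test, by simulation.

    Both samples are exponential with the same scale, so H0 (equal
    variances) is true in every replicate.
    """
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=1.0, size=(reps, n))
    y = rng.exponential(scale=1.0, size=(reps, n))
    ratio = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)
    # Reference distribution that would be correct under normality.
    null_dist = stats.f(n - 1, n - 1)
    p = 2 * np.minimum(null_dist.cdf(ratio), null_dist.sf(ratio))
    return float(np.mean(p < alpha))

for n in (50, 500, 2000):
    print(f"n = {n:5d}: rejection rate = {f_test_rejection_rate(n):.3f}")
```

With a population like this, the rate should stay well above the nominal 0.05 and should not drift toward it as n increases, which is the sense in which no sample size rescues the test.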

OR do the assumptions required to conduct parametric tests have nothing to do with the distribution of that particular variable?

Certainly they have something to do with the population distribution.

(While you didn't raise this issue, I'll mention it for later readers: one thing to consider is that you probably don't want to formally test the distributional assumption.)

Glen_b