
Students in linear regression courses are taught that more data is good. They're taught that checking assumptions is good. They're taught that the Shapiro-Wilk test is good. Then they're taught that having a lot of data and testing a normality assumption with the Shapiro-Wilk test is bad, because the test is too good at detecting deviations from normality when the sample is large. When you look at the mathematics of the test, this makes perfect sense. But if you take a step back from the mathematics, you may begin to realize how utterly farcical the situation we've created is. It's like the punchline of a joke about mathematicians and assumptions.

In a similar vein to the famous aphorism "all models are wrong, but some are useful," I think we can also admit that "most assumptions in statistics are wrong, but some are close enough." With that in mind, is there a way to quantify the practical relevance of a test result, so that it tells us whether an assumption is "close enough"? I'd like to avoid the farce of "we can't trust our tests, because they're too accurate."

EDIT 2: This question is being misunderstood as a question about model assumptions or normality assumptions, but it is not. Here's an attempt to clarify the question. The power of statistical tests is the ability of the test to detect deviation from the null hypothesis. The power of tests increases with sample size. If the sample size is huge, even a trivial deviation from the null could lead to "statistically significant" results. If the effect size is small, we may not care that the test is statistically significant. This is not a problem for tests that are performed within a model, e.g. a t-test on a coefficient in OLS, because we also have an estimate of the effect size.
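
To illustrate that last point with a quick sketch (the numbers and the use of `scipy.stats.linregress` are arbitrary choices, just for concreteness): in a regression, the estimated coefficient is itself an effect size, so a "significant but tiny" effect is easy to recognize.

```python
# Sketch with made-up numbers: in OLS the effect size (the coefficient) is
# reported alongside the p-value, so a "significant but negligible" effect is
# easy to spot.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.005 * x + rng.normal(size=n)      # true slope is tiny

fit = stats.linregress(x, y)
print(f"slope   = {fit.slope:.4f}")     # the effect size, on the order of 0.005
print(f"p-value = {fit.pvalue:.2e}")    # typically "significant" at this n
```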

In many other statistical tests, we do not have an estimate of the effect size. Moreover, we already know that the null hypothesis in many tests is false (e.g. we know that real data are never perfectly normal, and that two populations are never perfectly identical), and what we really care about is the effect size. If the sample size is small or medium, we can get by without an effect size estimate: because the power of the test is low or moderate, it won't detect small departures from the null. In that situation, statistical significance can be read as a rough proxy for a sizeable effect, because that's the only kind of effect a low-power test could have detected. That scheme breaks down when the sample is large, because the power is then so high that any tiny departure from the null is detected and declared statistically significant. In that case, statistical significance contributes nothing to our understanding of the problem we're trying to address.
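
Here is a sketch of the breakdown I mean (the choices are arbitrary; a t-distribution with 10 degrees of freedom stands in for "close to, but not exactly, normal"):

```python
# Sketch: Shapiro-Wilk rejection rates for data that are close to, but not
# exactly, normal. Watch the rejection rate grow with the sample size even
# though the underlying distribution never changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, reps, df = 0.05, 200, 10

for n in (50, 500, 5000):   # scipy warns about p-value accuracy beyond n = 5000
    rejections = 0
    for _ in range(reps):
        x = rng.standard_t(df, size=n)
        if stats.shapiro(x).pvalue < alpha:
            rejections += 1
    print(f"n = {n:5d}: rejection rate = {rejections / reps:.2f}")
```

Nothing in that output tells me whether the deviation from normality is big enough to matter; that is the gap I am asking about.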

The only purpose of my question is to determine whether there are existing ways to estimate the effect size for such tests, so that they can still be used with large samples.

AJV
    Is Shapiro-Wilk good? For what purpose, checking the normality of residuals? That is considered [barely important at all](https://stats.stackexchange.com/questions/152674/why-is-the-normality-of-residuals-barely-important-at-all-for-the-purpose-of-e/152681#152681) by many. – alan ocallaghan Nov 05 '21 at 14:55
  • Why should anyone test "normality assumptions"? – Stephan Kolassa Nov 05 '21 at 15:42
  • So that we might understand your question better, could you name a specific test that has no corresponding "effect size"? Indeed, it will be important to explain what you mean by "effect size," because your ultimate question makes little sense in light of standard definitions: the effect size is a property of the underlying population or data generation process, *not* of any sample or sample size. – whuber Nov 05 '21 at 18:16
  • @whuber I have distribution-comparison tests in mind. Shapiro-Wilk, Kolmogorov-Smirnov, Wilcoxon Rank-Sum... They all test null hypotheses that we already know are almost certainly not literally true. This makes them useless for large samples, because at high power they reject for any trivial departure from the null and don't quantify the departure. I'm looking for a test that provides some quantitative estimate of the extent to which the distributions differ. I struggle to accept that such a rigorous field has no response to this beyond "look at a graph. If they look the same, they are." – AJV Nov 05 '21 at 18:39
  • @AJV KS explicitly measures the vertical distance between the empirical and theoretical CDF. That would be an effect size, agreed? – Dave Nov 05 '21 at 18:41
  • @Dave, It seems like that could be an analogue of effect size, yes. If it is a reliable effect size that works for large samples, then why isn't it used more often? I don't know the answers to any of these questions; I've just heard "plot it and check a graph" or "you can't use Shapiro-Wilk with large samples, you can only check a Q-Q plot and make an educated guess." These seem like strangely informal things to be taught in graduate-level statistics courses if there are formal methods to address the issues. – AJV Nov 05 '21 at 18:50
  • You still seem to be confounding sample-size effects, testing size, power, and effect size. That makes it difficult to determine what you might mean by "the issues." There's nothing really special about the distributional and nonparametric tests you mention: as @Dave suggests, those tests have natural effect sizes and, like *all* hypothesis tests, become more powerful with larger sample sizes. It's hard to know what you might mean by "strangely informal things:" that sounds like a misunderstanding or a straw man argument. – whuber Nov 05 '21 at 19:26
  • @Dave It is possible that my question stems from a lack of understanding of these tests more than anything else. I was taught that Shapiro-Wilk is a good test to check for normality on a small sample, but that it can't be used on large samples and we should look at plots instead. I heard similar comments regarding K-S, but did not go over the K-S test in detail. Is Shapiro-Wilk's inappropriateness for large samples unique, and have I mistakenly taken it for a general rule for distribution-comparison tests? Because honestly, it does look like the distance in K-S provides the information I wanted. – AJV Nov 05 '21 at 20:17
  • @AJV You might be interested in [a discussion on here about whether normality testing is essentially useless.](https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless) // In some sense, those criticisms of normality testing apply to all hypothesis testing. – Dave Nov 05 '21 at 20:57

2 Answers


You're describing measures of effect size rather than statistical significance. Significance is often used to describe the likelihood that an effect is exactly zero. With enough data, you will often find that effects are not truly zero, but are too small to actually care about. For example, one could conduct an enormous clinical trial to show that Drug A is statistically significantly better than Drug B, indicating that the drugs do not have identical efficacy. But if you look at the difference in survival, you might find that Drug A improves survival times by 5 minutes: the significance value allows you to conclude that the survival difference is almost certainly not 0 minutes, but examination of the survival difference shows no real practical benefit.

There are many ways to quantify effect size depending on the corresponding statistical test, including differences in survival, differences in mean/median, fold-changes between values, etc. Effect sizes let you quantify the size of the differences observed, rather than just checking whether they are plausibly zero. What counts as a "meaningful" effect size is highly domain-dependent, however.
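
Here is a rough sketch of that clinical-trial point with made-up numbers (the 0.1-unit shift, the standard deviation, and the sample size are all arbitrary):

```python
# Sketch with made-up numbers: a trivially small true difference becomes
# "highly significant" once the sample is large, but the effect size stays tiny.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000
a = rng.normal(loc=100.0, scale=15.0, size=n)   # Drug A
b = rng.normal(loc=100.1, scale=15.0, size=n)   # Drug B: +0.1 units on average

res = stats.ttest_ind(a, b)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

print(f"p-value   = {res.pvalue:.2e}")   # typically far below 0.05
print(f"Cohen's d = {cohens_d:.4f}")     # on the order of 0.01: practically nothing
```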

As we've discussed in the comments, there's no single measure of effect size for normality tests, since a distribution can deviate from normality in an infinite number of ways. Effect sizes typically capture differences in a single parameter (mean, survival time, etc.), but there is no single parameter that captures normality. One can quantify particular aspects of a distribution, like skewness or kurtosis, and compare them against those of a true normal distribution, but I'm not aware of a single effect measure that quantifies how non-normal a distribution is.
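
If it helps, here is a sketch of the kind of descriptive quantities one could report alongside (or instead of) a normality test's p-value; none of them is an agreed-upon "effect size" for non-normality, and each captures only one aspect of the deviation:

```python
# Sketch: descriptive "how non-normal is it?" quantities. Each captures only
# one aspect of the deviation from normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.standard_t(5, size=100_000)   # mildly heavy-tailed example data

skew = stats.skew(x)                  # 0 for a normal distribution
exkurt = stats.kurtosis(x)            # excess kurtosis, 0 for a normal distribution
# Kolmogorov-Smirnov distance to the best-fitting normal (maximum vertical gap
# between the empirical CDF and the fitted normal CDF). The p-value from this
# call is not valid with estimated parameters, but only the distance is used here.
ks_stat, _ = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(f"skewness        = {skew:+.3f}")
print(f"excess kurtosis = {exkurt:+.3f}")
print(f"KS distance     = {ks_stat:.4f}")
```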

Nuclear Hoagie
  • My question pertains to the quantification of effect sizes when the test is NOT part of a model. As in the question, the Shapiro-Wilk test is virtually sure to reject the normality assumption on real data as the sample grows, but it does NOT include any estimate of effect size: there is no intuitive way that I see to interpret *how much* the data deviate from the assumption, even though that's what we truly want to know. I'm asking if there is a way to derive something analogous to an effect size for such a test. – AJV Nov 05 '21 at 14:44
  • @AJV Good question. It seems there is no widely accepted measure of effect size for normality tests, since there's no single dimension of "effect" to measure on; see https://stats.stackexchange.com/questions/289136/is-the-shapiro-wilk-test-w-an-effect-size. I suppose you could use a qualitative, heuristic measure of effect by looking at the distribution and asking "does this look normal to me?" Imprecise, but if a distribution looks normal by eye, one could argue it's a small effect compared to an obviously skewed distribution, for example. – Nuclear Hoagie Nov 05 '21 at 14:57
  • Yes, and that is what people tend to do, I think. Sticking to checks of normality, if the data set is very large, they will just inspect a Q-Q plot. This works well, but it is not formal. I just find it unusual that there are so many tests that seem to fail on large samples, and there is no alternative except rules of thumb and visual inspection. – AJV Nov 05 '21 at 15:17
  • @AJV You could quantitatively measure *aspects* of normality like skewness or kurtosis, and get numerical values which could be compared against those of a truly normal distribution. With lots of data, you might find that the SEs around the skewness and kurtosis values suggest non-normality, but that the values themselves are very close to a normal distribution's. I think this would be a reasonable measure of effect size for particular ways in which a distribution might be non-normal, but it wouldn't capture all possible deviations from normality. – Nuclear Hoagie Nov 05 '21 at 15:28
  • Some of the difficulty with effect size when it comes to full distribution comparisons is that there are so many ways for the distribution to differ from the expectation. If you want to check something about the mean, say $\mu = \mu_0$, then either $\mu=\mu_0$, $\mu>\mu_0$, or $\mu<\mu_0$. – Dave Nov 05 '21 at 15:33

Some things to know about model assumptions:

  1. Models are thought constructs and are by construction different from the real world. No model assumption is ever fulfilled, so I'm with Box on this one. Models are tools for thinking.
  2. This also implies that no model is ever, in any well-defined sense, "approximately fulfilled". As real data are never truly i.i.d., for example, finding a small distance, such as the Kolmogorov-Smirnov distance, between an empirical and an assumed distribution does not imply that the model assumption is approximately fulfilled.
  3. That a certain method has a model assumption means that, under the assumed model, the method was shown to have some favourable properties. This is fine, as models are our favourite tool for making sure a method makes sense. It does not imply at all that the model must be true, or even approximately true, for the method to work.
  4. Unfortunately there are situations in which a method will give misleading results, so we can't just rely on a method that is optimal under a certain model, or even a method that has been derived without parametric assumptions (the latter will still assume i.i.d. or similar, which does not hold in reality). What is of real importance is not making sure that the model assumptions (approximately) hold, but rather checking whether there is any indication that the result of the method will be misleading.
  5. Model-based theory helps to some extent with this, as it gives us an idealised idea of when the method will get it right. It can also give us ideas about when a method can get it wrong, but this requires analysing what happens if the method is applied to data from certain wrong models. Robustness theory does some of this.
  6. Whether a method will be misleading or not is an issue that is essentially different from what "distance"/misspecification test value we get under a formal test of a certain assumption. Quite generally, the danger of a misleading result when using a normality-based method on real data is not proportional to how strongly such tests indicate non-normality. The problem with Shapiro-Wilk is not that it is "too good for large samples", but rather that diagnosing whether the data will mislead a model-based method is a task essentially different from testing a certain distributional shape. For example, most normality-based methods are just fine with distributions with light tails, even though such distributions may look very non-normal, whereas a t-distribution with low degrees of freedom is much harder to detect but will mess up most normality-based inference (a rough simulation sketch of this contrast follows after this list). Unfortunately it is hard to give general guidelines, as what is problematic depends on the method to be applied, but also on how results are interpreted. In my view it is a general problem in statistics that model assumptions are taken too literally, and there isn't much work helping with the real problem, which is diagnosing what may go wrong, rather than rejecting (or not) an assumed model.
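
As a rough illustration of point 6 (my own sketch; the textbook chi-square confidence interval for a variance stands in for "a normality-based method", and other procedures will behave differently):

```python
# Rough sketch: how easily Shapiro-Wilk flags a distribution is not the same
# thing as how badly that distribution damages a normality-based procedure.
# Example procedure: the textbook chi-square CI for a variance, which is known
# to be sensitive to tail weight (kurtosis).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps, alpha = 100, 2000, 0.05
chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)

def simulate(draw, true_var, label):
    sw_reject = coverage = 0
    for _ in range(reps):
        x = draw(n)
        if stats.shapiro(x).pvalue < alpha:
            sw_reject += 1
        s2 = x.var(ddof=1)
        lo, hi = (n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo
        coverage += (lo <= true_var <= hi)
    print(f"{label:10s} Shapiro-Wilk rejection rate {sw_reject / reps:.2f}, "
          f"variance-CI coverage {coverage / reps:.2f} (nominal 0.95)")

# Light tails: very non-normal and easily flagged, yet the interval typically
# stays at or above its nominal coverage.
simulate(lambda m: rng.uniform(-1, 1, m), true_var=1 / 3, label="uniform")
# Heavier tails: flagged less often at this sample size, yet coverage of the
# normal-theory interval typically falls well short of 95%.
simulate(lambda m: rng.standard_t(5, m), true_var=5 / 3, label="t (5 df)")
```
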
Christian Hennig
  • Somebody (Jan de Leeuw?) said we should say *ideal assumptions*, not just assumptions, to underline that these describe only ideal cases, which may make it possible to analyze properties formally, but that should not be thought of as something that must be fulfilled in practice; at best they hold only approximately. But Google gives me nothing ... – kjetil b halvorsen Nov 05 '21 at 21:35