In fact, p-values are now finally 'out of fashion' as well: http://www.nature.com/news/psychology-journal-bans-p-values-1.17001. Null hypothesis significance testing (NHST) produces little more than a description of your sample size.(*) Any experimental intervention will have some effect, which is to say that the simple null hypothesis of 'no effect' is always false in a strict sense. Therefore, a 'non-significant' test simply means that your sample size wasn't big enough; a 'significant' test means you collected enough data to 'find' something.
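To see this in action, here's a quick simulation sketch (Python with numpy/scipy; the true effect of 0.03 is an arbitrary, deliberately negligible choice). With any nonzero effect, 'significance' is purely a matter of collecting enough data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.03  # minuscule but nonzero -- the strict null is always false

for n in (100, 1_000, 10_000, 100_000):
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    result = stats.ttest_1samp(sample, popmean=0.0)
    verdict = "significant" if result.pvalue < 0.05 else "not significant"
    print(f"n = {n:>7,}:  p = {result.pvalue:.4f}  ({verdict})")
```

The same negligible effect sails past p < 0.05 once n gets large enough.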
The 'effect size' represents an attempt to remedy this by introducing a measure on the natural scale of the problem. In medicine, where treatments always have some effect (even if only a placebo effect), the notion of a 'clinically meaningful effect' is introduced to guard against the following: since the true effect is never exactly zero, it is positive or negative with roughly even odds, so an arbitrarily large study has about a 50% prior probability of finding 'a (statistically) significant positive effect,' however minuscule.
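Continuing the simulation sketch above (Python again; the d ≥ 0.2 benchmark is just Cohen's conventional 'small' effect), the p-value collapses as the study grows, while the effect size stays put near its true, negligible value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=0.03, scale=1.0, size=1_000_000)  # huge study, tiny true effect

p = stats.ttest_1samp(x, popmean=0.0).pvalue
d = x.mean() / x.std(ddof=1)  # Cohen's d for a one-sample design

print(f"p = {p:.2e}")          # 'significant' by any conventional threshold
print(f"Cohen's d = {d:.3f}")  # far below even Cohen's 'small' benchmark of 0.2
```

A 'clinically meaningful effect' threshold asks whether d clears a bar set on substantive grounds, a question no p-value can answer.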
If I understand the nature of your work, Clarinetist, then at the end of the day, its legitimate aim is to inform actions/interventions that improve education in the schools under your purview. Thus, your setting is a decision-theoretic one, and Bayesian methods are the most appropriate (and uniquely coherent[1]) approach.
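To spell out what 'decision-theoretic' buys you: you have actions on the table, losses attached to outcomes, and a posterior to average over; the Bayes decision is the action minimizing posterior expected loss. A toy sketch (Python; the posterior draws, the loss function, and the cost figure are all invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior draws for an intervention's effect on test scores;
# in practice these would come from your fitted Bayesian model.
effect_draws = rng.normal(loc=2.0, scale=3.0, size=10_000)

cost = 1.5  # invented cost of the intervention, in the same score-point units

# Posterior expected loss of each action; the Bayes decision minimizes it.
loss_if_intervene = cost - effect_draws   # pay the cost, reap the (uncertain) effect
loss_if_do_nothing = np.zeros_like(effect_draws)

print("intervene:  ", loss_if_intervene.mean())
print("do nothing: ", loss_if_do_nothing.mean())
```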
Indeed, the best way to understand frequentist methods is as approximations to Bayesian methods. The estimated effect size can be understood as aiming at a measure of centrality for the Bayesian posterior distribution, while the p-value can be understood as aiming to measure one tail of that posterior. Thus, together these two quantities contain some rough gist of the Bayesian posterior that constitutes the natural input to a decision-theoretic outlook on your problem. (Alternatively, a frequentist confidence interval on the effect size can be understood likewise as a wannabe credible interval.)
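In one textbook case the correspondence is exact, not merely approximate: for a normal mean with known variance under a flat prior, the one-sided p-value equals the posterior probability of a non-positive effect, and the 95% confidence interval coincides with the 95% credible interval. A quick numerical check (Python; the data are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma, n = 1.0, 50
x = rng.normal(loc=0.2, scale=sigma, size=n)

xbar = x.mean()
se = sigma / np.sqrt(n)

# One-sided p-value for H0: effect <= 0 (z-test with known sigma)
p_value = stats.norm.sf(xbar / se)

# A flat prior makes the posterior Normal(xbar, se^2), so the posterior
# probability of a non-positive effect is the very same tail:
posterior_tail = stats.norm.cdf(0.0, loc=xbar, scale=se)

print(p_value, posterior_tail)  # numerically identical

# ...and the 95% confidence interval is the 95% credible interval:
print(stats.norm.interval(0.95, loc=xbar, scale=se))
```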
In the fields of psychology and education, Bayesian methods are actually quite popular. One reason for this is that it is easy to build 'constructs' into Bayesian models as latent variables. You might like to check out 'the puppy book' (Doing Bayesian Data Analysis) by John K. Kruschke, a psychologist. In education (where you have students nested in classrooms, nested in schools, nested in districts, ...), hierarchical modeling is unavoidable. And Bayesian models are great for hierarchical modeling, too. On this account, you might like to check out Gelman & Hill [2].
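Just to show the shape of such a model, here is a minimal hierarchical sketch in PyMC (my assumption that you'd reach for PyMC; the data and nesting structure are fabricated purely for illustration):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(3)

# Fabricated nesting: 600 students in 30 classrooms in 6 schools.
n_schools, n_classes, n_students = 6, 30, 600
school_of_class = rng.integers(n_schools, size=n_classes)
class_of_student = rng.integers(n_classes, size=n_students)
school_of_student = school_of_class[class_of_student]
scores = rng.normal(70, 10, size=n_students)  # stand-in outcome data

with pm.Model():
    mu = pm.Normal("mu", 70, 20)               # grand mean score
    sd_school = pm.HalfNormal("sd_school", 10)
    sd_class = pm.HalfNormal("sd_class", 10)
    sd_obs = pm.HalfNormal("sd_obs", 10)

    school_eff = pm.Normal("school_eff", 0, sd_school, shape=n_schools)
    class_eff = pm.Normal("class_eff", 0, sd_class, shape=n_classes)

    # Partial pooling at both levels: each student's expected score is the
    # grand mean plus their school's and classroom's deviations.
    theta = mu + school_eff[school_of_student] + class_eff[class_of_student]
    pm.Normal("y", theta, sd_obs, observed=scores)

    idata = pm.sample()  # NUTS samples the whole nested structure jointly
```

The point of the partial pooling is that classroom and school effects shrink toward each other in proportion to the evidence, exactly the behavior Gelman & Hill motivate at length.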
[1]: Robert, Christian P. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. 2nd ed. Springer Texts in Statistics. New York: Springer, 2007.
[2]: Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press, 2007.
For more on 'coherence' from a not-necessarily-beating-you-on-the-head-with-a-Bayesian-brick perspective, see [3].
[3]: Robins, James, and Larry Wasserman. “Conditioning, Likelihood, and Coherence: A Review of Some Foundational Concepts.” Journal of the American Statistical Association 95, no. 452 (December 1, 2000): 1340–46. doi:10.1080/01621459.2000.10474344.
(*) In [4], Meehl scourges NHST far more elegantly, but no less abrasively, than I do:
Since the null hypothesis is quasi-always false, tables summarizing research in terms of patterns of “significant differences” are little more than complex, causally uninterpretable outcomes of statistical power functions.
[4]: Meehl, Paul E. “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology.” Journal of Consulting and Clinical Psychology 46 (1978): 806–34. http://www3.nd.edu/~ghaeffel/Meehl(1978).pdf
And here's a related quote from Tukey: https://stats.stackexchange.com/a/728/41404