Regression to the mean in "Thinking, Fast and Slow"

Question

In Thinking, Fast and Slow, Daniel Kahneman poses the following hypothetical question:

(P. 186) Julie is currently a senior in a state university. She read fluently when she was four years old. What is her grade point average (GPA)?

His intention is to illustrate how we often fail to account for regression to the mean when making predictions about certain statistics. In the subsequent discussion, he advises:

(P. 190) Recall that the correlation between two measures—in the present case reading age and GPA —is equal to the proportion of shared factors among their determinants. What is your best guess about that proportion? My most optimistic guess is about 30%. Assuming this estimate, we have all we need to produce an unbiased prediction. Here are the directions for how to get there in four simple steps:

1. Start with an estimate of average GPA.
2. Determine the GPA that matches your impression of the evidence.
3. Estimate the correlation between reading precocity and GPA.
4. If the correlation is .30, move 30% of the distance from the average to the matching GPA.

My interpretation of his advice is as follows:

1. Use "She read fluently when she was four years old" to establish a standard score for Julie's reading precocity.
2. Determine a GPA that has a corresponding standard score. (The rational GPA to predict would correspond to this standard score if the correlation between GPA and reading precocity were perfect.)
3. Estimate what percentage of variations in GPA can be explained by variations in reading precocity. (I assume he is referring to the coefficient of determination with "correlation" in this context?)
4. Because only 30% of the standard score of Julie's reading precocity can be explained by factors that can also explain the standard score of her GPA, we are only justified in predicting that the standard score of Julie's GPA will be 30% of what it would be in the case of perfect correlation.

Is my interpretation of Kahneman's procedure correct? If so, is there a more formal mathematical justification of his procedure, especially step 4? In general, what is the relationship between the correlation between two variables and changes/differences in their standard scores?

score 8 · Answer 1 · answered Jan 10 '16 at 14:13

The order of your steps does not match the Kahneman quote, so it seems you may be missing the overall point.

Kahneman's step 1 is the most important. It means literally estimating the average GPA, for everyone. This is your anchor: any prediction you give should be expressed as a departure from this anchor point. I'm not sure I see this step in any of your points!

Kahneman uses an acronym, WYSIATI: "what you see is all there is". This is the human tendency to overestimate the importance of the information currently available. The information about reading ability makes many people think Julie is smart, and so they guess the GPA of a smart person.

But a child's behavior at four contains very little information about adult behavior. You are probably better off ignoring it when making predictions; it should sway you from your anchor by only a small amount. Also, people's first guess of a smart person's GPA can be very inaccurate. Due to selection, the majority of college seniors are of above-average intelligence.

There is actually some other hidden information in the question besides Julie's reading ability at four years old:

Julie is likely to be a female name
She is attending a state university
She is a senior

I suspect all three of these characteristics raise the average GPA slightly compared to the overall student population. For example, I bet seniors have a higher average GPA than sophomores, because students with very bad GPAs drop out.

So Kahneman's procedure (as a hypothetical) would go something like this:

1. The average GPA for a female senior in a state university is 3.1.
2. I guess that, based on Julie's advanced reading ability at 4, her GPA is 3.8.
3. I guess reading ability at 4 years old has a correlation of 0.3 with GPA.
4. Then 30% of the way from 3.1 to 3.8 is about 3.3 (i.e. 3.1 + (3.8 - 3.1) * 0.3 = 3.31).

So in this hypothetical the final guess for Julie's GPA is 3.3.
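The arithmetic in step 4 can be sketched in a few lines of Python. The function name `regressed_prediction` and its parameter names are made up for illustration, and the numbers are the hypothetical guesses above, not real GPA statistics.

```python
def regressed_prediction(baseline, intuitive_guess, correlation):
    """Move `correlation` of the way from the baseline toward the intuitive guess.

    baseline        -- the anchor, e.g. the average GPA (step 1)
    intuitive_guess -- the GPA that "matches the evidence" (step 2)
    correlation     -- the guessed correlation between predictor and outcome (step 3)
    """
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return baseline + correlation * (intuitive_guess - baseline)


# Hypothetical inputs from the worked example: all three are guesses.
julie = regressed_prediction(baseline=3.1, intuitive_guess=3.8, correlation=0.3)
print(round(julie, 2))  # 3.31
```

With a correlation of 0 the function returns the baseline unchanged, and with a correlation of 1 it returns the intuitive guess, which matches the two extremes of Kahneman's procedure.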

The regression to the mean in Kahneman's approach comes from the fact that step 2 is likely to grossly overestimate the importance of the information available. A better strategy is therefore to regress our prediction back toward the overall mean. Steps 3 and 4 are (ad hoc) ways to estimate how much to regress.

I understand the intuition behind the procedure, but not the mathematical justification. My interpretation is that the point of estimating the average GPA is to allow one to estimate specific GPAs in terms of standard scores; otherwise, they could not be meaningfully compared to reading precocity. (Cont.) — Rations, Jan 10 '16 at 15:50
Kahneman mentions that most people guess GPA = 3.7 or 3.8, which probably corresponds with the standard score they associate with Julie's reading precocity, but also implicitly assumes that the correlation between the two variables is perfect. I am mainly confused about whether step 4 is an intuition-based rule of thumb or a real, statistically valid procedure (i.e., can one treat standard scores additively and take proportions of them based on the correlation?). If it is merely a layman's rule of thumb, does there exist a more statistically rigorous method of approximation? — Rations, Jan 10 '16 at 15:58
By "additively", I am referring to our assumption that (1) some proportion of Julie's standard score GPA is explained by factors that can also explain her reading precocity, that (2) the remaining proportion of her standard score GPA is explained by factors unique to explaining GPA, that (3) these contributions summed equals the final standard score GPA we predict for Julie, and that (4) we can correct our prediction by simply taking a proportion of our biased prediction. Is working with proportions of standard deviations like this—as opposed to, say, working with their square roots—valid? — Rations, Jan 10 '16 at 16:08
It is an ad hoc rule. Steps two and three are not necessarily logically consistent with one another. (They are two different ways of expressing the same information: one is an effect size and the other is a standardized effect size.) — Andy W, Jan 10 '16 at 17:33

score 3 · Accepted Answer · answered Jan 10 '16 at 22:28

Is my interpretation of Kahneman's procedure correct?

This is a bit hard to say, because Kahneman's step #2 is not formulated very precisely. "Determine the GPA that matches your impression of the evidence": what exactly is that supposed to mean? If somebody's impressions are well calibrated, then there will be no need to correct towards the mean. If somebody's impressions are grossly off, then they should correct even more strongly.

So I agree with @AndyW that Kahneman's advice is only a rule of thumb.

That said, if you interpret Kahneman's step #2 as you did in steps #1 and #2 of your interpretation, i.e. you take the GPA with the same $z$-score as the $z$-score of reading precocity as "matching your impression of the evidence", then your procedure is exactly mathematically correct and not a rule of thumb.

[...] is there a more formal mathematical justification of his procedure, especially step 4? In general, what is the relationship between the correlation between two variables and changes/differences in their standard scores?

If you predict $y$ from $x$, and both are converted into $z$-scores (i.e. have zero mean and unit variance) and have correlation $\rho$ with each other, then it can easily be shown that the regression equation will be $$y=\rho x,$$ i.e. the regression coefficient will be equal to the correlation coefficient.
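
As a sketch of why, assume an ordinary least-squares fit of a line $y = \alpha + \beta x$ (the symbols $\alpha$, $\beta$, $\sigma_x$, $\sigma_y$, $\bar{x}$, $\bar{y}$ are my notation, not Kahneman's). The least-squares solution is $$\beta = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} = \rho\,\frac{\sigma_y}{\sigma_x}, \qquad \alpha = \bar{y} - \beta\,\bar{x}.$$ For $z$-scores we have $\bar{x} = \bar{y} = 0$ and $\sigma_x = \sigma_y = 1$, so $\beta = \rho$ and $\alpha = 0$. Written in the original units, the same line is $$\frac{y - \bar{y}}{\sigma_y} = \rho\,\frac{x - \bar{x}}{\sigma_x},$$ i.e. the predicted standard score of $y$ is $\rho$ times the standard score of $x$.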

From here it immediately follows that if you know the value of $x$ (e.g. you know the standard score of the reading precocity), then the predicted value of $y$ (standard score of GPA) will be $\rho$ times that.

This is exactly what is called "regression to the mean". You can see some formulas and derivations in the discussion on Wikipedia.
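
If you want to check this numerically, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption made for illustration: the helper names, the normally distributed data, and the made-up "reading" and "GPA" scales (which are not clipped to a realistic range). It generates pairs with a population correlation of 0.3, then compares the sample correlation on the raw scales with the least-squares slope fitted on $z$-scores.

```python
import random
import statistics


def simulate_pairs(rho, n=20_000, seed=0):
    """Draw n (x, y) pairs of standard normals with population correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)   # factor driving both variables
        unique = rng.gauss(0.0, 1.0)   # factor affecting y only
        xs.append(shared)
        ys.append(rho * shared + (1.0 - rho ** 2) ** 0.5 * unique)
    return xs, ys


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def to_z_scores(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


xs, ys = simulate_pairs(rho=0.3)
reading = [100 + 15 * x for x in xs]   # hypothetical reading-precocity scale
gpa = [3.1 + 0.4 * y for y in ys]      # hypothetical GPA scale

r = pearson_r(reading, gpa)
slope = ols_slope(to_z_scores(reading), to_z_scores(gpa))
print(f"correlation on raw scales: {r:.3f}")
print(f"slope on z-scores:         {slope:.3f}")
```

The two printed numbers should agree up to floating-point error, whatever scales are chosen for the raw variables. Note that the slope is the correlation $\rho$ itself, not the coefficient of determination $\rho^2$.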
