
This question is based on David Robinson's excellent illustration of using the Beta distribution to estimate a (baseball) batter's season batting average in the middle of a season.

To recap, the proportion of hits to at bats (aka batting average) is assumed to be (conditionally) Beta distributed with parameter values $\alpha = 81$ and $\beta = 219$, estimated using maximum likelihood with respect to the final observed batting averages in the Lahman dataset*. Then, an estimate of a batter's batting ability in the middle of the season is given by calculating the posterior distribution based on updating the Beta prior with these parameters:

$$\mbox{Beta}(\alpha_0+\mbox{hits}, \beta_0+\mbox{misses}),$$

so that, if a batter in midseason had 100 hits out of 300 at bats, their season batting average $\mu$ could be approximated by the expected value of the posterior by:

$$\hat{\mu} = \frac{81 + 100}{81 + 100 + 219 + 200} \approx 0.302.$$
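The update above is just arithmetic on the prior counts; a quick sketch (in Python rather than the R used in the book, and using the prior parameters quoted from the old CV post):

```python
# Beta-binomial update for a mid-season batting average.
# Prior parameters are the values quoted from the CV post.
alpha0, beta0 = 81, 219     # Beta prior fit to historical batting averages
hits, at_bats = 100, 300
misses = at_bats - hits

alpha_post = alpha0 + hits      # 181
beta_post = beta0 + misses      # 419

# Posterior mean as the point estimate of the batting average.
posterior_mean = alpha_post / (alpha_post + beta_post)
print(round(posterior_mean, 3))  # 0.302
```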

My question is: on what basis do we know that this approximation is good or valid? As far as I can tell, the original estimates of the Beta prior's parameters at least have some guarantee of 'goodness' via the MLE method, but I am not sure what other metrics apply to our mid-season prediction. Taking this to the extreme, one could apply this update methodology right up to the second-to-last game of the season; I suspect that is unlikely to produce a better prediction than simply using the batter's empirical average for the season so far.

*In his excellent book, David shows how to write down regression equations for $\alpha$ and $\beta$ so that the estimated parameters, conditional on covariates, maximise the beta log-likelihood with respect to the final historical batting averages. The parameters estimated are different but I use the values from the old CV post for consistency.

Alex

1 Answer


I haven't read the book, but maybe the distinction here is that the batter's empirical average is a point estimate, whereas the Beta distribution gives an entire distribution of possible probabilities of getting a hit. In your example, you chose to use the mean of the Beta posterior as the point estimate, but you could just as easily have used some other estimator, such as the mode of the posterior Beta distribution.
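To make that concrete, here is a small comparison (my own illustration, using the Beta(181, 419) posterior from the question) of two point estimates drawn from the same posterior:

```python
# Two different point estimates from the same Beta posterior.
a, b = 81 + 100, 219 + 200      # posterior parameters from the question

mean = a / (a + b)              # posterior mean
mode = (a - 1) / (a + b - 2)    # posterior mode (valid when a, b > 1)

# The two estimators give slightly different answers:
print(round(mean, 4), round(mode, 4))
```

With this much data the two estimates are close, but they are not identical, and nothing in the posterior itself tells you which one to report.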

From this perspective, the question "On what basis do we know that this approximation is good/valid?" becomes a question of decision theory: how do we choose the best point estimate given the posterior distribution of hit probabilities? The answer depends on the loss function you specify. For example, if you were playing guess-the-batting-average under price-is-right rules (guess a value too high and you get nothing), then intuitively you should guess a lower batting average than if you were simply trying to get as close as possible to the probability of getting a hit.

The posterior mean is the optimal decision under squared error loss in the decision-theoretic sense. So if your loss function actually is squared error, then on that basis the approximation is valid.
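This can be checked numerically (a sanity check, not a proof): among a grid of candidate guesses, the one minimizing the expected squared error over draws from the posterior lands essentially on the posterior mean.

```python
# Numerical check that the posterior mean minimizes expected squared error.
import numpy as np

rng = np.random.default_rng(0)
draws = rng.beta(181, 419, size=200_000)   # samples from the posterior

candidates = np.linspace(0.25, 0.35, 201)  # candidate point estimates
risks = [np.mean((draws - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(risks))]

print(round(best, 3))  # close to the posterior mean, 181/600
```

Under a different loss (e.g. the asymmetric price-is-right loss above), the minimizing candidate would shift away from the mean.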

Finally, you may have noticed that if you use an improper prior with $\alpha_0=\beta_0=0$ then the batter's empirical average exactly matches the mean of the posterior. One could argue that in some sense this reflects the fact that if you have little prior information, a good guess at the batter's probability of getting a hit is the current empirical average.
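That special case is immediate from the update formula: with $\alpha_0 = \beta_0 = 0$ the prior counts vanish and only the observed hits and misses remain.

```python
# With an improper Beta(0, 0) prior, the posterior mean reduces to the
# empirical average: (0 + hits) / (0 + hits + 0 + misses) = hits / at_bats.
hits, at_bats = 100, 300
alpha0 = beta0 = 0  # improper prior

posterior_mean = (alpha0 + hits) / (alpha0 + hits + beta0 + (at_bats - hits))
empirical = hits / at_bats

print(posterior_mean == empirical)  # True
```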

AtALoss