
@drob showed a great example of adjusting batting averages using a Beta prior distribution. He used a pre-calculated Beta distribution as a prior to adjust each batting average individually, and

it’s as simple as adding α0 to the number of hits, and α0+β0 to the total number of at-bats

adjusted average = (hits + α0) / (at bats + α0 + β0)
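
For concreteness, here is a minimal R sketch of that adjustment (the prior parameters α0 and β0 below are illustrative values, not the ones estimated in his post):

# Beta-prior adjustment of a single batting average;
# alpha0 and beta0 are illustrative prior parameters, not drob's estimates
alpha0 <- 81
beta0  <- 219

hits    <- 30
at_bats <- 100

raw_average      <- hits / at_bats                                # 0.3
adjusted_average <- (hits + alpha0) / (at_bats + alpha0 + beta0)  # shrunk toward the prior mean
adjusted_average
#> [1] 0.2775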


My question is:

  • How can I do a similar adjustment for a log-normal distribution, using its mean and variance? I have an lm model trained on log-transformed (log-normal) data, and I would like to incorporate prior knowledge to weight/adjust my predictions (I cannot install Stan etc. due to sysadmin limitations)

Context

Let's say I built the following model in 2019, estimating income for different race groups. Come 2021, I do not have comparable data, but I do have some knowledge of what the mean and standard deviation of income look like for each race group.

If I have to stick with some basic adjustment of my 2019 predictions (similar to David Robinson's), I imagine I can simply regularise my previous predictions towards the new income distribution that I am aware of (instead of collecting new data and building a new model).

But how can I do so with a log-normal? How statistically proper is it?

library(readr)
library(broom)
# Read in data
income_mean <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-09/income_mean.csv')
income_mean$log_income_dollars <- log10(income_mean$income_dollars)

# Model the data
lm1 <- lm(log_income_dollars ~ race + year, data = income_mean)

head(augment(lm1))
#> # A tibble: 6 x 9
#>   log_income_doll~ race   year .fitted  .resid    .hat .sigma .cooksd .std.resid
#>              <dbl> <chr> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
#> 1             4.18 All ~  2019    4.97 -0.787  0.00256  0.482 7.63e-4     -1.64 
#> 2             4.61 All ~  2019    4.97 -0.363  0.00256  0.482 1.62e-4     -0.753
#> 3             4.84 All ~  2019    4.97 -0.133  0.00256  0.482 2.18e-5     -0.277
#> 4             5.05 All ~  2019    4.97  0.0741 0.00256  0.482 6.76e-6      0.154
#> 5             5.41 All ~  2019    4.97  0.434  0.00256  0.482 2.32e-4      0.902
#> 6             5.65 All ~  2019    4.97  0.683  0.00256  0.482 5.73e-4      1.42

Created on 2021-09-16 by the reprex package (v1.0.0)

info_seeker
    It depends on exactly what you're doing. Most likely you will not be able to obtain the posterior distribution in a simple or closed form and you will have to perform the integrals somehow. – whuber Sep 16 '21 at 18:37
  • Thanks @whuber for the comment; just updated my question for clarity – info_seeker Sep 16 '21 at 19:43

1 Answer


Those two models have almost nothing in common, so you cannot easily transfer the computation from one to the other. The beta-binomial model is a univariate model for binary data, whereas the linear regression model describes the relation between two or more variables and a continuous outcome.

If you had a single variable following a log-normal distribution, you could use a conjugate prior for its parameters, much as in the beta-binomial model.
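
For instance, if the log-scale variance is treated as known, a normal prior on the log-scale mean is conjugate, and the posterior mean is a precision-weighted average of the prior mean and the sample mean. A minimal sketch, with made-up numbers purely for illustration:

# Conjugate normal prior on the log-scale mean of a log-normal variable,
# assuming the log-scale variance sigma2 is known; all values are illustrative
mu0    <- 4.8    # prior mean of log10 income (prior knowledge)
tau0_2 <- 0.05   # prior variance (how confident we are in mu0)
sigma2 <- 0.25   # assumed known log-scale variance of the data

y <- c(4.7, 5.1, 5.0, 4.8)   # observed log10 incomes
n <- length(y)

# Posterior is normal: precision is the sum of precisions,
# mean is the precision-weighted average of prior mean and sample mean
post_var  <- 1 / (1 / tau0_2 + n / sigma2)
post_mean <- post_var * (mu0 / tau0_2 + n * mean(y) / sigma2)
c(post_mean, post_var)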

For the regression setting, a closer counterpart of the beta-binomial model would be Bayesian logistic regression; similarly, linear regression can be viewed from a Bayesian viewpoint. However, this is not a simple "adjustment": outside of special conjugate cases, these models have no closed-form solution, and estimating them requires optimization, approximate inference, or MCMC sampling. For details, I recommend checking one of the many Bayesian statistics handbooks. The Bayesian paradigm is a completely different way of estimating the parameters of a model, so you do not use it to "adjust" parameters estimated by maximum likelihood; rather, you estimate them from scratch in a different way.
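
To illustrate what the linear-regression counterpart looks like in one of those special conjugate cases (a normal prior on the coefficients with the noise variance treated as known), the posterior mean of the coefficients has a closed, ridge-like form. The sketch below uses simulated data and made-up prior values, and is not a substitute for a full Bayesian analysis:

# Conjugate Bayesian linear regression sketch:
# prior beta ~ N(m0, S0), noise variance sigma2 treated as known
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n))               # intercept + one predictor
y <- X %*% c(5, 0.3) + rnorm(n, sd = 0.5)

sigma2 <- 0.25                        # assumed known noise variance
m0     <- c(4.9, 0)                   # prior mean of the coefficients
S0     <- diag(c(0.1, 1))             # prior covariance of the coefficients

# Standard conjugate formulas: posterior precision and posterior mean
post_prec <- crossprod(X) / sigma2 + solve(S0)
post_mean <- solve(post_prec, crossprod(X, y) / sigma2 + solve(S0) %*% m0)
post_mean                             # shrunk toward m0 relative to the OLS fit

Anything beyond this conjugate special case (unknown variance with non-conjugate priors, logistic regression, and so on) requires the numerical methods mentioned above.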

Tim