
Consider a simple linear regression model, e.g. $y_i = \alpha + \beta x_i + \epsilon_i$, written in probabilistic terms: $$ \mu_i = \alpha + \beta x_i $$ $$ y_i \sim \mathcal{N}(\mu_i, \sigma) $$

We assume prior distributions on the parameters $\alpha, \beta, \sigma$ and apply Bayes' theorem (posterior $\propto$ likelihood $\times$ prior), i.e.:

$$f(\alpha, \beta, \sigma | Y, X) \propto \prod_{i=1}^{n} \mathcal{N}(y_i; \alpha + \beta x_i, \sigma) f_\alpha(\alpha)f_\beta(\beta) f_\sigma (\sigma)$$

Given the priors and the data, we then use a Markov chain Monte Carlo (MCMC) method to sample from this posterior distribution.
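
For concreteness, here is a minimal sketch of this setup in Python; the simulated data, the priors, and the Metropolis proposal scale are all arbitrary choices for illustration, and the prior is placed on $\log\sigma$ purely for convenience:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data (values chosen arbitrarily for illustration)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def log_posterior(theta):
    """Unnormalised log posterior: log-likelihood plus log-priors."""
    alpha, beta, log_sigma = theta
    sigma = np.exp(log_sigma)                      # work with log(sigma) so sigma stays positive
    mu = alpha + beta * x
    log_lik = stats.norm.logpdf(y, loc=mu, scale=sigma).sum()
    log_prior = (stats.norm.logpdf(alpha, 0, 10)   # vague Gaussian priors on alpha, beta
                 + stats.norm.logpdf(beta, 0, 10)
                 + stats.norm.logpdf(log_sigma, 0, 2))  # prior placed on log(sigma) in this sketch
    return log_lik + log_prior

# Random-walk Metropolis sampler
theta = np.zeros(3)
log_p = log_posterior(theta)
draws = []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.05, size=3)
    log_p_prop = log_posterior(proposal)
    if np.log(rng.uniform()) < log_p_prop - log_p:  # Metropolis accept/reject step
        theta, log_p = proposal, log_p_prop
    draws.append(theta)

draws = np.array(draws)[5_000:]                     # discard burn-in
print(draws.mean(axis=0))                           # posterior means of (alpha, beta, log sigma)
```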

My question is:

  • How do loss functions, or different loss functions, fit into this picture? How do loss functions come into the next stage of taking a decision based on the posterior distribution?
Mike Tauber
    Loss functions come into the next stage, where you take a decision based on the posterior distribution, using the loss function to help decide which decision is optimal. – Henry Nov 23 '21 at 09:08
  • That is useful thanks @Henry – Mike Tauber Nov 23 '21 at 10:05
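
To illustrate the point in the comment above: once you have posterior draws, different loss functions single out different summaries of the posterior as the optimal decision. A toy sketch, where `beta_draws` is just a placeholder standing in for MCMC output:

```python
import numpy as np

# Pretend these are posterior draws of beta from the sampler above
# (here just placeholder draws from a normal distribution)
beta_draws = np.random.default_rng(1).normal(loc=2.0, scale=0.3, size=10_000)

# Squared-error loss  -> optimal point estimate is the posterior mean
print(beta_draws.mean())

# Absolute-error loss -> optimal point estimate is the posterior median
print(np.median(beta_draws))

# Asymmetric linear loss, e.g. over-estimation 3x as costly as under-estimation,
# -> optimal point estimate is the corresponding posterior quantile (here the 25th percentile)
print(np.quantile(beta_draws, 0.25))
```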

1 Answer


How do loss functions or different loss functions fit into this picture? Do different loss functions result in a different mechanism for calculating likelihood?

This is how I read your question: we can fit a regression by minimizing different loss functions, so how do those loss functions apply to Bayesian regression?

Regression, in a non-Bayesian setting, can be fitted by minimizing a loss or by maximizing the likelihood. The two approaches are equivalent: for example, minimizing squared loss is equivalent to maximizing a Gaussian likelihood, minimizing absolute loss is equivalent to using a Laplace distribution for the likelihood, etc. The same holds in Bayesian regression: you are not minimizing a loss to fit the model, but the choice of likelihood function plays a similar role (see the sketch below).
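
As a quick numerical check of this equivalence (simulated data and fixed scale parameters, just for illustration): least squares and the Gaussian maximum-likelihood fit recover the same coefficients, and likewise for absolute loss and a Laplace likelihood.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

def sq_loss(theta):
    return np.sum((y - (theta[0] + theta[1] * x)) ** 2)

def neg_gauss_loglik(theta):
    return -stats.norm.logpdf(y, loc=theta[0] + theta[1] * x, scale=1.0).sum()

def abs_loss(theta):
    return np.sum(np.abs(y - (theta[0] + theta[1] * x)))

def neg_laplace_loglik(theta):
    return -stats.laplace.logpdf(y, loc=theta[0] + theta[1] * x, scale=1.0).sum()

fit = lambda f: optimize.minimize(f, x0=[0.0, 0.0], method="Nelder-Mead").x
print(fit(sq_loss), fit(neg_gauss_loglik))     # same (alpha, beta) up to optimiser tolerance
print(fit(abs_loss), fit(neg_laplace_loglik))  # same (alpha, beta) up to optimiser tolerance
```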

The same applies to regularization. In the Bayesian setting, regularization is handled by choosing appropriate priors; for example, an $\ell_2$ penalty is equivalent to using Gaussian priors on the parameters.
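
A small check of the ridge/Gaussian-prior correspondence (again with made-up data): the ridge solution with penalty $\lambda$ coincides with the posterior mode under independent $\mathcal{N}(0, \sigma^2/\lambda)$ priors on the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
sigma = 0.5
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=200)

lam = 10.0   # ridge penalty

# Ridge estimate: argmin ||y - X b||^2 + lam * ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Posterior mode (MAP) with independent N(0, tau^2) priors, where tau^2 = sigma^2 / lam
tau2 = sigma**2 / lam
beta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(3) / tau2, X.T @ y / sigma**2)

print(beta_ridge)
print(beta_map)   # identical to beta_ridge
```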

Tim