I have encountered a problem with Bayesian linear regression models, which I describe below. I hope that someone can give me a better understanding of Bayesian models or suggest a possible fix.
Let's say that we have two different functions (for simplicity, a quadratic and a linear one):
$f_1(x) = a_1 + a_2x + a_3x^2$
$f_2(x) = b_1 + b_2x$
Now, we want to infer the parameters ($b_1, b_2$) of the second function from data generated by the first one via Bayesian linear regression. A small noise term is known and fixed for this example. This is my result (my MATLAB code is attached below):
As can be seen, the posterior distributions of both parameters are very narrow and confident. However, the predictive distribution does not fit the data points very well. I would have assumed that the parameter uncertainty would be much larger in order to compensate for the model mismatch (namely the missing quadratic term). Does anyone know why my intuition does not match the results, and is there a possible fix/workaround? Thank you.
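For reference, the conjugate Gaussian update that the attached code implements is, if I have the standard derivation right,

$S = \left(S_0^{-1} + \tfrac{1}{\sigma_n^2}Z^\top Z\right)^{-1}, \qquad m = S\left(S_0^{-1}m_0 + \tfrac{1}{\sigma_n^2}Z^\top y\right)$

with predictive variance $\sigma_n^2 + \phi(x_*)^\top S\,\phi(x_*)$ at a test input $x_*$, where $Z$ is the design matrix of the linear model, $m_0 = 0$, $S_0 = I$, and $\sigma_n^2$ is the fixed noise variance. Note that $S$ depends only on the inputs and the noise variance, not on the targets $y$.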
% - 2022-02-04
%% - code
% - cleanup
clear;
close all;
clc;
% - fix random number generator
seed = 29057;
rng(seed);
% - first function (ground truth, unknown)
fcn1.p = [1, 2, 3]';
fcn1.basefcn = @(x) [ones(size(x, 1), 1), x, x.^2];
fcn1.np = length(fcn1.p);
% - second function
fcn2.basefcn = @(x) [ones(size(x, 1), 1), x];
fcn2.np = 2;
% - data points
sn2 = 1;
nd = 100;
X = sort(randn(nd, 1));
y = fcn1.basefcn(X)*fcn1.p + sqrt(sn2)*randn(nd, 1);
% - prior
M0 = zeros(fcn2.np, 1);
S0 = eye(fcn2.np);
iS = inv(S0);
% - posterior
Z = fcn2.basefcn(X);
S = inv(iS + Z'*Z/sn2);
M = S*(iS*M0 + Z'*y/sn2); %#ok<*MINV>
% - grid
grid0.n = 1000;
grid0.x = linspace(min(X), max(X), grid0.n)';
grid0.y = fcn1.basefcn(grid0.x)*fcn1.p;
grid0.ym = fcn2.basefcn(grid0.x)*M;
grid0.ys = sqrt(sn2 + sum((fcn2.basefcn(grid0.x)*S).*fcn2.basefcn(grid0.x), 2)); % diagonal of the predictive covariance, without forming the full n-by-n matrix
grid1.n = 1000;
grid1.min = min(M0(1)-5*sqrt(S0(1, 1)), M(1)-5*sqrt(S(1, 1)));
grid1.max = max(M0(1)+5*sqrt(S0(1, 1)), M(1)+5*sqrt(S(1, 1)));
grid1.p1 = linspace(grid1.min, grid1.max, grid1.n)';
grid1.prior = normpdf(grid1.p1, M0(1), sqrt(S0(1, 1)));
grid1.posterior = normpdf(grid1.p1, M(1), sqrt(S(1, 1)));
grid2.n = 1000;
grid2.min = min(M0(2)-5*sqrt(S0(2, 2)), M(2)-5*sqrt(S(2, 2)));
grid2.max = max(M0(2)+5*sqrt(S0(2, 2)), M(2)+5*sqrt(S(2, 2)));
grid2.p2 = linspace(grid2.min, grid2.max, grid2.n)';
grid2.prior = normpdf(grid2.p2, M0(2), sqrt(S0(2, 2)));
grid2.posterior = normpdf(grid2.p2, M(2), sqrt(S(2, 2)));
% - plot
figure(1);
subplot(2, 2, 1:2);
plot(grid0.x, grid0.y, 'g-');
hold on;
grid on;
plot(X, y, 'ko');
plot(grid0.x, grid0.ym + 2*grid0.ys, 'r-');
plot(grid0.x, grid0.ym - 2*grid0.ys, 'r-');
title('Predictive distribution');
xlabel('x');
ylabel('y');
legend('Ground truth', 'Data points', 'Predictive distribution (\mu\pm2\sigma)', 'Location', 'northwest');
subplot(2, 2, 3);
scatter(fcn1.p(1), 0, 'g', 'filled');
hold on;
grid on;
plot(grid1.p1, grid1.prior/max(grid1.prior), 'b-');
plot(grid1.p1, grid1.posterior/max(grid1.posterior), 'r-');
title('First parameter');
xlabel('b_1');
ylabel('p(b_1)');
legend('Ground truth', 'Prior', 'Posterior', 'Location', 'northwest');
subplot(2, 2, 4);
scatter(fcn1.p(2), 0, 'g', 'filled');
hold on;
grid on;
plot(grid2.p2, grid2.prior/max(grid2.prior), 'b-');
plot(grid2.p2, grid2.posterior/max(grid2.posterior), 'r-');
title('Second parameter');
xlabel('b_2');
ylabel('p(b_2)');
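In case it helps reproduction, here is an equivalent sketch of the posterior computation in Python/NumPy (a hypothetical port of the script above; NumPy's generator differs from MATLAB's `rng`, so exact numbers will not match, but the qualitative result — a very narrow posterior — is the same):

```python
import numpy as np

rng = np.random.default_rng(29057)

a = np.array([1.0, 2.0, 3.0])   # ground-truth quadratic coefficients
sn2 = 1.0                        # known, fixed noise variance
nd = 100

# data from the quadratic function
X = np.sort(rng.standard_normal(nd))
Phi1 = np.column_stack([np.ones(nd), X, X**2])
y = Phi1 @ a + np.sqrt(sn2) * rng.standard_normal(nd)

# linear model with prior N(M0, S0)
Z = np.column_stack([np.ones(nd), X])
M0 = np.zeros(2)
S0 = np.eye(2)
iS0 = np.linalg.inv(S0)

# conjugate posterior update
S = np.linalg.inv(iS0 + Z.T @ Z / sn2)   # posterior covariance
M = S @ (iS0 @ M0 + Z.T @ y / sn2)       # posterior mean

print("posterior mean:", M)
print("posterior stds:", np.sqrt(np.diag(S)))
```

With 100 points the posterior standard deviations come out around 0.1, i.e. far narrower than the prior's 1.0, even though the linear model cannot represent the data.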
Edit: A follow-up post can be found here.