Bayesian inference with simple models and many data points: impact of priors and the number of data points

Question

The setting. Let us assume I would like to perform Bayesian inference for a low-dimensional model on a large dataset, i.e., there are many more data points than there are parameters to identify. Let us moreover assume - as is always the case in practice - that the model is imperfect: it does not describe reality exactly.

The outcome. It is known that for many data points, the evidence introduced by the data often largely outweighs the influence of the prior. The posterior of the parameters will become more and more narrow with a growing number of data points, even if the model may not fit the data very well and no new information is gained from these new datapoints!

Example. We want to estimate the parameters $a$ and $b$ in the linear model $$ y = ax + b, $$ with priors $p(a)$ and $p(b)$ given, where $x(k)$ and $y(k)$ are discrete-time measurements of two slowly time-varying processes. $x(t)$ and $y(t)$ are approximately, but not exactly, related by the linear model given above: In "reality", $$ y = ax + b + c \cdot\sin(\omega t), $$ where the term $c\cdot\sin(\omega t)$ is not included in the regression model above and hence represents the model error / misspecification in this example.

Consider the following two test cases with different sampling rates ($f_2=10f_1$): The parameter estimates for both test cases are almost the same, but the estimated variances in case 2 are about one third of the variances in case 1. (Code + detailed results below).

Question. Intuitively, I would expect a "reasonable inference procedure" to yield the same result for both settings, including the variances: Essentially, no new information relevant to the given, low-dimensional model is gained from the additional samples in case 2. While I technically understand how it happens, this influence of the sampling rate / number of samples on the inference result appears undesired / defective. How can I fix this? To me, this appears to be an artifact of the inference procedure assuming the model to be perfect. Can I include some kind of "model uncertainty prior" to make the inference yield the same result in both cases, "recognizing" that no new information is gained?

Details on the example (Matlab).

t1 = 0:.01:2*pi; t2 = 0:.001:2*pi;
x = @(t) sin(t) + 0.1*t;
y = @(x, t) x + 0.1 + sin(10*t);
x1 = x(t1); x2 = x(t2);
y1 = y(x1, t1); y2 = y(x2, t2);

subplot(2, 2, 1);
scatter(t1, x1);
xlabel('t1'); ylabel('x1');
subplot(2, 2, 2);
scatter(t1, y1);
xlabel('t1'); ylabel('y1');
subplot(2, 2, 3);
scatter(t2, x2);
xlabel('t2'); ylabel('x2');
subplot(2, 2, 4);
scatter(t2, y2);
xlabel('t2'); ylabel('y2');

n_params = 1;
prior = bayeslm(n_params, 'ModelType', 'conjugate', 'Mu', [0, 0.5], 'V', eye(2));
prior.A = 0.001; prior.B = 0.001;
posterior1 = estimate(prior, x1', y1');
posterior2 = estimate(prior, x2', y2');

Outputs:

Method: Analytic posterior distributions
Number of observations: 629
Number of predictors:   2
Log marginal likelihood: -1316.96

           |  Mean     Std         CI95        Positive        Distribution       
----------------------------------------------------------------------------------
 Intercept | 0.1099  0.0871  [-0.061,  0.281]    0.897   t (0.11, 0.09^2,6.3e+02) 
 Beta      | 0.9679  0.1325  [ 0.708,  1.228]    1.000   t (0.97, 0.13^2,6.3e+02) 
 Sigma2    | 3.6909  0.2088  [ 3.304,  4.122]    1.000   IG(314.50, 0.00086)      


Method: Analytic posterior distributions
Number of observations: 6284
Number of predictors:   2
Log marginal likelihood: -8303.38

           |  Mean     Std         CI95        Positive        Distribution       
----------------------------------------------------------------------------------
 Intercept | 0.1095  0.0130  [ 0.084,  0.135]    1.000   t (0.11, 0.01^2,6.3e+03) 
 Beta      | 0.9698  0.0198  [ 0.931,  1.009]    1.000   t (0.97, 0.02^2,6.3e+03) 
 Sigma2    | 0.8182  0.0146  [ 0.790,  0.847]    1.000   IG(3142.00, 0.00039)

score 1 · Accepted Answer · answered Jul 16 '21 at 23:00

The problem is that linear models assume each data point is independent (i.i.d), while in your example data points are not independent -- they are drawn serially from the same underlying process. A plot of the residuals vs t1 or t2 will reveal a distinct autocorrelated pattern, which is a clear giveaway for this phenomenon.

You've provided an excellent example of how such non-independence can affect standard errors while generating unbiased coefficient estimates. In effect you are 'fooling' the model into thinking it has more independent information than it does. It is generally a huge problem for timeseries and spatial regression models where data are inherently autocorrelated.

The reduction in standard error by 1/3 with a 10-fold increase in sampling makes perfect sense since S.E. scales with 1/sqrt(n).

As for remediation, a common technique is to model the autocorellation-generating process directly, either as a predictor or in the error term. Gaussian process, ARIMA and spatial error models are some examples.

Ahh yes, of course, you're right! Thanks a lot for the late answer! — jhin, Jul 17 '21 at 11:25

Bayesian inference with simple models and many data points: impact of priors and the number of data points

1 Answers1