13

I recently watched this talk by Eric J. Ma and checked his blog entry, where he quotes Radford Neal saying that Bayesian models do not overfit (though they can), and that when using them we do not need test sets to validate them (to me the quotes seem to be about using a validation set to adjust the parameters, rather than about test sets). Honestly, the arguments do not convince me and I don't have access to the book, so could you give a more detailed and rigorous argument for, or against, such a statement?

By the way, in the meantime Eric Ma has pointed me to this discussion of the same topic.

Tim
  • One major hole in this argument with regard to that talk: if you're doing MCMC and you don't fully explore the posterior, your inference is totally invalid. If you are doing inference on a Bayesian neural network, you almost certainly have not explored very large portions of the posterior using MCMC. Therefore, you'd better split your data to double-check your inference! – Cliff AB Mar 01 '19 at 05:55
  • One thing to consider is what we are evaluating or validating. It may be that we don't use all the information we have (either in the prior or the likelihood); checking model fit can help answer this question. – probabilityislogic May 05 '19 at 10:56

3 Answers

12

If we use "the one true model" and "true priors" reflecting appropriately captured prior information, then as far as I am aware a Bayesian truly does not have an overfitting problem, and the posterior predictive distribution given very little data will be suitably uncertain. However, if we use some kind of pragmatically chosen model (i.e. we have decided that e.g. the hazard rate is constant over time and an exponential model is appropriate, or e.g. that some covariate is not in the model = a point prior of zero on its coefficient) with some default uninformative or regularizing priors, then we really do not know whether this still applies. In that case the choice of (hyper-)priors has some arbitrariness to it that may or may not result in good out-of-sample predictions.

Thus, it is then very reasonable to ask whether the hyperparameter choice (= the parameters of the hyperpriors) in combination with the chosen likelihood will perform well. In fact, you could easily decide that it is a good idea to tune your hyperparameters to obtain some desired prediction performance. From that perspective a validation set (or cross-validation) to tune hyperparameters and a test set to confirm performance make perfect sense.
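For concreteness, here is a minimal sketch (not from the answer) of this kind of hyperparameter tuning: a conjugate Bayesian linear regression with an assumed known noise standard deviation and a $N(0,\tau^2 I)$ prior on the coefficients, where the prior scale $\tau$ is chosen by held-out predictive log-likelihood. The data-generating setup and all names are illustrative.

```python
# A minimal sketch, assuming a conjugate linear model with known noise sd:
# treat the prior scale tau as a hyperparameter and pick it by held-out
# predictive log-likelihood.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, sigma = 80, 3, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.0, -2.0])          # second covariate is irrelevant
y = X @ beta_true + rng.normal(scale=sigma, size=n)

X_train, y_train = X[:60], y[:60]                # "training" split
X_val, y_val = X[60:], y[60:]                    # "validation" split

def posterior(X, y, tau, sigma):
    """Conjugate posterior for beta ~ N(0, tau^2 I) with known noise sd."""
    prec = X.T @ X / sigma**2 + np.eye(X.shape[1]) / tau**2
    cov = np.linalg.inv(prec)
    mean = cov @ X.T @ y / sigma**2
    return mean, cov

for tau in [0.01, 0.1, 1.0, 10.0]:               # grid over the hyperparameter
    m, S = posterior(X_train, y_train, tau, sigma)
    pred_mean = X_val @ m
    pred_sd = np.sqrt(sigma**2 + np.einsum('ij,jk,ik->i', X_val, S, X_val))
    lpd = norm.logpdf(y_val, pred_mean, pred_sd).sum()
    print(f"tau={tau:5.2f}  held-out log predictive density = {lpd:7.2f}")
```

The same loop could be wrapped in cross-validation, with a final untouched test set confirming the performance of the chosen $\tau$.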

I think this is closely related to a number of discussions by Andrew Gelman on his blog (see e.g. blog entry 1, blog entry 2, blog entry 3 on LOO for Stan and discussions on posterior predictive checks), where he discusses his concerns about the (in some sense correct) claims that a Bayesian should not check whether their model makes sense, and about practical Bayesian model evaluation.

Of course, we are very often most interested in using Bayesian methods in settings where there is little prior information and we want to use somewhat informative priors. At that point it may become somewhat tricky to have enough data to get anywhere with validation and evaluation on a test set.

Björn
4

I answered the question on overfitting that you reference, and I watched the video and read the blog post. Radford Neal is not saying that Bayesian models do not overfit. Let us remember that overfitting is the phenomenon of noise being treated as signal and impounded into the parameter estimate. That is not the only source of model selection error. Neal's discussion is broader, but by venturing into the idea of a small sample size he ventured into the discussion of overfitting.

Let me partially revise my prior posting from "Bayesian models can overfit" to "all Bayesian models overfit, but they do so in a way that improves prediction." Again, going back to the definition of confusing signal with noise: the uncertainty in Bayesian methods, the posterior distribution, is the quantification of that uncertainty as to what is signal and what is noise. In doing so, Bayesian methods impound noise into estimates of signal, since the whole posterior is used in inference and prediction. Overfitting and other sources of model selection error are a different type of problem in Bayesian methods.

To simplify, let us adopt the structure of Ma's talk and focus on linear regression, avoiding the deep learning discussion because, as he points out, the alternative methods he mentions are just compositions of functions and there is a direct linkage between the logic of linear regression and deep learning.

Consider the following potential model $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$ Let's create a broad sample of size $N$ composed of two subsamples, $n_1$ and $n_2$, where $n_1$ is the training set and $n_2$ is the validation set. We will see why, subject to a few caveats, Bayesian methods do not need a separate training and validation set.

For this discussion, we need to create eight more parameters, one for each candidate model. They are $m_1,\dots,m_8$. They follow a multinomial distribution and have proper priors, as do the regression coefficients. The eight models are $$y=\beta_0,$$ $$y=\beta_0+\beta_1x_1,$$ $$y=\beta_0+\beta_2x_2,$$ $$y=\beta_0+\beta_3x_3,$$ $$y=\beta_0+\beta_1x_1+\beta_2x_2,$$ $$y=\beta_0+\beta_1x_1+\beta_3x_3,$$ $$y=\beta_0+\beta_2x_2+\beta_3x_3,$$ and $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$

Now we need to get into the weeds of the differences between Bayesian and Frequentist methods. In the training set, $n_1$, the modeler using Frequentist methods chooses just one model. The modeler using Bayesian methods is not so restricted. Although the Bayesian modeler could use a model selection criterion to find just one model, they are also free to use model averaging. The Bayesian modeler is also free to change the selected model in midstream during the validation segment. Moreover, the modeler using Bayesian methods can mix and match between selection and averaging.
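As an illustration of the averaging idea (my own sketch, not the answer's code), the following computes posterior model probabilities for the eight sub-models under an assumed conjugate $N(0,\tau^2 I)$ coefficient prior, an assumed known noise standard deviation, and a uniform prior over the models; the data-generating process is also invented for the example.

```python
# A minimal sketch, assuming a conjugate coefficient prior and known noise sd:
# compute each sub-model's marginal likelihood, then posterior model
# probabilities under a uniform prior over the eight models.
import itertools
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, sigma, tau = 100, 1.0, 2.0
X = rng.normal(size=(n, 3))
y = 0.5 + 1.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=sigma, size=n)  # "true" model uses x1, x3

def log_marginal(X_m, y, sigma, tau):
    """log p(y | model) when beta ~ N(0, tau^2 I): y ~ N(0, sigma^2 I + tau^2 X X')."""
    cov = sigma**2 * np.eye(len(y)) + tau**2 * X_m @ X_m.T
    return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=cov)

models = [s for k in range(4) for s in itertools.combinations(range(3), k)]
ones = np.ones((n, 1))
logml = np.array([log_marginal(np.hstack([ones, X[:, list(s)]]), y, sigma, tau)
                  for s in models])
post = np.exp(logml - logml.max())
post /= post.sum()                               # posterior model probabilities
for s, p in zip(models, post):
    print(f"covariates {s}: posterior probability {p:.3f}")
```

Predictions can then be averaged across models with these weights, or a single model can be selected when one probability dominates, mirroring the mix-and-match flexibility described above.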

To give a real-world example, I tested 78 models of bankruptcy. Of the 78, the combined posterior probability of 76 of them was about one ten-thousandth of one percent. The other two models had roughly 54 percent and 46 percent respectively. Fortunately, they also did not share any variables. That allowed me to select both models and ignore the other 76. When I had all the data points for both, I averaged their predictions based on the posterior probabilities of the two models, using only one model when missing data points precluded the other. While I did have a training set and a validation set, it wasn't for the same reason a Frequentist would have them. Furthermore, at the end of every day over two business cycles, I updated my posteriors with that day's data. That meant that my model at the end of the validation set was not the model at the end of the training set. Bayesian models do not stop learning, while Frequentist models do.

To go deeper, let us get concrete with our models. Let us assume that during the training sample the best-fit Frequentist model and the Bayesian model using model selection matched or, alternatively, that its weight in model averaging was so great that the average was almost indistinguishable from the Frequentist model. We will imagine this model to be $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$ Let's also imagine that the true model in nature is $$y=\beta_0+\beta_1x_1+\beta_3x_3.$$

Now let's consider the difference in the validation set. The Frequentist model is overfitted to the data. Let's assume that by some point $n_2^i$ the model selection or validation procedure has changed the selection to the true model in nature. Further, if model averaging was used, then the true model in nature carried weight in the prediction long before the choice of models was clear-cut. E.T. Jaynes spends some time discussing this issue in his tome on probability theory. I have the book at work so I cannot give you a good citation, but you should read it. Its ISBN is 978-0521592710.

Models are parameters in Bayesian thinking and as such are random, or if you would prefer, uncertain. That uncertainty does not end during the validation process. It is continually updated.

Because of the differences between Bayesian and Frequentist methods, there are other types of cases that also must be considered. The first comes from parameter inference, the second from formal predictions. They are not the same thing in Bayesian methods. Bayesian methods formally separate out inference and decision making. They also separate out parameter estimation and prediction.

Let’s imagine, without loss of generality, that a model would be successful if $\hat{\sigma^2}<k$ and a failure otherwise. We are going to ignore the other parameters because it would be a lot of extra work to get at a simple idea. For the modeler using Bayesian methods, this is a very different type of question than it is for the one using Frequentist methods.

For the Frequentist, a hypothesis test is formed based on the training set. The modeler using Frequentist methods would test whether the estimated variance is greater than or equal to $k$ and attempt to reject the null over the sample of size $n_2$, fixing the parameters to those discovered in $n_1$.

The modeler using Bayesian methods would form parameter estimates from sample $n_1$, and the posterior density from $n_1$ would become the prior for sample $n_2$. Assuming the exchangeability property holds, it is assured that the posterior estimate from $n_2$ is equal, in every sense of the word, to a probability estimate formed from the joint sample. Splitting the data into two samples is, by force of math, equivalent to having not split it at all.
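A minimal sketch of this equivalence, for an assumed conjugate normal-mean model with a known observation standard deviation (nothing here is specific to the bankruptcy application): updating on $n_1$ and then treating that posterior as the prior for $n_2$ reproduces the posterior from the pooled sample exactly.

```python
# A minimal sketch, assuming a conjugate normal-mean model with known sd:
# the posterior after n1, used as the prior for n2, matches the posterior
# from updating once on the pooled data.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                                      # known observation sd
y = rng.normal(loc=3.0, scale=sigma, size=50)
y1, y2 = y[:30], y[30:]                          # "training" n1 and "validation" n2

def update(m, v, data, sigma):
    """Posterior of a normal mean with known sd, starting from prior N(m, v)."""
    prec = 1 / v + len(data) / sigma**2
    post_var = 1 / prec
    post_mean = post_var * (m / v + data.sum() / sigma**2)
    return post_mean, post_var

m0, v0 = 0.0, 10.0**2                            # vague prior
m1, v1 = update(m0, v0, y1, sigma)               # posterior after n1 ...
m12, v12 = update(m1, v1, y2, sigma)             # ... becomes the prior for n2
m_all, v_all = update(m0, v0, y, sigma)          # single update on the pooled data
print(np.allclose([m12, v12], [m_all, v_all]))   # True: splitting changed nothing
```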

For predictions, a similar issue holds. Bayesian methods have a predictive distribution that is also updated with each observation, whereas the Frequentist one is frozen at the end of sample $n_1$. The predictive density can be written as $\Pr(\tilde{x}=k\mid\mathbf{X})$. If $\tilde{x}$ is the prediction and $\mathbf{X}$ is the sample, then where are the parameters, which we will denote $\theta$? Although Frequentist prediction systems do exist, most people just treat the point estimates as the true parameters and calculate residuals. Bayesian methods would score each prediction against the predictive density rather than against a single point. These predictions do not depend on the parameters, unlike the point methods used in Frequentist solutions.
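To make the scoring idea concrete, here is a small sketch (my construction, using an assumed normal-mean model with a flat prior): each new observation is scored against the posterior predictive density, in which the parameters have been integrated out, and against a plug-in density that treats the point estimate as the true parameter.

```python
# A minimal sketch, assuming a normal mean with known sd and a flat prior:
# score new observations under the posterior predictive p(x_new | X), which
# integrates out theta, versus a plug-in density built from a point estimate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma, mu_true = 1.0, 2.0
x = rng.normal(mu_true, sigma, size=20)          # observed sample X
x_new = rng.normal(mu_true, sigma, size=200)     # future observations to score

# Posterior for the mean under a flat prior: N(xbar, sigma^2 / n).
post_mean, post_var = x.mean(), sigma**2 / len(x)

# Posterior predictive: parameters integrated out, so extra variance appears.
log_score_bayes = norm.logpdf(x_new, post_mean, np.sqrt(sigma**2 + post_var)).sum()
# Plug-in predictive: treats the point estimate as the true parameter.
log_score_plugin = norm.logpdf(x_new, post_mean, sigma).sum()
print(log_score_bayes, log_score_plugin)
```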

As a side note, formal Frequentist predictive densities do exist, built from the standard errors, and scoring could be done on them, but this is rare in practice. If there is no specific prior knowledge, then the two sets of predictions should be identical for the same set of data points. They will end up differing because $n_1+n_2>n_1$, and so the Bayesian solution will impound more information.

If there is no material prior information and if Frequentist predictive densities are used rather than point estimates, then for a fixed sample the results of the Bayesian and Frequentist methods will be identical if a single model is chosen. If there is prior information, then the Bayesian method will tend to generate more accurate predictions. This difference can be very large in practice. Further, if there is model averaging, then it is quite likely that the Bayesian method will be more robust. If you use model selection and freeze the Bayesian predictions, then there is no difference from using a Frequentist model with Frequentist predictions.

I used a test and validation set because my data were not exchangeable. As a result, I needed to solve two problems. The first is similar to burn-in in MCMC methods: I needed a good set of parameter estimates to start my test sequence, so I used fifty years of prior data to get a good prior density to start my validation test. The second problem was that I needed some form of standardized period to test in so that the test would not be questioned. I used the two prior business cycles as dated by the NBER.

Dave Harris
  • But then, say you estimated a MAP for a linear regression model with "uninformative" priors. This would be equivalent to obtaining the maximum likelihood estimate for the model, so ML doesn't need a test set either, assuming exchangeability? – Tim Mar 13 '18 at 07:48
  • "overfitting is the phenomenon of noise being treated as signal and impounded into the parameter estimate" I believe this definition is specific towards additive noise models. Otherwise overfitting vs underfitting is not so well defined. – Cagdas Ozgenc Mar 13 '18 at 08:54
  • @CagdasOzgenc thanks. Do you have a suggested edit? – Dave Harris Mar 13 '18 at 16:54
  • @Tim I never mentioned the MAP estimator. If you reduce the problem down to the MAP estimator then you surrender the robustness. The MAP estimator is the point that minimizes a cost function over a density. This can be problematic for projections if the density lacks a sufficient statistic. The MAP estimator would, intrinsically, lose information. If you were using the MAP estimator, which is not in the original question and clearly not a part of Ma's presentation, then you create a different set of problems for yourself. – Dave Harris Mar 13 '18 at 16:58
  • @Tim The MAP estimator comes from Bayesian decision theory and it is an overlay on top of Bayesian estimation and inference. The MAP is convenient. There is a price to be paid when choosing convenience. Unless the all-or-nothing cost function is your true cost function, you are surrendering both information and accuracy. You also end up with different methodological issues than those in Ma's presentation. – Dave Harris Mar 13 '18 at 17:02
  • It is a kind of reductio ad absurdum, but then for MAP (and MLE) I would need a test set and when estimating the distribution I wouldn't? – Tim Mar 13 '18 at 17:52
  • It is dangerous to equate the MAP and the MLE. Under certain defined conditions, with a fixed and defined sample, they are computationally the same. The MAP would float while the MLE would not as you added data. Still, you are drifting from the question and without more knowledge of the question, the quality of estimate the MAP is would be unknown. Ma is not discussing the MAP. The MAP adds a layer of complexity not present in the above discussions. That said, the MAP undergoes Bayesian updating along with the posterior. It isn't possible to judge here what losses the MAP creates. – Dave Harris Mar 14 '18 at 03:49
  • I note even here you cut down the options. Why not consider terms like $x_2\times x_3$ or $\cos(x_1^2)$ as candidates in your model? There is an infinite class of models you can make; you have chosen just $8$. What in your prior information tells you this is "enough"? – probabilityislogic May 05 '19 at 11:03
  • @probabilityislogic It is just an example. If I were working in physics, I might choose $\cos(x_1^2)$. It isn't relevant to the question or the original posting. However, you should ask this as an independent question. Almost all probability models assume that the model is the true model, although there is an extensive robustness literature on issues such as omitted variable bias. I would recommend Cox's original book, Cox, R. T. (1961). The Algebra of Probable Inference. Baltimore, MD: Johns Hopkins University Press, as a starting point. – Dave Harris May 06 '19 at 18:30
  • @probabilityislogic Alternatively, E.T. Jaynes's book mentioned above is a way to think about how to ask your question, as is Cox's book. Your question is discussed extensively in the literature, particularly in the philosophy of science and mathematics. – Dave Harris May 06 '19 at 18:31
2

I have also pondered this question and my tentative answer is a very practical one. Please take this with a pinch of salt.

Suppose you have a model that has no parameters. For instance, your model could be a curve that predicts the growth of some scalar quantity over time, and you have chosen this particular curve because it is prescribed by some available domain knowledge.

Since your model has no parameters, there is no need for a test set. To evaluate how well your model does, you can apply your model on the entire dataset. By applying, I mean you can check how well your chosen curve goes through observed data and use some criterion (e.g. likelihood) to quantify the goodness of the fit.

Now, in practice, our model will have some parameters. The Bayesian methodology sets the goal of calculating the marginal log-likelihood, which involves integrating out all the model parameters. The marginal log-likelihood quantifies how well the model explains the data (or should I say how well the data support the model?). By integrating out, we are left with no parameters to tune/optimise. I will risk saying that this seems to me very similar to the case where we had a model with no parameters, the similarity being that we do not need to adapt any parameters to the observed dataset.
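As a toy illustration of integrating the parameters out (my own example, using an assumed beta-binomial model rather than a curve over time): the marginal likelihood averages the likelihood over the prior on the success probability, so there is nothing left to optimise against the observed data, in contrast with the maximised likelihood.

```python
# A minimal sketch, assuming a beta-binomial model: the success probability
# theta is integrated out analytically, leaving no parameter to tune.
import numpy as np
from scipy.special import betaln, comb
from scipy.stats import binom

n, k = 30, 21                                    # 21 successes in 30 trials
a, b = 1.0, 1.0                                  # Beta(1, 1) prior on theta

# p(k | model) = C(n, k) * B(k + a, n - k + b) / B(a, b)   (theta integrated out)
log_marginal = np.log(comb(n, k)) + betaln(k + a, n - k + b) - betaln(a, b)

# For contrast, the maximised likelihood plugs in the best-fitting theta = k/n.
log_max_lik = binom.logpmf(k, n, k / n)
print(log_marginal, log_max_lik)                 # the marginal is necessarily lower
```

Two rival models could then be compared by their marginal (log-)likelihoods in the same way, without fitting anything to the data first.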

There is a nice quote that I have read on this forum, that "optimisation is the root of all evil in statistics" (I think it originates from user DikranMarsupial). In my understanding, this quote says that once you have stated your model assumptions, all you have to do is "turn the Bayesian crank". In other words, as long as you can be Bayesian (i.e. integrate out the parameters), you have no reason to worry about overfitting, as you are considering all possible settings of your parameters (with density dictated by the prior) according to your model assumptions. If instead you need to optimise a parameter, it is difficult to tell whether you are over-adapting it to the particular data you observe (overfitting) or not. One practical way of testing for overfitting is of course holding out a test set which is not used when optimising the parameter.

In the presence of competing rival models that effectively challenge your assumptions, you can compare them using the marginal log-likelihood in order to find the most likely one. Of course, somebody naughty may posit a model which perfectly replicates the observed data (trivially, it could be the data itself). In such a case, I am not sure how I would defend myself. In real life, however, it would be hard to motivate this contrived model in a setting such as physics, where explanation-based models are required...

ngiann