
I just ran into the "Akaike information criterion", and I noticed the large amount of literature on model selection (things like BIC also seem to exist).

Why don't contemporary machine learning methods take advantage of these BIC and AIC model selection criteria?

Richard Hardy
echo
  • Because nobody's calculating the likelihoods? – Aksakal Nov 15 '17 at 14:16
  • What do you mean by "contemporary machine learning methods"? As far as I have seen, AIC and BIC are used frequently. – Ferdi Nov 15 '17 at 14:16
  • It is simply an adjusted error variance taking into account the sample size and the number of parameters. In my view/religion/belief system it is simply descriptive and not inferential. – IrishStat Nov 15 '17 at 14:19
  • Why not calculate the likelihoods? – echo Nov 15 '17 at 14:28
  • Also, why the -1? Remember there are no stupid questions – each question tries to shed light on the universe. – echo Nov 15 '17 at 14:28
  • @echo: I didn't downvote, but I think your question would be improved if you could source/support the main claim (that machine learning methods do not take advantage of these BIC and AIC model selection criteria). – user603 Nov 15 '17 at 14:32
  • @user603 I haven't seen AIC and BIC used in ML practice. Maybe researchers use them, but certainly not practitioners. – Aksakal Nov 15 '17 at 14:43
  • @Aksakal Thanks. I think it is better if questions constructed around a sweeping claim could source that claim, as a general rule. – user603 Nov 15 '17 at 14:46
  • @echo, outside the training sample, what's good about likelihoods? Also, technically, how would you do them? – Aksakal Nov 15 '17 at 14:50

1 Answer

AIC and BIC are used, e.g., in stepwise regression. They are actually part of a larger class of "heuristics" that see similar use. For example, the DIC (Deviance Information Criterion) is often used in Bayesian model selection.
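
For reference, both criteria penalize the maximized likelihood $\hat{L}$ by a function of the number of free parameters $k$ (and, for BIC, the sample size $n$):

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}.$$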

However, they are essentially "heuristics". While it can be shown that both AIC and BIC are asymptotically equivalent to cross-validation approaches (AIC to leave-one-out CV (Stone, 1977), and BIC to a particular leave-$k$-out scheme), they are known to under-penalize and over-penalize respectively; i.e., using AIC you will often get a model that is more complicated than it should be, whereas with BIC you often get a model that is too simplistic.

Since both are related to CV, cross-validation itself is often a better choice, as it does not suffer from these problems.
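
As a rough illustration, here is a minimal sketch (my own construction, with a made-up polynomial-regression dataset and a Gaussian likelihood) of how AIC, BIC, and leave-one-out CV scores can be compared across candidate model orders:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.5, size=n)  # true degree is 2

def fit_poly(x, y, degree):
    """OLS polynomial fit; returns the residual sum of squares and parameter count."""
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return rss, degree + 2  # coefficients plus the noise variance

def loo_mse(x, y, degree):
    """Leave-one-out CV mean squared error via explicit refitting."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        X = np.vander(x[mask], degree + 1)
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        pred = np.vander(x[i:i + 1], degree + 1) @ beta
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

for degree in range(1, 8):
    rss, k = fit_poly(x, y, degree)
    loglik = -n / 2 * np.log(rss / n)  # Gaussian log-likelihood up to a shared constant
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    print(f"degree={degree}  AIC={aic:8.2f}  BIC={bic:8.2f}  LOO-MSE={loo_mse(x, y, degree):.4f}")
```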

Then, finally, there is the issue of the number of parameters required for BIC and AIC. With general function approximators (e.g. KNNs) on real-valued inputs, it is possible to "hide" parameters, i.e. to construct a single real number that contains the same information as two real numbers (think e.g. of interleaving their digits, as sketched below). In that case, what is the actual number of parameters? On the other hand, with more complicated models you may have constraints on your parameters, say you can only fit parameters such that $\theta_1 > \theta_2$ (see e.g. here). Or you may have non-identifiability, in which case multiple values of the parameters actually give the same model. In all these cases, simply counting parameters does not give a suitable estimate.
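
To make the digit-interleaving idea concrete, here is a minimal finite-precision sketch (the function names and the 7-digit precision are my own choices for illustration; the exact construction needs infinite decimal expansions):

```python
def interleave(a, b, digits=7):
    """Pack two reals in [0, 1) into one by interleaving their decimal digits."""
    da = f"{round(a * 10**digits):0{digits}d}"
    db = f"{round(b * 10**digits):0{digits}d}"
    merged = "".join(p + q for p, q in zip(da, db))
    return int(merged) / 10**(2 * digits)

def deinterleave(c, digits=7):
    """Recover the two packed reals from the single 'hidden' parameter."""
    dc = f"{round(c * 10**(2 * digits)):0{2 * digits}d}"
    return int(dc[0::2]) / 10**digits, int(dc[1::2]) / 10**digits

theta = interleave(0.1234567, 0.7654321)  # a single parameter...
print(theta, deinterleave(theta))         # ...carrying two: (0.1234567, 0.7654321)
```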

Since many contemporary machine-learning algorithms have these properties (i.e. universal approximation, an unclear number of parameters, non-identifiability), AIC and BIC are less useful for these models than they may seem at first glance.

EDIT:

Some more points that could be clarified:

  1. It seems I was wrong to consider the mapping given by interleaving digits a bijection from $\mathbb{R}$ to $\mathbb{R}^N$ (see here). The details of why this isn't a bijection are a bit hard to understand, but we don't actually need a bijection for the idea to work (a surjection is enough).
  2. By a proof due to Cantor (1877), there must be a bijection between $\mathbb{R}$ and $\mathbb{R}^N$. Although this bijection cannot be defined explicitly, its existence can be proven (though this seems to require the axiom of choice). This bijection can still be used in a theoretical model (it may not be possible to actually implement it in a computer) to unpack a single parameter into an arbitrary number of parameters.
  3. We don't actually need the mapping from $\mathbb{R}$ to $\mathbb{R}^N$ to be a bijection; any surjective function $\mathbb{R}\rightarrow\mathbb{R}^N$ is enough to unpack multiple parameters from a single one. Such surjections can be shown to exist as limits of sequences of other functions (so-called space-filling curves, e.g. the Peano curve).
  4. Because neither Cantor's proof (which simply proves the existence of the bijection without giving an example) nor the space-filling curves (which exist only as limits of constructive objects and are therefore not constructive themselves) is constructive, the argument I made is only a theoretical one. In theory, we could just keep adding parameters to a model to reduce the BIC below any desired value (on the training set). In an actual model implementation, however, we have to approximate the space-filling curve (see the sketch after this list), so approximation error may prohibit us from actually doing so (I have not actually tested this).
  5. Because the bijection argument requires the axiom of choice, the proof becomes invalid if you do not accept this axiom (although most mathematicians do). That means that in constructive mathematics this may not be possible, but I don't know what role constructive mathematics plays in statistics.
  6. Identifiability is intrinsically linked to functional complexity. If one simply takes an identifiable $N$-parameter model and adds a superfluous parameter (e.g. one not used anywhere), the new model becomes non-identifiable. Essentially, one is using a model with the complexity of $\mathbb{R}^{N+1}$ to solve a problem with the complexity of $\mathbb{R}^N$. The same holds for other forms of non-identifiability: take, for example, non-identifiable parameter permutations. There one is using a model with the complexity of $\mathbb{R}^N$, whereas the actual problem only has the complexity of a set of equivalence classes over $\mathbb{R}^N$. However, this is only an informal argument; I don't know of any formal treatment of this notion of "complexity".
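
As a rough illustration of the approximation mentioned in point 4, here is a minimal sketch (my own construction, not taken from any paper) that uses a finite-order Hilbert curve, a standard space-filling curve, to "unpack" one parameter $\theta\in[0,1)$ into two; increasing the order approximates the space-filling limit ever more closely:

```python
def hilbert_d2xy(order, d):
    """Map an index d in [0, 4**order) to integer coordinates (x, y) on the
    order-`order` Hilbert curve (standard iterative construction)."""
    x = y = 0
    s = 1
    while s < 2**order:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:  # rotate the quadrant if necessary
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def unpack(theta, order=10):
    """Approximately unpack a single parameter theta in [0, 1) into two."""
    side = 2**order
    d = int(theta * side * side)  # position along the discretized curve
    x, y = hilbert_d2xy(order, d)
    return x / side, y / side

print(unpack(0.3))  # one "hidden" parameter yields two usable ones
```
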
LiKao
  • Care to chime in on this post https://stats.stackexchange.com/questions/325129/how-can-the-aic-or-bic-be-used-instead-of-the-train-test-split ? I haven't had any luck with it for a while. – Skander H. Jan 26 '18 at 21:42
  • @LiKao Can you cite references on the "techniques" of hiding parameters, like the case of interleaving digits? – horaceT Sep 01 '19 at 22:16
  • @horaceT Unfortunately I don't know of any paper that gives this example. In papers on MDL there is the notion of "functional complexity" (e.g. https://lpl.psy.ohio-state.edu/documents/MNP.pdf, see eq. 10). Often the example is made with constrained parameters (e.g. https://www.researchgate.net/publication/262681878_Generalized_outcome-based_strategy_classification_Comparing_deterministic_and_probabilistic_choice_models). I like to turn the example around when discussing this and show that a complex single parameter can capture multiple simple parameters, because I find it more intuitive. – LiKao Sep 09 '19 at 12:04
  • @horaceT Also, if you like a more mathematical treatment, consider that it is proven that filling curves exist, i.e. there is a bijection $f_{1,2}:\mathbb{R} \rightarrow \mathbb{R}^2$. This bijection can easily be used to define bijections $f_{1,N}:\mathbb{R}\rightarrow \mathbb{R}^N$. So for any model with $N$ parameters, I can use $f_{1,N}$ to first get an $N$-dimensional vector from my single parameter, and then supply this vector as a parameter to the $N$-parameter model. This gives me a functionally equivalent $1$-parameter model. Fitting that model, however, would be very complicated, to say the least. – LiKao Sep 09 '19 at 12:11
  • @LiKao This is quite fascinating. Pls reference said proof of "filling curves". I could see that constrained parameters have "fewer" degrees of freedom. Naively, if f(x,y) = 0, y is just a function of x; you just put g(x) where y is. Can't you do similar things with constrained optimization? – horaceT Sep 09 '19 at 21:37
  • @horaceT It has been a while, so I had to look it up. According to https://arxiv.org/pdf/1003.1467.pdf the initial proof goes back to Cantor in 1877, but it fails to give a correct citation and I can't find any further references to it. Also, it seems I was wrong when I considered the simple interleaving of digits to be a bijection (see https://math.stackexchange.com/questions/183361/examples-of-bijective-map-from-mathbbr3-rightarrow-mathbbr/183383#183383). Finally, it seems the axiom of choice is needed (https://www.reddit.com/r/askscience/comments/6opvae/simplest_proof_that_r2_r/dkjftan/). – LiKao Sep 10 '19 at 09:42
  • @horaceT Most proofs I could find simply define two injections $\mathbb{R} \rightarrow \mathbb{R}^2$ and $\mathbb{R}^2 \rightarrow \mathbb{R}$, which shows that both sets have the same cardinality. For any two sets of the same cardinality there must be a bijection (Schröder–Bernstein theorem), so this proves (non-constructively) the existence of the proposed bijection. The paper I linked above proposes a proof that this bijection can't be continuous, but that is OK, because we don't need continuity. – LiKao Sep 10 '19 at 09:50
  • @horaceT There are also continuous maps $\mathbb{R}\rightarrow\mathbb{R}^2$ (e.g. by taking a bijection $\mathbb{R} \rightarrow (0,1)$, for example the logistic function, and combining it with the Peano curve), but these won't be injective. For the argument I made this is fine; we don't even need a bijection to unpack the parameters. However, if we use the Peano curve, this means an identifiable $N$-parameter model can be re-written as a non-identifiable $1$-parameter model. This just makes the idea weirder, because usually non-identifiability is linked to too many parameters. – LiKao Sep 10 '19 at 09:56
  • @horaceT Also, I noticed a typo in my comment: I meant to write "space-filling curves" instead of "filling curves". If you use the correct term, Google will give you many more references. – LiKao Sep 10 '19 at 10:01
  • @LiKao Great stuff you have here. I will read up on it a bit, but it seems quite technical... As an aside, the admin folks here at stackoverflow will complain that we used up comment space for such an extensive discussion. I think it's better if you reformulate this and make it an addendum/edit to your original answer. A lot of folks will benefit from this rather unusual perspective on AIC. – horaceT Sep 10 '19 at 16:10
  • @LiKao Two more things. Pls give examples/citations where simply counting parameters leads to a wrong model selection decision. Secondly, identifiability is relatively unimportant with modern deep neural nets: as long as you find a good model, no one cares whether another model exists with a different configuration. Pls comment on whether this view is correct. – horaceT Sep 10 '19 at 18:18
  • @horaceT "Wrong" model selection is a bit of a complicated topic, if one assumes that "all models are wrong". However, https://stats.stackexchange.com/questions/52404/comparison-between-mdl-and-bic/246611 is a question on CV where MDL (functional complexity) and BIC (parameter counting) give conflicting results. http://fusion.isif.org/proceedings/fusion05CD/Content/papers/650950f44b0416ea9cf0a134006d.pdf indicates that MDL performs better than BIC in their recovery simulation, showing that parameter counting is insufficient. – LiKao Sep 11 '19 at 08:32
  • @horaceT I am unfortunately not aware of any papers that demonstrate the "space-filling curve trick", mostly because I came up with this example myself. I might actually try it out if I ever have the time, but currently I have too much else to do. The argument is the groundwork for a mathematical proof, and in mathematics a proof beats a practical demonstration. NOTE: this trick is about theoretical models, not about model implementations. In practice we have to approximate models (we can't even do $f(x)=x/5$ exactly in the computer). Space-filling curves, especially, can only be approximated. – LiKao Sep 11 '19 at 08:39
  • @horaceT Finally: identifiability is often not a problem. If you just care about prediction, you are fine if you find $\theta$ such that your model performs well, and you don't care about a $\theta'$ that performs the same. But identifiability is related to complexity. Assume you make a more complex model just by adding an unused parameter: the model becomes unidentifiable, because you have used a representation that is too complex. The same happens if reflections of $\theta$ give the same results. You could instead use equivalence classes over the original space (lower complexity) and get the same model. – LiKao Sep 11 '19 at 08:44
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/98526/discussion-between-likao-and-horacet). – LiKao Sep 11 '19 at 08:44
  • I have thought before about mapping multiple parameters into single parameters: https://stats.stackexchange.com/a/364137/164061 I wonder what it means for the number of parameters, or degrees of freedom, in relation to BIC and AIC. I guess that one can impose reasonable limitations on the type of models. I personally always form a mental image of the space of the observations, in which our model is a (multidimensional) surface trying to fit the results/observations as closely as possible. So it is not about the number of parameters but about the dimension of the solution space. – Sextus Empiricus Sep 16 '19 at 15:29