I'm unsure exactly what you are referring to with the idea of "points".
As discussed in the first section of the linked post, deviance is a measure of fit
used to compare nested models via their log-likelihoods, and it is not restricted to classification trees.
There is a good discussion in MLE in Laymans Terms that builds intuition for interpreting
maximum likelihoods. Saying two log-likelihoods differ by 20 "points" seems
no more meaningful than saying the numbers 10.7 and 30.7 differ by 20 "points".
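To show how such a gap is actually used, here is a minimal sketch of the likelihood-ratio comparison of two nested models, where twice the difference in log-likelihoods (the change in deviance) is referred to a chi-squared distribution. All numbers are made up purely for illustration.

```python
# Minimal sketch: how a gap between two log-likelihoods is used in practice,
# via the likelihood-ratio statistic 2*(l_big - l_small) for nested models.
# All numbers below are made up purely for illustration.
from scipy.stats import chi2

l_small = -130.7   # log-likelihood of the smaller (nested) model
l_big = -110.7     # log-likelihood of the larger model (20 higher)
extra_params = 3   # number of extra parameters in the larger model

lr_stat = 2 * (l_big - l_small)              # change in deviance = 40
p_value = chi2.sf(lr_stat, df=extra_params)  # compare against chi-squared(3)
print(lr_stat, p_value)
```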
The Wikipedia article and the original paper, Generalized Linear Models (Nelder and Wedderburn, 1972),
both define the concept, but the best-motivated introduction I have found is in
McCullagh and Nelder (1989):
Having selected a particular model, it is required to estimate the
parameters and to assess the precision of the estimates. In the
case of generalized linear models, estimation proceeds by defining
a measure of goodness of fit between the observed data and the
fitted values generated by the model. The parameter estimates are
the values that minimize the goodness-of-fit criterion. We shall
be concerned primarily with estimates obtained by maximizing the
likelihood or log likelihood of the parameters for the data observed.
...
There are advantages in using as the goodness-of-fit criterion, not the log likelihood $l(\mu;y)$ but a particular linear function, namely
$$
D^{*}(y; \mu) = 2l(y;y) - 2l(\mu;y)
$$
which we call the scaled deviance. Note that, for the exponential-family models considered here, $l(y;y)$ is the maximum likelihood achievable for an exact fit in which the fitted values are equal to the observed data. Because $l(y;y)$ does not depend on the parameters, maximizing $l(\mu; y)$ is equivalent to minimizing $D^*(y; \mu)$ with respect to $\mu$, subject to the constraints imposed by the model.
The justification for comparing to $l(y;y)$ is also discussed:
Given $n$ observations we can fit models to them containing up to $n$ parameters. The simplest model, the null model, has one parameter, representing a common $\mu$ for all the $y$s; the null model thus consigns all the variation between the $y$s to the random component. At the other extreme the full model has $n$ parameters, one per observation, and the $\mu$s derived from it match the data exactly. The full model thus consigns all the variation in the $y$s to the systematic component, leaving none for the random component.
In practice the null model is usually too simple and the full model is uninformative because it does not summarize the data but merely repeats them in full. However, the full model gives us a baseline for measuring the discrepancy for an intermediate model with $p$ parameters.
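To make the definition concrete, here is a minimal sketch, entirely my own and not from the cited texts, that simulates some Poisson data, computes $2l(y;y) - 2l(\mu;y)$ by hand for a null model and for a model with one predictor, and checks the latter against the deviance reported by statsmodels (for the Poisson family the scale parameter is 1, so scaled deviance and deviance coincide).

```python
# Sketch of D*(y; mu) = 2*l(y; y) - 2*l(mu; y) for a Poisson GLM.
# The simulated data, variable names, and use of statsmodels are illustrative
# assumptions, not anything prescribed by McCullagh and Nelder.
import numpy as np
from scipy.stats import poisson
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))   # counts depending on one predictor

def log_lik(mu, y):
    """Poisson log-likelihood l(mu; y) of fitted means mu for observed counts y."""
    return poisson.logpmf(y, np.maximum(mu, 1e-12)).sum()

# Full (saturated) model: one parameter per observation, fitted values equal the data.
l_full = log_lik(y, y)

# Null model: a single common mean for every observation.
l_null = log_lik(np.full(n, y.mean()), y)

# Intermediate model with two parameters (intercept and x).
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
l_model = log_lik(fit.fittedvalues, y)

print("null deviance (by hand):     ", 2 * (l_full - l_null))
print("model deviance (by hand):    ", 2 * (l_full - l_model))
print("model deviance (statsmodels):", fit.deviance)
```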
Gelman and Hill (2007) also discuss the following properties of deviance:
- Deviance is a measure of error; lower deviance means better fit to data.
- If a predictor that is simply random noise is added to a model, we expect deviance
to decrease by 1, on average (a quick simulation of this appears below).
- When an informative predictor is added to a model, we expect deviance to decrease
by more than 1; when $k$ informative predictors are added, we expect
deviance to decrease by more than $k$.
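The second point can be checked directly by simulation. The following sketch (my own toy setup, not code from Gelman and Hill) repeatedly fits a logistic regression with and without a pure-noise predictor and averages the resulting drops in deviance; the mean comes out close to 1, since each drop is approximately a $\chi^2_1$ draw under the null.

```python
# Sketch: adding a pure-noise predictor reduces deviance by about 1 on average.
# The data-generating process and sample sizes are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps = 500, 200
drops = []
for _ in range(reps):
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(0.3 + 0.7 * x)))   # true model uses only x
    y = rng.binomial(1, p)
    noise = rng.normal(size=n)                   # predictor unrelated to y

    X0 = sm.add_constant(x)                            # without the noise predictor
    X1 = sm.add_constant(np.column_stack([x, noise]))  # with the noise predictor
    d0 = sm.GLM(y, X0, family=sm.families.Binomial()).fit().deviance
    d1 = sm.GLM(y, X1, family=sm.families.Binomial()).fit().deviance
    drops.append(d0 - d1)

print("average decrease in deviance:", np.mean(drops))   # close to 1
```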
References
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 135(3), 370-384.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall. pp. 23-25, 33-36.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. p. 100.