I'm unsure exactly what you are referring to with the idea of "points".
As discussed in the first section of the linked post, deviance is a measure of fit
used to compare nested models via their log-likelihoods, and it is not restricted to classification trees.
There is a good discussion in MLE in Laymans Terms that builds intuition for interpreting
maximum likelihoods. Saying two log-likelihoods differ by 20 "points" seems
no more meaningful than saying the numbers 10.7 and 30.7 differ by 20 "points".
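To show how such a gap is actually used, here is a minimal sketch of the likelihood-ratio comparison of two nested models, where twice the difference in log-likelihoods (the change in deviance) is referred to a chi-squared distribution. All numbers are made up purely for illustration.

```python
# Minimal sketch: how a gap between two log-likelihoods is used in practice,
# via the likelihood-ratio statistic 2*(l_big - l_small) for nested models.
# All numbers below are made up purely for illustration.
from scipy.stats import chi2

l_small = -130.7   # log-likelihood of the smaller (nested) model
l_big = -110.7     # log-likelihood of the larger model (20 higher)
extra_params = 3   # number of extra parameters in the larger model

lr_stat = 2 * (l_big - l_small)              # change in deviance = 40
p_value = chi2.sf(lr_stat, df=extra_params)  # compare against chi-squared(3)
print(lr_stat, p_value)
```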
The Wikipedia article and the original paper, Generalized Linear Models (Nelder and Wedderburn, 1972),
both define the concept, but the best-motivated introduction I have found is in
McCullagh and Nelder (1989):
Having selected a particular model, it is required to estimate the
parameters and to assess the precision of the estimates. In the
case of generalized linear models, estimation proceeds by defining
a measure of goodness of fit between the observed data and the
fitted values generated by the model. The parameter estimates are
the values that minimize the goodness-of-fit criterion. We shall
be concerned primarily with estimates obtained by maximizing the
likelihood or log likelihood of the parameters for the data observed.
...
There are advantages in using as the goodness-of-fit criterion, not the log likelihood $l(\mu;y)$ but a particular linear function, namely
$$
D^{*}(y; \mu) = 2l(y;y) - 2l(\mu;y)
$$
which we call the scaled deviance. Note that, for the exponential-family models considered here, $l(y;y)$ is the maximum likelihood achievable for an exact fit in which the fitted values are equal to the observed data. Because $l(y;y)$ does not depend on the parameters, maximizing $l(\mu; y)$ is equivalent to minimizing $D^*(y; \mu)$ with respect to $\mu$, subject to the constraints imposed by the model.
The justification for comparing to $l(y;y)$ is also discussed:
Given $n$ observations we can fit models to them containing up to $n$ parameters. The simplest model, the null model, has one parameter, representing a common $\mu$ for all the $y$s; the null model thus consigns all the variation between the $y$s to the random component. At the other extreme the full model has $n$ parameters, one per observation, and the $\mu$s derived from it match the data exactly. The full model thus consigns all the variation in the $y$s to the systematic component, leaving none for the random component.
In practice the null model is usually too simple and the full model is uninformative because it does not summarize the data but merely repeats them in full. However, the full model gives us a baseline for measuring the discrepancy for an intermediate model with $p$ parameters.
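To make the definition concrete, here is a minimal sketch, entirely my own and not from the cited texts, that simulates some Poisson data, computes $2l(y;y) - 2l(\mu;y)$ by hand for a null model and for a model with one predictor, and checks the latter against the deviance reported by statsmodels (for the Poisson family the scale parameter is 1, so scaled deviance and deviance coincide).

```python
# Sketch of D*(y; mu) = 2*l(y; y) - 2*l(mu; y) for a Poisson GLM.
# The simulated data, variable names, and use of statsmodels are illustrative
# assumptions, not anything prescribed by McCullagh and Nelder.
import numpy as np
from scipy.stats import poisson
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))   # counts depending on one predictor

def log_lik(mu, y):
    """Poisson log-likelihood l(mu; y) of fitted means mu for observed counts y."""
    return poisson.logpmf(y, np.maximum(mu, 1e-12)).sum()

# Full (saturated) model: one parameter per observation, fitted values equal the data.
l_full = log_lik(y, y)

# Null model: a single common mean for every observation.
l_null = log_lik(np.full(n, y.mean()), y)

# Intermediate model with two parameters (intercept and x).
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
l_model = log_lik(fit.fittedvalues, y)

print("null deviance (by hand):     ", 2 * (l_full - l_null))
print("model deviance (by hand):    ", 2 * (l_full - l_model))
print("model deviance (statsmodels):", fit.deviance)
```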
Gelman and Hill (2007) also discuss the following properties of deviance:
- Deviance is a measure of error; lower deviance means better fit to data.
- If a predictor that is simply random noise is added to a model, we expect deviance
to decrease by 1, on average (a quick simulation of this appears below).
- When an informative predictor is added to a model, we expect deviance to decrease
by more than 1; when $k$ informative predictors are added, we expect
deviance to decrease by more than $k$.
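The second point can be checked directly by simulation. The following sketch (my own toy setup, not code from Gelman and Hill) repeatedly fits a logistic regression with and without a pure-noise predictor and averages the resulting drops in deviance; the mean comes out close to 1, since each drop is approximately a $\chi^2_1$ draw under the null.

```python
# Sketch: adding a pure-noise predictor reduces deviance by about 1 on average.
# The data-generating process and sample sizes are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps = 500, 200
drops = []
for _ in range(reps):
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(0.3 + 0.7 * x)))   # true model uses only x
    y = rng.binomial(1, p)
    noise = rng.normal(size=n)                   # predictor unrelated to y

    X0 = sm.add_constant(x)                            # without the noise predictor
    X1 = sm.add_constant(np.column_stack([x, noise]))  # with the noise predictor
    d0 = sm.GLM(y, X0, family=sm.families.Binomial()).fit().deviance
    d1 = sm.GLM(y, X1, family=sm.families.Binomial()).fit().deviance
    drops.append(d0 - d1)

print("average decrease in deviance:", np.mean(drops))   # close to 1
```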
References
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 135(3), 370-384.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall. pp. 23-25, 33-36.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. p. 100.