
For my current research I'm using the Lasso method via the glmnet package in R on a binomial dependent variable.

In glmnet the optimal lambda is found via cross-validation, and the resulting models can be compared with various measures, e.g. misclassification error or deviance.

My question: How exactly is deviance defined in glmnet? How is it calculated?

(In the corresponding paper "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Friedman et al. I only find this comment on the deviance used in cv.glmnet: "mean deviance (minus twice the log-likelihood on the left-out data)" (p. 17)).

smci
Jo Wmann
  • It's the same as the deviance used in `glm` (or at least, it should be -- there's only one definition of deviance I'm aware of). – Hong Ooi Oct 03 '14 at 09:30
  • Yes, but I think they extend it in some way, as indicated by the quote in my first post. Deviance, as I understand it, can compare the performance of two models, but how do the authors include the left-out data of the cross-validation then? How does the "minus twice the log-likelihood on the left-out data" make sense? – Jo Wmann Oct 03 '14 at 09:39
  • If it is as you describe, it is probably doing something along the following lines: CV leaves a small segment out for testing. Once they calculate coefficients for each lambda, they plug in the left-out data and calculate the deviance for each lambda, which is -2*sum(log pdf(test data | coeff)). The PDF is probably Gaussian. – Cagdas Ozgenc Oct 03 '14 at 10:28
  • Alright, thanks, now I think I got it: the deviance is defined as -2*log-likelihood, or rather (2*log-likelihood)/(log-likelihood of the null model). This also explains why their deviance measure for the different values of lambda does not exceed the interval [0, 2]. The model is estimated on the k-1 folds of the cross-validation and applied to the remaining fold. For the application on the remaining fold the log-likelihood score is calculated. This is repeated k times, and the mean over the k results of the above-defined deviance measure is returned for each lambda. – Jo Wmann Oct 03 '14 at 10:59
  • Yes, it is always averaged over all folds for each lambda. I think you can use either the deviance directly or the ratio w.r.t. the null model, which is probably the intercept-only model. There are two pitfalls: (a) folds may not have exactly the same number of data points; (b) each fold contains different data (naturally). To fix (a) you may simply divide the deviance by the number of data points in the selected fold. To fix (a) and (b) at the same time, use the ratio approach. The deviance model assumes that the data set is the same in each model (the same idea as in the MAP estimate, where they ignore the denominator). – Cagdas Ozgenc Oct 03 '14 at 11:44
  • However, once folds get into the picture, the denominator is not the same across the folds. So the ratio takes care of that by cancelling out the denominators. But I don't know how big of a problem this is when you average over folds. – Cagdas Ozgenc Oct 03 '14 at 11:46
  • Good points. As the authors implemented the deviance measure in the cross-validation, I suppose they were aware of these pitfalls and therefore indeed used the ratios. My guess is that they did not define deviance as I described before, but rather as "minus twice the log of the likelihood ratio" - the likelihood ratio being (logLik(glm.null)-logLik(glm))/logLik(glm.null). I suppose this makes the most sense?! – Jo Wmann Oct 03 '14 at 12:18
  • I don't really follow you. (logLik(glm.null)-logLik(glm))/logLik(glm.null) = 1 - ratio, which is pretty much the same thing, depending on whether you maximize or minimize. The -2 cancels out; it has no meaning anyhow. In my opinion the correct deviance doesn't have the factor of 2. – Cagdas Ozgenc Oct 03 '14 at 12:55
  • The problem I am having is that the model returns deviance not in absolute numbers, which would be somewhere in the thousands, but as some positive ratio, typically between 0 and 2. The optimum, i.e. the best-fitting model, is stated to be the one with the lowest value of this measure or ratio of deviance. – Jo Wmann Oct 03 '14 at 15:46
  • @CagdasOzgenc: to be clear, it seems best practice when running several `cv.glmnet()` calls is to pick one static set of foldids (e.g. using `cvTools::cvFolds()`). Otherwise the results are at the mercy of the random-seeded behavior of the `sample.int()` calls generating the foldids inside `cv.glmnet()`. – smci Feb 07 '17 at 13:28
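The per-fold calculation described in the comments above can be sketched in a few lines of base R, using an unpenalized glm() in place of glmnet (simulated data; the fold handling is illustrative, not glmnet's exact implementation):

```r
# Sketch of cross-validated binomial deviance: fit on k-1 folds,
# evaluate -2 * log-likelihood on the left-out fold, average over folds.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
dat <- data.frame(x = x, y = y)

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold assignment

fold_dev <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- glm(y ~ x, family = binomial, data = train)
  p     <- predict(fit, newdata = test, type = "response")
  # minus twice the log-likelihood on the left-out data, per observation
  -2 * mean(test$y * log(p) + (1 - test$y) * log(1 - p))
})

mean(fold_dev)  # averaged over folds, as cv.glmnet does for each lambda
```

Dividing by the fold size inside sapply() is the normalization for unequal fold sizes mentioned in the comments.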

2 Answers


In Friedman, Hastie, and Tibshirani (2010), the deviance of a binomial model, for the purpose of cross-validation, is calculated as

minus twice the log-likelihood on the left-out data (p. 17)

Given that this is the paper cited in the documentation for glmnet (on pp. 2 and 5), that is probably the formula used in the package.

And indeed, in the source code for function cvlognet, the deviance residuals for the response are calculated as

-2*((y==2)*log(predmat)+(y==1)*log(1-predmat))

where predmat is simply

predict(glmnet.object,x,lambda=lambda)

and passed in from the enclosing cv.glmnet function. I used the source code available on the JStatSoft page for the paper, and I don't know how up-to-date that code is. The code for this package is surprisingly simple and readable; you can always check for yourself by typing glmnet:::cv.glmnet.
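To see what that expression computes, here is a hand-rolled check with made-up numbers (glmnet codes the binary response internally as 1/2, and predmat here stands in for the predicted probabilities of class 2; the values are illustrative, not real glmnet output):

```r
# Hand-rolled evaluation of the binomial deviance expression quoted above.
# y uses the 1/2 class coding; predmat holds fitted probabilities of class 2.
y       <- c(2, 1, 2, 1)
predmat <- c(0.9, 0.2, 0.6, 0.4)

# per-observation contribution: -2 * log-likelihood of the observed class
dev_contrib <- -2 * ((y == 2) * log(predmat) + (y == 1) * log(1 - predmat))
dev_contrib
mean(dev_contrib)  # cv.glmnet averages such contributions over left-out data
```

Each contribution is small when the observed class gets high predicted probability and blows up as that probability approaches zero.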

shadowtalker

In addition to @shadowtalker's answer: when I was using the package glmnet, I felt that the deviance in the cross-validation is somehow normalized.

library(glmnet)
data(BinomialExample)

fit = cv.glmnet(x, y, family = "binomial", intercept = FALSE)
head(fit$cvm) # deviance from test samples at lambda value

# >[1] 1.383916 1.359782 1.324954 1.289653 1.255509 1.223706

# deviance from (test samples? all samples?) at lambda value
head(deviance(fit$glmnet.fit))

# >[1] 138.6294 134.5861 131.1912 127.1832 122.8676 119.1637

Ref: R documentation for deviance

because if I do the division,

head(deviance(fit$glmnet.fit)) / length(y)

the result is

[1] 1.386294 1.345861 1.311912 1.271832 1.228676 1.191637

which is very close to the fit$cvm.
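The scaling can also be checked without glmnet: for a 0/1 binomial fit with base R's glm(), deviance() returns the total -2*log-likelihood (the saturated log-likelihood is zero for 0/1 data), and dividing by the number of observations puts it on the same per-observation scale as fit$cvm (simulated data; nothing glmnet-specific is assumed):

```r
# Total vs. per-observation binomial deviance with base glm(),
# mirroring the deviance(fit$glmnet.fit) / length(y) division above.
set.seed(42)
n <- 100
x_sim <- rnorm(n)
y_sim <- rbinom(n, 1, plogis(x_sim))

fit_glm <- glm(y_sim ~ x_sim, family = binomial)
p <- fitted(fit_glm)

total_dev <- -2 * sum(y_sim * log(p) + (1 - y_sim) * log(1 - p))
all.equal(total_dev, deviance(fit_glm))  # TRUE: deviance is the total -2*loglik here
total_dev / n                            # per-observation scale, like fit$cvm
```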

This may be what the comment from @Hong Ooi on this question was referring to:

https://stackoverflow.com/questions/43468665/poisson-deviance-glmnet

vtshen