I am not entirely sure whether this would be better posted on Stack Overflow or maths.se; however, as it is about model fitting, I thought I would try here first.

The Skellam distribution $\operatorname{Skel}(\mu_1, \mu_2)$ is the distribution of the difference between two independent Poisson random variables with mean parameters $\mu_1$ and $\mu_2$ respectively. The R package skellam has a regression function skellam.reg which returns MLE estimates of parameters $\beta_{in}$ where $i = 1,2$ in the following regression equations: $$\ln(\mu_1) = \beta_{10} + \beta_{11} x_1 + \beta_{12} x_2 + \dots$$ $$\ln(\mu_2) = \beta_{20} + \beta_{21} x_1 + \beta_{22} x_2 + \dots$$
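
To make the setup concrete, here is a minimal sketch of the model being fitted, assuming the skellam package is installed; the sample size, covariates and coefficient values are made up, purely for illustration:

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
mu1 <- exp(0.5 + 0.8 * x1 - 0.3 * x2)  # log-linear model for the first Poisson mean
mu2 <- exp(0.2 - 0.4 * x1 + 0.6 * x2)  # log-linear model for the second Poisson mean
y <- rpois(n, mu1) - rpois(n, mu2)     # a Skellam response: difference of two independent Poissons
library(skellam)
skellam.reg(y, cbind(x1, x2))          # MLE estimates of the two sets of betas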

In the code, there are three lines that find the MLE estimates:

mod <- stats::nlm(skelreg, stats::rnorm(2 * p), iterlim = 5000)
mod <- stats::nlm(skelreg, mod$estimate, iterlim = 5000)
mod <- stats::optim(mod$estimate, skelreg, hessian = TRUE, 
                        control = list(maxit = 5000))

where the objective function skelreg() computes the negative log-likelihood (nlm and optim are minimisers, and the function returns loglik = -mod$value at the end). So it looks like parameter estimation happens in three steps:

  1. Draw the starting parameters from a standard normal distribution and find the best estimates using the non-linear optimiser nlm, with an iteration limit of 5000.
  2. Reuse the estimates from the previous step as starting values for a second nlm call, again with an iteration limit of 5000.
  3. Take the estimates from the second step and run them through R's general-purpose optim function, this time requesting the Hessian.

I don't understand why the optimisation is broken down into these three steps when, conceivably, a single call to nlm with hessian = TRUE and iterlim = 15000 could do the same job.
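
To make the comparison concrete, here is a toy sketch of what I mean, using a made-up smooth objective rather than the Skellam likelihood; the chained calls mirror the structure of skellam.reg, and the last nlm call is the single-call alternative:

f <- function(p) sum((p - c(1, -2))^2) + 0.1 * sum(p^4)  # made-up smooth objective

# three-step chain, mirroring skellam.reg
m1 <- stats::nlm(f, stats::rnorm(2), iterlim = 5000)
m2 <- stats::nlm(f, m1$estimate, iterlim = 5000)
m3 <- stats::optim(m2$estimate, f, hessian = TRUE, control = list(maxit = 5000))

# single call, requesting the Hessian up front
m4 <- stats::nlm(f, stats::rnorm(2), hessian = TRUE, iterlim = 15000)

m3$par       # estimates from the chained approach
m4$estimate  # estimates from the single nlm call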

The full code is posted below for reference (in-line comments are mine):

skellam.reg <- function (y, x) 
{
    n <- length(y)
    x <- stats::model.matrix(~., data.frame(x))
    p <- dim(x)[2]
    skelreg <- function(pa) { # returns the negative log-likelihood (to be minimised)
        b1 <- pa[1:p]
        b2 <- pa[-c(1:p)]
        a1 <- x %*% b1
        a2 <- x %*% b2
        lam1 <- exp(a1) # log link: mean recovered via exp()
        lam2 <- exp(a2) # log link: mean recovered via exp()
        a <- 2 * sqrt(lam1 * lam2)
        sum(lam1 + lam2) + 0.5 * sum(y * (a1 - a2)) - sum(log(besselI(a, y)))
    }
    options(warn = -1)
    mod <- stats::nlm(skelreg, stats::rnorm(2 * p), iterlim = 5000)
    mod <- stats::nlm(skelreg, mod$estimate, iterlim = 5000)
    mod <- stats::optim(mod$estimate, skelreg, hessian = TRUE, 
                        control = list(maxit = 5000))
    b1 <- mod$par[1:p]
    b2 <- mod$par[-c(1:p)]
    s <- diag(solve(mod$hessian))
    s1 <- sqrt(s[1:p])
    s2 <- sqrt(s[-c(1:p)])
    param1 <- cbind(b1, s1, b1/s1, stats::pchisq((b1/s1)^2, 1, 
                                                 lower.tail = FALSE))
    param2 <- cbind(b2, s2, b2/s2, stats::pchisq((b2/s2)^2, 1, 
                                                 lower.tail = FALSE))
    rownames(param1) <- rownames(param2) <- colnames(x)
    colnames(param1) <- colnames(param2) <- c("Estimate", "Std. Error", 
                                              "Wald value", "p-value")
    list(loglik = -mod$value, param1 = param1, param2 = param2)
}
Alex
    The author of that function gives their email address in the documentation; I'd be inclined to ask them about their reasoning. – Glen_b Feb 27 '17 at 06:56

1 Answer


The reason why I did that was for convergence purposes. I have noticed that sometimes one nlm call is not enough: using its estimates as the starting values of a second nlm call leads to further improvement. As for the optim call, optim is more robust and I prefer it, especially for the calculation of the Hessian matrix.
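
Here is a rough toy sketch of the multi-start point (a made-up multimodal objective, not the Skellam likelihood): nlm started at random places can stop at different local minima.

set.seed(2)
bumpy <- function(p) sum(p^2) + 5 * sum(sin(3 * p))  # made-up multimodal objective
starts <- replicate(5, rnorm(2, sd = 3), simplify = FALSE)
sapply(starts, function(s) nlm(bumpy, s, iterlim = 5000)$minimum)
# typically prints several distinct local minima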

Michail
  • Thanks for your comment! Could you go into more details (what didn't work)? It seems strange that the following lines: `mod – Tim Feb 27 '17 at 14:27
  • This answer does not make any sense to me. Can you clarify exactly what is going on here? – mdewey Feb 27 '17 at 14:32
  • It seems strange, I know. nlm might need, say, 300 iterations, so why bother with 5000? For the difficult cases, with many observations and/or many covariates. You can try this yourself with a difficult-to-optimise function: if you run nlm multiple times, starting at random places each time, you get different optima. To avoid this, I do the trick shown above. – Michail Feb 27 '17 at 15:38
  • Thanks, it looks like some more testing with hypothetical data is needed to identify areas where `optim` has an advantage over `nlm`. – Alex Feb 27 '17 at 22:26
  • @Michail I understand what you are saying, but you are starting `nlm` in exactly the same place where the previous run finished, so there is absolutely no reason why it could change anything. Notice that even if some kind of randomisation were done during optimisation, the pseudo-random number generator would have exactly the same seed that follows the one used at the last iteration of the optimisation algorithm in the previous run. So it is literally *impossible* that calling `nlm` twice like this could change anything. – Tim Feb 28 '17 at 09:50
  • Ok, from testing with my real data I can definitely see that the call to `optim` makes a huge difference to the parameters found. This makes sense, as the default call to `optim` uses Nelder-Mead, in contrast to `nlm`, which, as a Newton-type algorithm, relies on approximating derivatives. – Alex Mar 01 '17 at 00:38
  • @Tim I can send you a few examples to see if you are interested. Do you agree that multiple runs of nlm can lead to different outcomes? – Michail Mar 01 '17 at 12:15
  • @Michail as I said, if you use it as in the code above, it simply cannot lead to different results than if a single call were used. You use a deterministic optimisation algorithm that starts again from the output of its previous run. You basically "paused" it in the middle of the computation. – Tim Mar 01 '17 at 12:48
  • @Tim I see your point. I will run some more examples and see. – Michail Mar 02 '17 at 17:27