Differences between approaches to exponential regression

Question

One could fit an exponential in many different ways. This post suggests doing the down-and-dirty lm on the log of the response variable. This SO post suggests using nls which requires a starting estimate. This SO post suggests glm with a gamma/log link function. Here, the illustrious @Glen-b explains some potential differences between approaches.

What are the pros/cons and domains of applicability for these different approaches? Do these methods differ in how well or in what way they calculate confidence intervals?

Like all the other data scientists at home right now, I'm messing around with Covid 19 data.

One thing in particular I noticed is that I can do lm with log, log10, log2 etc., but would have to convert from natural log with glm.

last_14 = data.frame(rbind(
c(3460,  14,    0),
c(3558,  17,    1),
c(3802,  21,    2),
c(3988,  22,    3),
c(4262,  28,    4),
c(4615,  36,    5),
c(4720,  40,    6),
c(5404,  47,    7),
c(5819,  54,    8),
c(6440,  63,    9),
c(7126,  85,   10),
c(7905, 108,   11),
c(8733, 118,   12),
c(9867, 200,   13)))
names(last_14) = c('World', 'US', 'days')

lm(log(World) ~ days, last_14)
#> 
#> Call:
#> lm(formula = log(World) ~ days, data = last_14)
#> 
#> Coefficients:
#> (Intercept)         days  
#>     8.06128      0.08142

glm(formula = World ~ days,  data=last_14, family=gaussian(link='log'))
#> 
#> Call:  glm(formula = World ~ days, family = gaussian(link = "log"), 
#>     data = last_14)
#> 
#> Coefficients:
#> (Intercept)         days  
#>     8.00911      0.08819  
#> 
#> Degrees of Freedom: 13 Total (i.e. Null);  12 Residual
#> Null Deviance:       54450000 
#> Residual Deviance: 816200    AIC: 199.4

nls(World ~ exp(a + b*days), last_14, start=list(a=5, b=0.03))
#> Nonlinear regression model
#>   model: World ~ exp(a + b * days)
#>    data: last_14
#>       a       b 
#> 8.00911 0.08819 
#>  residual sum-of-squares: 816246
#> 
#> Number of iterations to convergence: 8 
#> Achieved convergence tolerance: 1.25e-06

^{Created on 2020-03-20 by the reprex package (v0.3.0)}

I am not sure about the answer to the question (it is *very* broad). But regarding your little problem with `nls` you can try using the formula $$a \cdot \text{exp} (b \cdot \text{days})$$ instead of $$ \text{exp} (b \cdot \text{days})$$ For instance, this code will work: `nls(World ~ a*exp(b*days), last_14, start=list(a=100000, b=0.3))` — Sextus Empiricus, Mar 20 '20 at 21:16

Demetri Pananos · Accepted Answer · 2020-03-21T16:32:12.033

One of the differences is the likelihoods for each model. In case readers can't remember, the likelihood encapsulates assumptions about the conditional distribution of the data. In the case of COVID-19, this would be the distribution of infections (or reported new cases, or deaths, etc) on the given day. Whatever we want the outcome to be, let's call it $y$. Thus, the conditional distribution (e.g. the number of new cases today) would be $y\vert t$ (think of this as $y$ conditioned on $t$).

In the case of taking the log and then performing lm, this would mean that $\log(y)\vert t \sim \mathcal{N}(\mu(x), \sigma^2) $. Equivalently, that $y$ is lognormal given $t$. The reason we do linear regression on $\log(y)$ is because on the log scale, the conditional mean is independent of the variance, where as the mean of the log normal is also a function of the variance. So Pro: we know how to do linear regression, but Con This approach makes linear regression assumptions on the log scale which can always be assessed but might be hard to theoretically justify? Another con is that people do not realize that predicting on the log scale and then taking the exponential actually biases predictions by a factor if $\exp(\sigma^2/2)$ if I recall correctly. So when you make predictions from a log normal model, you need to account for this.
So far as I understand, nls assumes a Gaussian likelihood as well, so in this model $ y \vert t \sim \mathcal{N}(\exp(\beta_0 + \beta t), \sigma^2)$. Except now, we let the conditional mean of the outcome be non-linear. This can be a pain because no confidence intervals are not bounded below by 0, so your model might estimate a negative count of infections. Obviously, that can't happen. When the count of infections (or whatever) is larger, then a Gaussian can justifiable. But when things are just starting, then this probably isn't the best likelihood. Furthermore, if you fit your data using nls, you'll see that it fits later data very well but not early data. That is because misfitting later data incurrs larger loss and the goal of nls is to minimize this loss.
The approach with glm frees is a little and allows us to control the conditional distribution as well as the form of the conditional mean through the link function. In this model, $y \vert t \sim \text{Gamma}(\mu(x), \phi)$ with $\mu(x) = g^{-1}(\beta_0 + \beta_1)$. We call $g$ the link, and for the case of log link $\mu(x) = \exp(\beta_0 + \beta_1 t)$. Pro These models are much more expressive, but I think the power comes from the ability to perform inference with a likelihood which is not normal. This lifts a lot of the restrictions, for example symmetric confidence intervals. The Con is that you need a little more theory to understand what is going on.

Great job! If you (or anyone else) has more details to add, I think many would benefit. — abalter, Mar 20 '20 at 21:34
Hey, why is `glm` giving me different regression coefficients than `lm` with log? — abalter, Mar 20 '20 at 21:57
@abalter Aha! You've fallen trap to the con I've listed! The coefficients are on the scale of the link, not on the scale of the data. You're going to have to apply the inverse link to the coefficients to get their effect on the scale of the data. See log odds ratios in logistic regression, for example. — Demetri Pananos, Mar 20 '20 at 21:59
@abalter you have been using glm with a gamma distribution of the errors. When you use Normal distributed errors (with log link) then you get the same as using lm with log. — Sextus Empiricus, Mar 20 '20 at 22:02
@SextusEmpiricus -- that's fascinating. They come out so different! And, of course, this is actually count data, so far without replacement (no reinfects as yet). That should provide additional insight into the proper model. — abalter, Mar 20 '20 at 22:04
@actually I made a mistake in my last comment. When you use the lm with log then you take the logarithm of the outcome variable, and thus also transform the errors. When you do glm with gaussian and log link, then you *only* transform the mean, and not the errors. So glm with gaussian and log link turns out the same as nls. — Sextus Empiricus, Mar 20 '20 at 22:06
So `glm(US ~ days, family = gaussian(link = "log"))` gives the same result as `nls(US ~ a*exp(b*days), start = list(a=100,b=0.3))` except that the coefficient is transformed, but they have the same value when converted back. — Sextus Empiricus, Mar 20 '20 at 22:09
@SextusEmpiricus I have confirmed that, and actually updated my reprex. HOWEVER, I still don't get why `lm` with log gives me completely different coefficients. How do I convert between the two? — abalter, Mar 20 '20 at 22:30
DOH!!!! I had "US" in the `lm` and "World" in the others!!! Now they are in the same ballpark. BUT your explanation IS super helpful! — abalter, Mar 20 '20 at 22:40

Sextus Empiricus · Answer 2 · 2020-05-13T11:29:32.347

A known difference between fitting an exponential curve with a nonlinear fitting or with a linearized fitting is the difference in the relevance of the error/residuals of different points.

You can notice this in the plot below.

In that plot you can see that

the linearized fit (the broken line) is fitting more precisely the points with small values (see the plot on the right where the broken line is closer to the values in the beginning).

the non linear fit is closer to the points with high values.

modnls <- nls(US ~ a*exp(b*days), start=list(a=100, b=0.3))
modlm <- lm(log(US) ~ days )
plot(days,US, ylim = c(1,15000))
lines(days,predict(modnls))
lines(days,exp(predict(modlm)), lty=2)
title("linear scale", cex.main=1)
legend(0,15000,c("lm","nls"),lty=c(2,1))

plot(days,US, log = "y", ylim = c(100,15000))
lines(days,predict(modnls))
lines(days,exp(predict(modlm)), lty=2)
title("log scale", cex.main=1)

Getting the random noise modeled correctly is not always right in practice

In practice the problem is not so often what sort of model to use for the random noise (whether it should be some sort of glm or not).

The problem is much more that the exponential model (the deterministic part) is not correct, and the choice of fitting a linearized model or not is a choice in the strength between the first points versus fitting the last points. The linearized model fits very well the values at a small size and the non-linear model fits better the values with high values.

You can see the incorrectness of the exponential model when we plot the ratio of increase.

When we plot the ratio of the increase, for the world variable, as function of time, then you can see that it is a non-constant variable (and for this period it appears to be increasing). You can make the same plot for the US but it is very noisy, that is because the numbers are still small and differentiating a noisy curve makes the noise:signal ratio larger.

(also note that the error terms will be incremental and if you really wish to do it right then you should use some arima type of model for the error, or use some other way to make the error terms correlated)

I still don't get why lm with log gives me completely different coefficients. How do I convert between the two?

The glm and nls model the errors both as $$y−y_{model}∼N(0,\sigma^2)$$ The linearized model models the errors as $$log(y)−log(y_{model})∼N(0,\sigma^2)$$ but when you take the logarithm of values then you change the relative size. The difference between 1000.1 and 1000 and 1.1 and 1 is both 0.1. But on a log scale it is not the same difference anymore.

This is actually how the glm does the fitting. It uses a linear model, but with transformed weigths for the errors (and it iterates this a few times). See the following two which return the same result:

last_14 <- list(days <- 0:13,
                World <- c(101784,105821,109795, 113561,118592,125865,128343,145193,156094,167446,181527,197142,214910,242708),
                US <- c(262,402,518,583,959,1281,1663,2179,2727,3499,4632,6421,7783,13677))
days <- last_14[[1]]
US<- last_14[[3]]
World <- last_14[[2]]


Y <- log(US)
X <- cbind(rep(1,14),days)
coef <- lm.fit(x=X, y=Y)$coefficients
yp <- exp(X %*% coef)
for (i in 1:100) {
  # itterating with different
      # weights
      w <- as.numeric(yp^2)          
      # y-values
      Y <- log(US) + (US-yp)/yp
  # solve weighted linear equation
  coef <- solve(crossprod(X,w*X), crossprod(X,w*Y))
  # If am using lm.fit then for some reason you get something different then direct matrix solution
  # lm.wfit(x=X, y=Y, w=w)$coefficients
  yp <- exp(X %*% coef)
}
coef
# > coef
#           [,1]
#      5.2028935
# days 0.3267964

glm(US ~days,  
    family = gaussian(link = "log"), 
    control = list(epsilon = 10^-20, maxit = 100))

# > glm(US ~days,  
# +     family = gaussian(link = "log"), 
# +     control = list(epsilon = 10^-20, maxit = 100))
#
# Call:  glm(formula = US ~ days, family = gaussian(link = "log"), control = list(epsilon = 10^-20, 
#    maxit = 100))
#
# Coefficients:
# (Intercept)         days  
#      5.2029       0.3268  
#
# Degrees of Freedom: 13 Total (i.e. Null);  12 Residual
# Null Deviance:        185900000 
# Residual Deviance: 3533000    AIC: 219.9

Conditional mean for the log linear model is known to be off by a factor proportional to exp(RMSE). See [here](https://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/). Would you mind correcting the predictions for that model and replacing the plots? — Demetri Pananos, Mar 20 '20 at 21:56
@DemetriPananos I do not understand what you mean. I just plotted the results from two different least squares fits, the one linearized and the other not, — Sextus Empiricus, Mar 20 '20 at 22:00
`exp(predict(modlm)` is not the conditional mean on the natural scale for this model. As is argued in that blog post, you need to multiply by a factor that looks like `exp(RMSE^2/2)`. — Demetri Pananos, Mar 20 '20 at 22:03
@DemetriPananos but lm is predicting the conditional mean on the log scale. I am only transforming this prediction on the log scale. I am not sure what you mean that the scaling with that factor is supposed to do. — Sextus Empiricus, Mar 20 '20 at 22:11
Are you not transforming the predictions to the original scale by doing `exp(predict(modlm))`. The relevant part of the link I've provided is "We, however, have no real interest in E(ln(yj)). We fit this log regression as a way of obtaining estimates of our real model, namely yj = exp(b0 + Xjb + εj) So rather than taking the expectation of ln(yj), lets take the expectation of yj. " — Demetri Pananos, Mar 20 '20 at 22:13
@DemetriPananos yes I transform the predictions to a different scale. — Sextus Empiricus, Mar 20 '20 at 22:14
Continued from my last comment ... "People who fit log regressions know about this — or should — and know that to obtain predicted yj values, they must Obtain predicted values for ln(yj) = b0 + Xjb. Exponentiate the predicted log values. Multiply those exponentiated values by exp(σ2/2), where σ2 is the square of the root-mean-square-error (RMSE) of the regression. " — Demetri Pananos, Mar 20 '20 at 22:15
Normally when I am fitting an exponential curve, then I just want to know the parameters a and b of the function for y = a * exp(b* t) I do not see what you are suggesting to change to the estimates of a and b. It sounds like the idea that the mean of a lognormal distribution is not the log of a related normal distribution but differs with a term $\sigma$. I doubt that this works as well when you do a fit of an exponential curve. Estimating the mean of a lognormal distribution is not the same as estimating the conditional mean of a conditional lognormal distribution. — Sextus Empiricus, Mar 20 '20 at 22:17
They don't change the parameters. You plot predictions to compare between models, it would be better to see what the results from the process I describe are, since those are the conditional mean of the y on the original scale (which is what I think we're after here, predictions, not just estimates). — Demetri Pananos, Mar 20 '20 at 22:19
I am just explaining that the different situations tend to fit different parts of the curve. The one fits better the lower values and the other fits better the higher values. If one chooses to use a fit of linearized curves, then this has reasons which might invalidate the correction with the $\sigma^2/2$ term (for instance because the error is not homogenous on the linear scale). It is also an open question whether one should wish to predict the mean or maybe, instead, some other value (e.g. the median). — Sextus Empiricus, Mar 20 '20 at 22:27
just found your great answer. I got confused by the same question for a long time. thank you for writing for the details of how glm work — Haitao Du, May 13 '20 at 10:10
I am still confused, would you help me to write down [the objective function for these 3 cases](https://stats.stackexchange.com/questions/466212/what-is-the-objective-function-to-optimize-in-lm-glm-with-gaussian-and-poisson)? — Haitao Du, May 13 '20 at 10:56

Differences between approaches to exponential regression

2 Answers2

Getting the random noise modeled correctly is not always right in practice

Linked