Choosing the right forecast model for exponential data (COVID19) forecast package R

Question

I am trying to forecast aggregated daily COVID cases in Europe. These are present day numbers in Italy.

temp <- c(0    , 0    , 0  ,   0  ,   0   ,  0   ,  0  ,   0 ,    0,     2,
 2    , 2   ,  2 ,    2   ,  2 ,    2   ,  3  ,   3 ,    3,     3,
 3  ,   3  ,   3   ,  3  ,   3   ,  3,     3 ,    3   ,  3  ,   3,
20   , 62  , 155 ,  229 , 322  , 453   ,655  , 888,  1128 , 1694,2036 , 
2502 ,3089 , 3858,  4636 , 5883 , 7375,  9172, 10149, 12462,12462)

My problem is that all the models underestimates the exponential growth patterns as this one with exponential smoothing. (if I try to predict using data until 4636 value, the different models estimates 8-9,0000 when the real number was 12,462). I have tried transformations, different models etc.

library(data.table)
library(tidyverse)
library(forecast)
library(lubridate)

COVfirst <- min(which(temp > 0))+22 #starts 22 day in january


temp2 <- ts(temp, start = c(2020, 22), 
            frequency = 365.25)

temp2 %>% autoplot

test <- ets(temp2,
            allow.multiplicative.trend =TRUE)


test %>% forecast(., h = 14) %>% autoplot()


ts_Italy_confirmed <- temp2
forecast_italy_Confirmed <- test %>% forecast(., h = 14)

I a little confounded by this, because the development until present day is actually pretty straight forward (exponential). I don't like fitting a exponential regression model as this will not catch up when the exponential part of the epidemic stops. (I think)

Your last entry for Italy is wrong, it went over 15,000 yesterday. If you’re getting the data from Johns Hopkins, you should note that it didn’t update the data correctly yesterday with several countries listing the same number of cases as the day before. — Mooks, Mar 13 '20 at 08:36
Have double checked. For Italy the correct value for the 12th of March 2020 = 15,113 — Mooks, Mar 13 '20 at 09:06
When you look at curves for previous infections, which happen every year in winter (and which give you information for your predictions) then you see that the seasonal pattern shows that the exponential growth is only accurate for a short period of time for the onset of the outbreak, but should *not* be used for the entire evolution of the outbreak. **Do not use an exponential model. It is wrong/limited** — Sextus Empiricus, Mar 15 '20 at 14:29
Thank you Sextus. In this case I presume it will run for longer than the typical seasonal, but I am very much open for suggestions for improvements!? — Jakn09ab, Mar 15 '20 at 19:15
@Jakn09ab the transfer rate of a virus infection is a function of time and should *not* be considered continuous. The question is what sort of function this is. This virus is new and it seems impossible to accurately predict the development, but you can use information about similar diseases (other infections of the lungs, of which there are many and covid-19 is just the tip of the iceberg) and use that as input for predictions.... — Sextus Empiricus, Mar 16 '20 at 11:05
...You do have a good point that this particular virus may last a bit longer than other already known viruses. Indeed there are *two* effects that suppress the spread of a virus. I just spoke about the seasonality of the virus and the non-homogeneous infection rate. The other way how virus spread reduces in time is due to the reduction in the amount of people among which the virus can spread and this goes faster when a lot of people have alrevdy immunity against tis virus, but that is not the case with covid-19.... — Sextus Empiricus, Mar 16 '20 at 11:07
.... although I would argue that a large part of the population might be not/less susceptible to the virus and it would be wrong to consider a model that lets the virus spread among the entire population. For insttance, fitting with a SIR model where $N$ is the *entire* population ( https://stats.stackexchange.com/questions/448320/coronavirus-growth-rate-and-its-possibly-spurious-resemblance-to-vapor-pressure ) instead of making $N$ a fitting parameter, is overestmating the capacity of the spread of the disease. — Sextus Empiricus, Mar 16 '20 at 11:13
@SextusEmpiricus wouldn’t a logistic curve make reasonable sense here? — Mooks, Mar 17 '20 at 15:11
@Mooks a data-driven (interpopation not extrapolation) and theory-based (ideally first principles based) curve would make sense. Using a logistic curve, just because it somewhat resembles the growth elsewhere... I see no reason why. It will just create bias in a different way. *Because of *what* reason should we believe that it needs to be a logistic curve? If there is no reason then a logistic curve adds no value. — Sextus Empiricus, Mar 17 '20 at 15:35
@SextusEmpiricus I was just thinking because a logistic curve has been used as a reasonable approximation to various rate limited growth processes - perhaps it’s been used in epidemiology but I imagine they’re doing more sophisticated things generally. I’m not saying it’s exactly the right model, but as a first pass approximation it seems reasonable and with simple physical interpretation(a). Certainly it seems more reasonable than modelling exponential functions, which we know must be more incorrect than a logistic curve over the mid-long run. — Mooks, Mar 18 '20 at 11:39
From what I have been able to gather from publications regarding predictions i is possible to predict 1 maybe two weeks max of these spreads. This recent publication from Imperial College also models the spread in UK. https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf Scary stuff. — Jakn09ab, Mar 18 '20 at 20:25
I did something similar for Germany. For more information on my model, click the link "background information" at the top left of the page: https://peter.pollak-diener.de/covid19-forecast/ Sourcecode is pure javascript and open source. Link is in description. You can copy and adapt this model for other countries. — Peter, Mar 30 '20 at 20:20
@Mooks [here](https://www.tue.nl/en/news/news-overview/11-03-2020-eindhoven-data-scientists-take-on-corona-data-to-predict-growth-of-new-infections/) they used logistic curves as a fit. But the predictions are only made 3 days forward in time. Such fits can be made with many other curves as well. — Sextus Empiricus, Mar 31 '20 at 14:11
@SextusEmpiricus ok thanks. Just as a slight aside and maybe you’ve already seen it. But in the intervening time I stumbled across this prediction model from Iceland (but the Shiny app has other countries). Thought you might be interested if you hadn’t seen it: [here](https://rpubs.com/bgautijonsson/HierarchicalLogisticGrowthCurves). — Mooks, Mar 31 '20 at 16:55

Stephan Kolassa · Accepted Answer · 2020-03-31T07:05:45.530

12

You can force ets() to use a model with multiplicative trend (and multiplicative error) by using the parameter model="MMN". Of course, you need to start the series later, since multiplicative trends and errors don't make sense for zero values.

temp3 <- ts(temp[-(1:9)], start = c(2020, 32), 
            frequency = 365.25)
test <- ets(temp3,model="MMN")
test %>% forecast(., h = 14) %>% autoplot()

I certainly hope this graphic is what you wanted.

It also illustrates why ets() is very careful about fitting multiplicative trends on its own. They can and will explode. Also:

I don't like fitting a exponential regression model as this will not catch up when the exponential part of the epidemic stops.

Of course, ets() will not know when to stop extrapolating the exponential growth, so this (extremely correct) rationale applies equally to ets(). You may want to consider models that are explicitly tailored towards epidemiology or (market) penetration, like the Bass diffusion model or similar.

EDIT: Rob Hyndman explains in more depth why smoothing and similar models do not make a lot of sense to forecast COVID-19, and gives pointers to more appropriate models. And here is Ivan Svetunkov.

edited Mar 31 '20 at 07:05

answered Mar 13 '20 at 08:07

Stephan Kolassa

95,027
13
197
357

I can see how that would work. I did actually remove the zero values initially but put them back in because it made it difficult to put the forecasts of individual countries back together. Or maybe on second thought, that is a wrong assumption? - the forecast will be of the same length.. – Jakn09ab Mar 13 '20 at 08:18
would you know if this is implemented in the forecast package? I can't find it. – Jakn09ab Mar 13 '20 at 08:26
2

It looks like there are no diffusion models in the `forecast` package, nor in the newer `fable` package (if you are working in the tidyverse, the latter might fit better to your workflow). There is a `diffusion` package, which looks like it might be helpful, but I don't know it. – Stephan Kolassa Mar 13 '20 at 08:39
1

@Jakn09ab If you write a function that takes a country's data, linearly regresses the logarithms of counts once they become nonzero, and gets the fit's parameters (which will be sensitive to the date when nonzero counts start), you could get together all the nations' exponential models and plot them on common axes. You needn't even use this linearization technique, if you have access to a preferred algorithm that can return the parameters themselves, rather than just producing an image. – J.G. Mar 13 '20 at 21:08
Might of interest these two souces: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3551626&fbclid=IwAR1iKf4UsThjU8XrCwMHngjWwOVN2CIXtVWL1pQsZQ0eqJnhKpQ_wBCpPRw; https://github.com/vlavorini/covid-19 – Jonas Striaukas Mar 18 '20 at 21:08
+1, especially for the edit. I'd write the edit in bold. – Tim Mar 30 '20 at 20:30
@Tim: your wish is my day planner. – Stephan Kolassa Mar 31 '20 at 07:06

score 1 · Answer 2 · answered Mar 18 '20 at 20:31

I suggest using a binary logistic regression model. Calculate p as the proportion of the population infected, p = c / N, l as the link function, for example, l = ln(p/(1-p)). Then use ordinary least squares, again, for example, to find l_hat = f(t). Next, use the reverse link function p_hat = exp(l_hat)/(1+exp(l_hat)). Then convert the estimated proportion, p_hat, into a case count, c_hat = p_hat * N.

At each step, there are other choices you could make. A different link function or a different regression method come to mind.

You could evaluate the quality of your estimate graphically by comparing the number of cases and the estimate, the proportion and the estimate, the logit and its estimate (or other link function).

Good luck and stay safe.

John

Choosing the right forecast model for exponential data (COVID19) forecast package R

2 Answers2