It is too early to fit models for $R_0$ (with only data about the number of infected people).
The larger problem is that the development is still in the exponential phase, and we cannot model the decrease of the exponential growth rate. See also this question: Fitting SIR model with 2019-nCoV data doesn't converge. In that framework (the SIR model) the only quantity that can be accurately determined is $\beta - \gamma$, but $R_0 = \beta/\gamma$ is still difficult to determine (it does seem that $\beta-\gamma \approx 0.4 > 0$, so we at least know that $R_0>1$).
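To illustrate that identifiability problem, here is a minimal sketch (assuming the deSolve package and hypothetical parameter values of my own choosing): two SIR parameter sets with the same $\beta - \gamma$ but very different $R_0$ give nearly the same curve during the early exponential phase.

    library(deSolve)

    ## basic SIR equations (a minimal single-population sketch)
    sir <- function(t, y, parms) {
      with(as.list(c(y, parms)), {
        dS <- -beta * S * I / N
        dI <-  beta * S * I / N - gamma * I
        list(c(dS, dI))
      })
    }

    N  <- 1.4e9                     # hypothetical population size
    y0 <- c(S = N - 50, I = 50)
    times <- 0:11

    ## same beta - gamma = 0.4, very different R0 = beta/gamma
    out1 <- ode(y0, times, sir, parms = c(beta = 0.5, gamma = 0.1, N = N))  # R0 = 5
    out2 <- ode(y0, times, sir, parms = c(beta = 1.2, gamma = 0.8, N = N))  # R0 = 1.5

    ## the infected curves are nearly indistinguishable in the early phase
    cbind(I_R0_5 = out1[, "I"], I_R0_1.5 = out2[, "I"])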
See also the comment at the end of that answer about the use of only a time series of the number of infected people. That number alone does not provide a good measure of $R_0$, so auxiliary information is being used as well. For instance, there are direct estimates of $\gamma$, the parameter relating to how long people remain infectious (obtained by observing how the virus has spread, rather than just using the case counts).
Regarding your problem/question
Binomial distribution
One might model the reported cases as a binomially distributed variable. The observed/reported cases will be some fraction $p$ of the underlying real cases (in more advanced models this fraction could vary with time and with the number of cases). The current data do not allow us to study this, but potentially one could fill in the gaps based on studies of a wider set of epidemiological data.
Given such a situation (a binomially distributed fraction), one would expect the variance to be related to the mean by a factor between 0 and 1 (the binomial distribution has $\mu = np$ and $\text{var} = npq$). However, when we fit an exponential curve to the data (see the code below), the RSS currently comes out roughly a factor 4 larger than the sum of the means. So there is possibly overdispersion rather than underdispersion (although this is not yet an accurate measure*).
*With so little data we cannot be sure. In addition, the model is not so great (as you say, the approaches use a simple Gaussian likelihood). There may also be some correlation in the errors, and I imagine that this is a potential source of overdispersion.
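As a small numeric illustration of that variance-to-mean expectation (with hypothetical numbers for the true case count and the reporting probability):

    ## sketch with hypothetical numbers: if n true cases are each reported with
    ## probability p, the reported count is Binomial(n, p) and its variance is
    ## smaller than its mean by the factor q = 1 - p
    n <- 1000; p <- 0.2
    c(mean = n * p, variance = n * p * (1 - p), ratio = 1 - p)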

    ## data
    Infected <- c(45, 62, 121, 198, 291, 440, 571, 830, 1287, 1975, 2744, 4515)
    day <- 0:(length(Infected)-1)

    ## exponential model
    mod <- nls(Infected ~ a*exp(b*day),
               start = list(a = Infected[1],
                            b = log(Infected[2]/Infected[1])))
    mod

    plot(day, Infected, log = "y")
    lines(day, predict(mod))
    title("exponential fit")

    ## residuals versus mean
    plot(predict(mod), abs(Infected-predict(mod)), log = "xy",
         xlim = c(10^1, 10^5), ylim = c(1, 10^4),
         xlab = "predicted",
         ylab = "abs(observed-predicted)")
    rr <- 10^seq(-1, 5, 0.1)

    ## comparing residuals with the square root of the mean
    lines(rr, sqrt(rr), lty = 2)    # +1 SD for variance equal to the mean
    lines(rr, 2*sqrt(rr), lty = 2)  # +2 SD
    text(10^4.5, 10^2.25, "+1SD", srt = 30, pos = 1)
    text(10^4.5, 2*10^2.25, "+2SD", srt = 30, pos = 3)
    title("comparing residuals (obs - pred) \n with square root of mean", cex.main = 1)

    sum(Infected)
    sum((Infected-predict(mod))^2)
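From those two sums one can read off the rough dispersion factor mentioned above (roughly a factor 4 for these data); a direct computation against the fitted means:

    ## rough dispersion factor: RSS divided by the sum of the fitted means
    sum((Infected - predict(mod))^2) / sum(predict(mod))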
Poisson distribution for increments
There is a new/recent R package available on GitHub that aims to model this specific epidemic:
https://rdrr.io/github/chrism0dwk/wuhan/
This package models the epidemic with an ODE based on a SEIR model and uses data separated by individual regions (combining this with the population size of those regions and the transport between them).
The predictions from the model are compared with the observations by looking at the increments and assuming that these increments are Poisson distributed.
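As a rough sketch of that idea (using the simple exponential fit from above rather than the package's SEIR model, so this is only an illustration of the principle, not the package's actual code), the daily increments can be scored with a Poisson likelihood:

    ## sketch: Poisson log-likelihood of the observed daily increments, given
    ## the increments predicted by the exponential fit above
    obs_incr  <- diff(Infected)
    pred_incr <- diff(predict(mod))
    sum(dpois(obs_incr, lambda = pred_incr, log = TRUE))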
Is there enough data?
The exponential model in the above graph does not fit the data very accurately (although I would say the exponential fit is sufficient; still, one might desire a more complex fit to learn more). One could make the fit more complex; possibly the factors that determine the dynamics are not constant in time.
But one may wonder what could be achieved by making the model more complex. The data show a reasonable exponential curve, and I do not believe that we should try to squeeze more out of the present data.
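For illustration only (this is an assumption on my part, not something proposed in the question), one slightly more complex fit would be a quasi-Poisson GLM with a quadratic term in time, which allows the growth rate to change and also reports an overdispersion estimate; with 12 points the extra parameter is barely informative.

    ## sketch: count regression with a time-varying growth rate; the dispersion
    ## parameter of the quasi-Poisson family gives a rough overdispersion estimate
    mod2 <- glm(Infected ~ day + I(day^2), family = quasipoisson)
    summary(mod2)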
Potentially, one could incorporate more background information, and then some sort of simulations**, and a more Bayesian approach, could be interesting in order to project into the future what we might see (but this will lean mostly on the prior information; those 12 data points do not provide a lot of information).
**Those simulations would also help to handle the correlations mentioned in the previous footnote.
An interesting view of the data might be the distribution of the increases in infections from day to day. This increase is not constant in time; we might better regard it as a variable with a random distribution of its own (and model variation in growth curves based on this random variation in the growth rate). A model to use is geometric Brownian motion, https://en.wikipedia.org/wiki/Geometric_Brownian_motion (the likelihood computation by Chris Jewell also treats the evolution as a random walk by evaluating the observed step sizes/increments).

    ## daily relative increase: each day's count divided by the previous day's
    plot(day[-1], Infected[-1]/Infected[-length(Infected)],
         xlab = "day", ylab = "Infected/Infected previous day")
    title("daily relative increase in infections", cex.main = 1)

    hist(100*(Infected[-1]/Infected[-length(Infected)]) - 100,
         xlab = "% increase infections", cex.main = 1,
         main = "histogram daily relative increase")
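In the spirit of that geometric Brownian motion view, a minimal sketch of estimating the drift and volatility of the growth from the daily log-increments (just descriptive statistics, not a full fit):

    ## sketch: drift and volatility of the growth rate under a geometric
    ## Brownian motion view, estimated from the daily log-increments
    log_growth <- diff(log(Infected))
    c(drift = mean(log_growth), volatility = sd(log_growth))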