
I am trying to predict the total sum of donations that Monica will receive on https://www.gofundme.com/f/stop-stack-overflow-from-defaming-its-users/

I copied the data and summed the donations for each day. This results in the following data, plot, and analysis:

# data 
# note that the date values are day since beginning of crowd funding
# the value 6085 is the oldest (day 0) and the value 180 is the most recent (day 28)

m <- c(6085,3207,885,1279,1483,75,421,335,1176,504,430,110,36,299,314,215,417,1712,2141,35,235,80,330,70,70,105,65,15,180)
d <- c(0:28)


# plotting
plot(d,m, log = "y",
     xlab = "day", ylab = "$",
     main="daily donation money")

# adding model line
mod <- glm(m ~ d, family = quasipoisson(link='log'))
ds <- seq(0,28,0.1)
lines(ds,exp(coef(mod)[1]
            +coef(mod)[2]*ds))

# integral for fitted line
exp(coef(mod)[1])/-coef(mod)[2]

[plot: daily donation money]

When I integrate the fitted line out to infinity, I get roughly ~21650 dollars as the total sum of money that will be donated.
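Explicitly, the last line of the code above computes

$$\int_0^\infty \exp(\hat{\beta}_0 + \hat{\beta}_1 t)\,dt = \frac{\exp(\hat{\beta}_0)}{-\hat{\beta}_1} \approx 21650.$$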

My questions are:

  1. How can I express the accuracy/variance of this predicted/forecasted value (based on the idea that the model is true)?
  2. How do I incorporate the knowledge that the current sum of the data $\sum m=22309$ is already larger than the prediction/forecast based on the integral of the fitted line?

    • I imagine I could try to fit the integral directly, which is something like $\text{final sum} \times (1-e^{-ct})$, but I would not know how to treat the errors, which will be correlated. And I still get a small value anyway (with the simple least-squares fit below I get a final sum of 21580):

      # cumulative donations (m and d above are already in chronological order, day 0 first)
      t <- c(0, d + 1)
      ms <- cumsum(c(0, m))
      plot(t, ms, xlab = "day", ylab = "$", main = "cumulative donation money")
      mod2 <- nls(ms ~ tot * (1 - exp(c * t)), start = list(tot = 22000, c = -0.1))
      lines(t, coef(mod2)[1] * (1 - exp(coef(mod2)[2] * t)))
      
  3. How should I handle the inaccuracies of my statistical model? (In reality I have neither a perfect exponential curve nor a (quasi-)Poisson distribution of errors, but I do not know how to describe the process better, nor how to incorporate these model inaccuracies into the error of the prediction/forecast.)


Update:

Regarding questions 1 and 2:

IrishStat commented that

"you might want to accumulate predictions"

So what I did wrong was integrating the estimated values from day 0 to infinity. What I should do instead is sum the estimated values for the days after day 28 and add that to the current observed sum.

So what remains of questions 1 and 2 is how to do this for the specific GLM model. If I sum predictions, then I need to incorporate errors due to the data being random and due to my estimates being random. How can I add these sources of error together? Can I calculate or estimate this with a short formula, or should I estimate the error with a simulation?
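For instance, I imagine a simulation along these lines (a rough sketch only; the one-year horizon and the approximation of the quasi-Poisson day-to-day noise by a negative binomial with variance $\phi \mu$ are assumptions of the sketch, not part of the fitted model):

# sketch: propagate both sources of error by simulation
# (assumes m, d and mod from the code above are still in the workspace)
library(MASS)                                    # for mvrnorm
set.seed(1)
phi    <- summary(mod)$dispersion                # quasi-Poisson dispersion
draws  <- mvrnorm(10000, mu = coef(mod), Sigma = vcov(mod))  # coefficient uncertainty
fut    <- 29:(28 + 365)                          # future days, roughly one year ahead
totals <- apply(draws, 1, function(b) {
  mu <- exp(b[1] + b[2] * fut)                   # expected donations per future day
  y  <- rnbinom(length(mu), mu = mu, size = mu / (phi - 1))  # daily noise with variance phi * mu
  sum(m) + sum(y)                                # observed sum plus simulated tail
})
quantile(totals, c(0.025, 0.5, 0.975))           # rough prediction interval for the total

This draws the coefficients from their approximate normal distribution (estimation error) and then simulates daily donations around the fitted curve (observation error), so both sources of uncertainty end up in the quantiles.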

In addition, question 3 remains. (IrishStat seems to suggest that I should treat it as an ARIMA process, but how do I do this with the log-link function and quasi-Poisson errors?)

In this graph I have colored all Sundays; there seems to be a weekly pattern.

[graph: daily donations with Sundays colored]
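The coloring can be reproduced along these lines (the day-of-week offset below, Sundays at days 5, 12, 19, 26, is an assumption for illustration; the actual offset depends on the campaign's start date):

sundays <- d[d %% 7 == 5]                        # assumed offset: days 5, 12, 19, 26 are Sundays
plot(d, m, log = "y", xlab = "day", ylab = "$",
     main = "daily donation money, Sundays marked")
points(sundays, m[d %in% sundays], pch = 19, col = "red")  # highlight the assumed Sundays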

Sextus Empiricus
  • You definitely don't want to model the cumulative series whereas you might want to accumulate predictions using the observed series https://stats.stackexchange.com/questions/435370/how-can-i-improve-my-arima-forecast-in-r/435651#435651 – IrishStat Nov 27 '19 at 15:21
  • So, I have my model like $\text{donations(t)} = a e^{-bt} + \epsilon$ with errors distributed like (quasi)Poisson. Then I can predict values (including their variance) for all future days and sum them (including the variance). Should I do this with some Monte Carlo model? I have a source of error due to the estimates of $a$ and $b$ and a source of error due to the daily variations, and I see not how I can simply add them. – Sextus Empiricus Nov 27 '19 at 15:39
  • Take the errors from your model ...and resample them to create a family of future values for the next period ...and so on into the future ... make sure that your uncertainty/prediction limits allow for the autoprojective structure . Your model appears to be deterministic rather than auto-projective/adaptive – IrishStat Nov 27 '19 at 15:43
  • That would answer partly the question part 3, but what if the model would be considered to be true? Even then I do not see so easily how to do it. Could there be an analytical solution? – Sextus Empiricus Nov 27 '19 at 15:45
  • not so much ... using what I know ....but there are smarter readers/helpers than I out there... – IrishStat Nov 27 '19 at 15:48
  • you have 29 values ,,, i believe please confirm – IrishStat Nov 27 '19 at 15:57
  • Yes 29 values. I took the time value from 0 to 28 but this can be 1 to 29 if you like. (there are actually already 30 values but this includes donations from today but today is not yet finished). – Sextus Empiricus Nov 27 '19 at 15:58
  • How many periods in the future do you wish to obtain a prediction interval for ...... I will resubmit and generate monte-carlo /simulation prediction intervals which includes an aggregate simulation for the total forecast period. – IrishStat Nov 29 '19 at 22:54
  • @IrishStat I was thinking about a prediction for the *total sum* of donations. For this I was imagining infinity, but to make it physically realistic let's say 1 year. – Sextus Empiricus Nov 30 '19 at 02:24
  • What is your current thinking .. you have been quiet for a few days .... – IrishStat Dec 03 '19 at 01:08
  • @IrishStat I haven't had much time to think about it. But recently I saw this [image](http://www.seismo.ethz.ch/export/sites/sedsite/home/.galleries/img_news_2019/timeline_zoom_web_VS.png) for an earthquake swarm that may resemble the "donation swarm". It made me think that my pretty exponential curve might be too idealistic (although I still wonder how the question part 1 - express the error of prediction, given the model is *true* - can be executed in the best way). Possibly it is very hard to predict with this little data and I would have to get data from other donation projects as well. – Sextus Empiricus Dec 03 '19 at 07:06
  • Please don't treat this as an ARIMA process: that's an implausible model and, without substantial transformations (which I don't recommend), will produce unrealistic results, as IrishStat has demonstrated. – whuber Dec 04 '19 at 05:00
  • what is the url which has the latest donation data ? – IrishStat Dec 04 '19 at 10:00
  • @IrishStat I could not find a link that places the donations in a machine readable format. I just copied all the content from the dynamic webpage ( https://www.gofundme.com/f/stop-stack-overflow-from-defaming-its-users/ ) after clicking the 'See All' button. I would be nice if this could be automated such that data from other donation projects could be read in order to improve the model. The ARIMA is not good because it doesn't model very well the diminishing of the amount and height of donations, but the simple model that I used is neither good because donations are not so smooth. – Sextus Empiricus Dec 04 '19 at 10:18
  • Upon closer inspection the mean for the most recent 10 days was 118 ...this level shift was not tested for due to small sample size of 29 BUT I added that significant variable and the expected sum for the next 149 days is 1416 which is close to the 1217 posted by @REINSTATE MONICA. I have added that pix to my post. – IrishStat Dec 04 '19 at 11:16
  • My steady state forecasts are looking better and better as the dismal forecasts have not borne fruit as donations continue at recent levels – IrishStat Dec 18 '19 at 15:19
  • With my forecasts being more accurate than dour downwards trends .. how do I claim the bounty and get some kudos .... – IrishStat Dec 24 '19 at 17:12
  • Personally, I predict \$25,314. – Glen_b Dec 26 '19 at 06:47

3 Answers


I took your 29 days (oldest to newest) and found that there were 3 unusual days, leading to the following equation [model equation], with the Actual/Fit and Forecast here: [Actual/Fit and Forecast plot]

All models are wrong ... but some are useful. The series is fundamentally an autoregressive process of order 1 after one has adjusted for the three "unusual data points"; see [anomaly detection output] for clear support for the anomaly identification.

The plot of the residuals from the above model [residual plot] clearly suggests reduced variability over time. It is reasonable to suggest that there has been a break-point in the model error variance, which would suggest GLS or a weighted model. This was not investigated here due to the small sample size.

Here is the plot of the original data: [plot of the original data]

While the variability of the series is higher at higher values, suggesting to some that there is a need for logarithms (http://stats.stackexchange.com/questions/18844/when-and-why-to-take-the-log-of-a-distribution-of-numbers), it is truer yet that the error variance is better characterized as having a deterministic change point at or about day 11.


IrishStat

ROUND TWO:

You asked “how do I do this with the log-link function and quasi(Poisson) errors?”. I say put aside your priors suggesting a particular fixed model and use a data-driven empirical process to identify the (possible) memory model, refining parameters and testing both necessity and sufficiency.


With only 29 days of daily data (about four weekly cycles), I am normally reluctant to enable the automatic process to consider seasonal activity such as a day-6 effect, which the OP has smartly spotted and pointed out ... a win for the human!

Following is the audit trail .... the ACF of the original series is here:

[ACF of the original series]

I suggested the possibility of a day-6 effect to the software, which then found support for that hypothesis while detecting three unusual points and incorporating an AR(1) effect, shown here: [estimated model with day-6 effect, AR(1) term, and three anomalies]. The companion PACF of the original series is here:

[PACF of the original series]

The Actual/Fit and Forecast is here:

[Actual/Fit and Forecast plot]

with forecasts here:

[forecast values]

... all without assuming logarithms or any other possible unwarranted transformation.

Logs can be useful, but the decision to apply a power transform should never be based upon the original data but on the residuals from a model, which is where all the assumptions that need to be tested are placed. See: When (and why) should you take the log of a distribution (of numbers)?


Notice the ACF of the residual series, suggesting that the model cannot be shown to be insufficient:

[ACF of the residuals]

and a supporting (not quite perfect!) residual plot here:

[residual plot]

As Isaac Asimov said “the only education is self-education” and your question is certainly in that spirit.

EDITED AFTER THE OP REQUESTED A LONGER FORECAST PERIOD (A 149-DAY FORECAST PERIOD WAS USED)

Here is the Actual/Fit & Forecast graph: [Actual/Fit and Forecast graph] with the forecasts here: [forecast values]

Simulation is performed using the residuals from the model: [model residuals used for the simulation]

I chose not to allow for future anomalies and report here the simulation results (see Bootstrap prediction interval for an introductory discussion) for a few selected periods ahead.

period 30 ... 1 day ahead

[simulated forecast distribution for period 30]

period 31 ... 2 days ahead

[simulated forecast distribution for period 31]

period 34 ... 5 days ahead (this is day 6 of the week)

[simulated forecast distribution for period 34]

period 178 ... 149 days ahead

[simulated forecast distribution for period 178]

And the sum for the next 149 periods, Q.E.D., is here:

[simulated distribution of the 149-period total]

This example shows how prediction limits shouldn't be assumed to be symmetrical, as errors from a useful model may not be normally distributed BUT are what they are.
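For readers who want to reproduce the flavor of this residual-resampling simulation in R, here is a rough sketch using only a plain AR(1) on the series (without the anomaly and day-6 adjustments used above, so the numbers will differ from those reported here):

set.seed(1)
fit  <- arima(m, order = c(1, 0, 0))             # plain AR(1) with a mean
res  <- as.numeric(residuals(fit))
phi1 <- coef(fit)["ar1"]
mu   <- coef(fit)["intercept"]
h    <- 149                                       # forecast horizon used above
sims <- replicate(2000, {
  y <- numeric(h)
  prev <- tail(m, 1)
  for (i in 1:h) {
    y[i] <- mu + phi1 * (prev - mu) + sample(res, 1)   # resample a historical residual
    prev <- y[i]
  }
  sum(y)                                          # aggregate forecast for the 149 days
})
quantile(sims, c(0.025, 0.5, 0.975))              # simulated limits for the 149-day total

The quantiles of sims give prediction limits for the aggregate, and they will typically come out asymmetric, as noted above.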

Should you wish to extend the forecast period to 335 days to give you a 364-day expectation, simply prorate the 149-day prediction to 335 days and add the actuals for the first 29 days (335 + 29 = 364) to get your desired expectation for the first year.
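In formula form:

$$\text{364-day expectation} \;\approx\; \underbrace{\sum_{t=0}^{28} m_t}_{\text{observed}} \;+\; \frac{335}{149}\times(\text{149-day forecast total}).$$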

Additionally, you had queried about "the correlation of the errors". The ACF of the model's errors [ACF of the model's errors] suggests sufficiency and no need to worry about this possible effect. This is due to extracting the AR(1) effect and the day-6 effect.

After adding the level-shift indicator to the model, here it is, along with the sum of the 149-day simulated predictions, which is much lower due to the downward level shift at period 20: [model with level-shift indicator and the simulated 149-day total]

If I further assumed logs, I would expect the prediction to be even lower.


IrishStat
  • Do I interpret your results correctly in reading them as saying that (a) as much as \$11,000 could be *taken back* from the fund in the next five months; (b) as much as \$280,000 could be received; and (c) the expected additional amount is \$56,000 (which is over twice what has been collected so far)? None of these passes a reality check, so what am I misunderstanding? – whuber Dec 02 '19 at 20:13
  • The 29 days have produced a total of 22,201 thus a mean of 769 with a median of 299 . The forecast for the next 149 days has a projected total of 56,000 thus an expected mean of (56,000/149 )=380 per day with a projected median of (33,771/149)=299 per day . The historical distribution and the forecast distributions are quite non-normal . These results seem to me to be quite reasonable . Other values including extremes (both high and low) are presented along with their very small probabilities i.e. density functions to provide risk assessment .. – IrishStat Dec 02 '19 at 21:28
  • For example the probability of exceeding 280,000 or more is 0.0 while the probability of 11,000 or more being refunded is also 0.0 – IrishStat Dec 02 '19 at 21:33
  • This is thoroughly unreasonable, because such campaigns start out rapidly and taper off. That's the entire point of the OP's remarks. From time to time such a campaign can be publicized, leading to a short-term burst in activity, but long-term it is bound to experience a strong decline, typically exponential or faster. That's precisely what the data show. The monies won't be refunded, either: in the case of failure to reach the goal, they will all be given to a designated charity. – whuber Dec 03 '19 at 00:03
  • What you say is true ..that donations normally drop off ...of course the monies won't be reimbursed .. that is why one places over-rides due to domain knowledge such as tire wear can't be negative or monies returned. As more data becomes available the decline will be more obvious to objective statistics ... Let us hope that others join in the tribute to Monica. – IrishStat Dec 03 '19 at 01:02
  • @IrishStat Possibly one can model this with a more mechanistic starting point. Say I have a potential pool of donators $D$ and a (random) fraction $c$ donates every day (but each only once, so that $D$ is decreasing). Then we could model $c$ as your arima model, but we end up with a curve (for the donations made by the pool $D$) that is more like an exponential curve. Possibly there might be simpler ways to turn it into an exponentially (or otherwise) diminishing curve (like here for [kriging](https://stats.stackexchange.com/a/380218/164061)). But arima is not my strong suit. – Sextus Empiricus Dec 03 '19 at 07:22
  • arima is a polite way of transforming data that is auto-correlated into data that is free of auto-correlation so that standard methodology/theory might apply. – IrishStat Dec 03 '19 at 07:26

For this type of problem, it should be possible to make a prediction of the total donations by predicting the infinite tail of donations, and adding this to the observed donations. To facilitate our analysis, suppose we let $M_t$ denote the donation received on day $t$, and let $U$ denote the total remaining donations, and $V$ denote the total donations (including the observed donations).

If we have observations for days $t = 0,1,...,T$ then we are making predictions for the infinite sequence of days $t = T+1, T+2, T+3, ...$. Under a GLM with a log-link function, the predictions will be of the form:

$$\hat{M}_t = \exp(\hat{\beta}_0 + \hat{\beta}_1 t).$$

It follows that the predicted value of the total remaining donations is:

$$\begin{equation} \begin{aligned} \hat{U} \equiv \sum_{t=T+1}^\infty \hat{M}_t &= \sum_{t=T+1}^\infty \exp(\hat{\beta}_0 + \hat{\beta}_1 t) \\[6pt] &= \exp(\hat{\beta}_0) \sum_{t=T+1}^\infty \exp(\hat{\beta}_1)^t \\[6pt] &= \exp(\hat{\beta}_0 + \hat{\beta}_1 (T+1)) \sum_{t=0}^\infty \exp(\hat{\beta}_1)^t \\[6pt] &= \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 (T+1))}{1-\exp(\hat{\beta}_1)}. \\[6pt] \end{aligned} \end{equation}$$

Thus, the predicted total donations (including the observed donations) is:

$$\begin{equation} \begin{aligned} \hat{V} \equiv \sum_{t=0}^T m_t + \sum_{t=T+1}^\infty \hat{M}_t &= \sum_{t=0}^T m_t + \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 (T+1))}{1-\exp(\hat{\beta}_1)}. \\[6pt] \end{aligned} \end{equation}$$

This value is the MLE prediction for the total donations (due to the invariance property of the MLE).


Implementation in R: I am going to implement this method using a negative-binomial GLM instead of a quasi-Poisson GLM. The advantage of the negative binomial model is that you actually have a fully specified distribution, which makes it easier to obtain prediction intervals (if you so desire). In the code below I create the data-frame, fit the model, and then generate the total predicted donations. (Due to your update, I have generated a variable for the day of the week, but I have not incorporated this into the model. It is there if you decide you want to add it.)

#Generate the variables
Donations <- c(6085, 3207, 885, 1279, 1483, 75, 421, 335, 1176,
               504, 430, 110, 36, 299, 314, 215, 417, 1712,
               2141, 35, 235, 80, 330, 70, 70, 105, 65, 15, 180);
Time      <- c(0:28);
DAYS      <- c('Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'Mon');
Day       <- rep(DAYS, 5)[1:29];

#Create the data frame
DATA <- data.frame(Donations = Donations, Time = Time, Day = factor(Day));

#Fit the model and extract the estimated coefficients
library(MASS);
MODEL <- glm.nb(Donations ~ Time, data = DATA);
COEFS <- summary(MODEL)$coefficients;
B0    <- COEFS[1,1];
B1    <- COEFS[2,1];

#Predict the remaining donations
UHAT <- exp(B0 + B1*nrow(DATA))/(1 - exp(B1));

#Predict the total donations
VHAT <- sum(DATA$Donations) + UHAT;

This particular model has a McFadden pseudo-$R^2$ of 38.89%, which can be improved if you add the day variable into the GLM. The predicted remaining donations and predicted total donations are shown below.

UHAT;
[1] 1109.464

VHAT;
[1] 23418.46

As you can see, under this method, we predict an additional \$1109.46 worth of donations, bringing the predicted total to \$23,418.46.
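(For reference, a minimal sketch of one way to reproduce the McFadden pseudo-$R^2$ quoted above, assuming the objects from the code block are still in the workspace:)

#McFadden pseudo-R^2: 1 - logLik(model)/logLik(intercept-only model)
NULLMODEL <- glm.nb(Donations ~ 1, data = DATA);
1 - as.numeric(logLik(MODEL))/as.numeric(logLik(NULLMODEL));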

Ben
  • Is there a good way to estimate the variance of $\hat V$? I would be able to do it for a point estimate (compute the variance of the predicted value as the sum of the variance of the estimated value and the variance/error of a measurement). But, here we take the *sum* of several points which will have some correlation. I guess I could 1) express the measurement/intrinsic error for the different points $\hat M_t$ due to the Poisson noise, 2) express the error for the estimate of the curve (by linearizing your expression for $\hat U$), 3) sum that together ignoring non-linearity effects. – Sextus Empiricus Dec 04 '19 at 10:32
  • If I were you, I would ignore the sum and just look at the functional form at the end. Ultimately, the variable part is the nonlinear function $\hat{U} = h(\hat{\beta}_0,\hat{\beta}_1)$ (since $T$ is known). Thus, you are essentially just estimating a nonlinear function of the parameter estimators. You could get an approximation using the delta method, which might be the simplest way to do it. – Ben Dec 04 '19 at 22:46
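A rough sketch of the delta-method approximation mentioned in the comment above, assuming MODEL, B0, B1, and UHAT from the answer's code (the gradient is taken with respect to $(\hat{\beta}_0, \hat{\beta}_1)$):

#Delta-method approximation: Var(UHAT) ~ grad' V grad with V = vcov of (B0, B1)
V    <- vcov(MODEL);
Tp1  <- nrow(DATA);                                #T + 1
grad <- c(UHAT,                                    #dUHAT/dB0
          UHAT*(Tp1 + exp(B1)/(1 - exp(B1))));     #dUHAT/dB1
SE_U <- sqrt(drop(t(grad) %*% V %*% grad));
c(UHAT - 1.96*SE_U, UHAT + 1.96*SE_U);             #rough 95% interval for remaining donations

Since $\hat{V} = \sum_t m_t + \hat{U}$ and the observed sum is a constant, the same interval shifted by $\sum_t m_t$ applies to $\hat{V}$.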