
I have historical data for a particular metric for each month over the last 3 years, across different categories. The metric is a percentage and it's heavily skewed towards 1: more than 75% of values are above 0.9, but some are as low as 0.3.

My idea was to create some form of time series forecast, but one which I can simulate thousands of times to get the probability that the metric for a future month is higher than, say, 0.95.
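
To make that concrete, here is a minimal sketch of the kind of simulation I mean (Python; the fitted Beta distribution is just a placeholder for whatever forecasting model turns out to be appropriate, and it ignores any trend or seasonality):

```python
import numpy as np
from scipy import stats

# Toy stand-in for my real series: 36 monthly values skewed towards 1.
rng = np.random.default_rng(0)
history = rng.beta(a=9, b=1, size=36)

# Fit a Beta distribution to the history, fixing the support to [0, 1].
a, b, loc, scale = stats.beta.fit(history, floc=0, fscale=1)

# Simulate thousands of values for a future month and estimate
# P(metric > 0.95) as the fraction of draws above the threshold.
draws = stats.beta.rvs(a, b, size=10_000, random_state=rng)
print("P(metric > 0.95) ~", (draws > 0.95).mean())
```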

I tried a linear model, but that doesn't work at all.

Alex Thom
  • Why don't you post an example of one of your time series and I will try to help you, as it is important to create probability distributions for each forecast period, NOT just a forecast and a set of symmetric limits based upon a normality assumption of the errors. Given that your coefficients are statistically significant and your model's errors have passed multiple "model specification tests", then I would obtain confidence limits by re-sampling the errors. – IrishStat Jun 10 '19 at 20:11

1 Answer


It seems that you are struggling with an adequate assumption about the distribution of the response variable. Classical linear regression and classical ARMA models assume that the response variable has support on all the real numbers, $(-\infty, \infty)$. Often the response is also assumed to be normally distributed. This is clearly not the case in your application.

I would first try disregarding the (potential) time interdependence of the data and fit a Beta regression. Beta regression is a Generalized Linear Model (GLM) assuming the response variable follows a Beta distribution when conditioning on covariates. The Beta distribution is a very flexible continuous distribution on the open unit interval $(0,1)$. This answer has some good references: Regression for an outcome (ratio or fraction) between 0 and 1.
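
As a minimal sketch (assuming a recent statsmodels, where the experimental `BetaModel` in `statsmodels.othermod.betareg` implements Beta regression; in R the `betareg` package is the classic choice), a Beta regression with a simple time trend as the only covariate could look like this, with toy data standing in for your real series:

```python
import numpy as np
import pandas as pd
from statsmodels.othermod.betareg import BetaModel

# Toy monthly data: 36 months, response kept strictly inside (0, 1).
rng = np.random.default_rng(1)
t = np.arange(36)
y = np.clip(rng.beta(a=9, b=1, size=36), 1e-4, 1 - 1e-4)

# Mean modelled through a logit link with an intercept and a time trend;
# month-of-year dummies could be added the same way.
X = pd.DataFrame({"const": 1.0, "trend": t})
res = BetaModel(y, X).fit()
print(res.summary())

# Predicted mean of the Beta distribution for the next month.
X_next = pd.DataFrame({"const": [1.0], "trend": [36]})
print(res.predict(X_next))
```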

If you find that there is significant serial correlation in your response variable that the covariates cannot account for, I would look into the Beta-ARMA models of Rocha & Cribari-Neto (2009) or Guolo & Varin (2014). Guolo & Varin (2014) is probably the easiest one to get started with, since they include a nice example in R where they fit a Beta-ARMA model to an illness percentage over time.
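
To connect this with your simulation idea: once such a model is fitted, you can simulate forecast paths forward and read the exceedance probability straight off the simulations. Below is a hand-rolled sketch of a Beta-AR(1) recursion in the spirit of these papers; `alpha`, `phi` and `nu` are assumed values for illustration only, where in practice you would estimate them from your data:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

# Assumed (not estimated) parameters: intercept, AR coefficient, precision.
alpha, phi, nu = 0.6, 0.75, 40.0
y_last = 0.92                     # last observed value of the metric
horizon, n_paths = 12, 10_000

rng = np.random.default_rng(2)
paths = np.empty((n_paths, horizon))
y_prev = np.full(n_paths, y_last)
for h in range(horizon):
    # Beta-AR(1): the logit of the conditional mean is driven by the
    # logit of the previous observation.
    mu = inv_logit(alpha + phi * logit(y_prev))
    # Beta parametrised by mean mu and precision nu: Beta(mu*nu, (1-mu)*nu).
    y_prev = np.clip(rng.beta(mu * nu, (1 - mu) * nu), 1e-12, 1 - 1e-12)
    paths[:, h] = y_prev

# Monte Carlo estimate of P(metric > 0.95) for each forecast month.
print((paths > 0.95).mean(axis=0).round(3))
```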

Duffau
    "Often the response is also assumed to be normally distributed." There are no assumptions necessary about the distribution of the response ... all distributional assumptions deal with the error process... – IrishStat Jun 10 '19 at 21:58
  • True, it's often stated in terms of the errors, but it's equivalent (for the most part) to having a normal response. If you, for example, assume $Y\sim N(X^{\prime}\beta, \sigma^2)$, which is the GLM with a normal assumption, it is equivalent to assuming normal errors. In the stationary ARMA case, having normal errors is equivalent to $(Y_1, \ldots, Y_T)$ being jointly normal, which implies marginal normality as well. This is of course a special characteristic of the normal distribution and is not generally true. – Duffau Jun 10 '19 at 22:38
  • I disagree. If you have a time series of, say, 100 values where the first 50 are normally distributed about mean1 and the most recent 50 values have a normal distribution around mean2, the 100 values together will have a non-normal distribution. Now the residuals from a useful model that has a level-shift predictor at time period 51 will have a normal distribution, thus parametric tests apply. – IrishStat Jun 11 '19 at 01:37
  • But what you are describing is not a stationary ARMA process. – Duffau Jun 11 '19 at 06:41
  • I agree ... so then your comment about distributional equivalence is appropriate for stationary ARMA processes but not in general. – IrishStat Jun 11 '19 at 07:10