2

I would like to do sales prediction based on my sales data for a particular product over one year. I understand this is non-stationary data which needs to be converted into stationary data and modeled using auto.arima() to forecast the next 30 days. I would appreciate it if anyone could help me with this in R, as I am new to this domain.

Date       Code    Sales      
 2015-01-01 65323     2 
 2015-01-02 65323     1 
 2015-01-03 65323     3 
 2015-01-04 65323     3 
 2015-01-05 65323     2 
 2015-10-05 65323     1 
 2015-10-06 65323     1 
 2015-10-11 65323     1 
 2015-10-14 65323     1 
 2015-10-17 65323     2 
 2015-10-18 65323     1 
 2015-10-21 65323     1 
 2015-10-24 65323     1 
 2015-10-25 65323     2 
 2015-10-26 65323     1 
 2015-10-28 65323     1 
 2015-10-31 65323     2 
 2015-11-02 65323     1 
 2015-11-03 65323     1 
 2015-11-05 65323     1 
 2015-11-06 65323     1 
 2015-11-07 65323     1 
 2015-11-10 65323     2 
 2015-11-14 65323     3 
 2015-11-16 65323     1 
 2015-11-17 65323     1 
 2015-11-18 65323     1 
 2015-11-22 65323     3 
 2015-11-23 65323     3 
 2015-11-26 65323     3 
 2015-11-27 65323     2 
 2015-11-29 65323     4
Stephan Kolassa
Time App

3 Answers

2

First, re-edit your data into a workable format; you can type it into a CSV file manually. Once you have done that, use RStudio to read the data and run the following (you can even copy and paste):

data <- ts(XXXX, start = c(2015, 1), frequency = 365) # create a daily time series object, where XXXX is the data read from your CSV file

growth <- diff(log(data)) # one way to handle non-stationary data: the difference of logs approximates the growth rate of sales

library(forecast) # load the forecast package; run install.packages("forecast") first if you have not got it

model <- auto.arima(growth) # that is your model

forecast(model, h = 30) # this will do a simple 30-day forecast

Note that this is the growth rate of sales; to get predicted sales figures, you need to convert back.
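As a sketch of that back-transformation (illustrative only, with a simulated series standing in for your CSV data): cumulate the forecast growth rates and exponentiate, anchored at the last observed log-sales value.

```r
library(forecast)

set.seed(1)
sales <- ts(exp(cumsum(rnorm(100, mean = 0.01, sd = 0.05)))) # toy positive sales series
growth <- diff(log(sales))      # growth rates, as above
model <- auto.arima(growth)
fc <- forecast(model, h = 30)   # 30-day forecast of the growth rate

# Convert back to the sales level: cumulate the growth rates and
# exponentiate, anchored at the last observed log-sales value
last.log <- log(sales[length(sales)])
sales.fc <- exp(last.log + cumsum(fc$mean))
```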

Learner
Qin
  • I tried implementing your suggestion but I get the error messages below: > Warning message: In auto.arima(growth): Unable to calculate AIC offset > Error in auto.arima(growth): No suitable ARIMA model found > Error in forecast(model, h = 30): object 'model' not found. I'd appreciate any further pointers on the above. – Time App Feb 12 '16 at 01:12
  • What the auto.arima function does is select the model that yields the minimum information criterion among many possible models. This is a computational shortcut and not what textbooks recommend. In a classical textbook you would look at the ACF (autocorrelation function) and PACF (partial autocorrelation function) to determine the lag order of the AR and MA terms and whether you need seasonal components. I could not help further as I do not have your data. Your case is fairly simple: you just need one model, then forecast; the auto.arima function might be good or might be bad. – Qin Feb 12 '16 at 02:09
  • If you really want to use auto.arima function, try this model – Qin Feb 12 '16 at 02:10
  • Qin, it's the same error message as displayed below: Warning message: In auto.arima(growth, ic = c("bic")): Unable to calculate AIC offset; Error in auto.arima(growth, ic = c("bic")): No suitable ARIMA model found – Time App Feb 12 '16 at 05:51
  • @Qin: as per my answer above, Time App really needs to fill in zeros on appropriate dates, unless he wants a forecast that is badly biased. This invalidates your approach, because you would take logarithms of zeros. Plus, taking differences of logs is not "the way to handle non-stationary data". It is *one way* of handling a very specific kind of non-stationarity. Finally, ARIMA cannot handle the 9 month hole in the time series - it will simply squash the sales together. Sorry, but your answer is misleading. I don't like downvoting answers from new users, but I'm unhappy about the confusion... – Stephan Kolassa Feb 12 '16 at 08:55
  • @Stephan Kolassa: I understand and much appreciate your comment and correction above. I did not notice there were missing data, as the data were jumbled and unreadable. Speaking of ways to handle missing data (not missing at random), I am working on a mixed-frequency factor model to forecast macroeconomic variables. Do you know how I can use the EM algorithm to estimate missing data so I can have a balanced pool of data? Much appreciated. – Qin Feb 13 '16 at 07:52
  • @Qin: sorry, I can't help you there. Consider posting a question here at CrossValidated. – Stephan Kolassa Feb 13 '16 at 18:39
1

Your data

Looking at your data, I notice (a) that some days are missing, such as 2015-10-07 to 2015-10-10, and (b) that you do not have zero sales values. This looks very much as if your data contained only "actual" (that is, non-zero) transactions.

If so, the first step should be to edit your data so days with zero sales actually are included as zeros. (If you only include nonzero sales, you will be modeling sales conditional on being nonzero, so your forecast will be biased upward.)

Stationarity and non-stationarity

Now that you have zeros in your data, taking logarithms as Qin suggests will not be possible any more. I'd recommend a different approach to tackling non-stationarity, anyway.

Stationarity essentially means that the data-generating process does not change over time. Thus, non-stationarity means that the DGP does change. Of course, the DGP can change over time in many different ways: there can be periodic seasonality, there can be trends, or there could even be changes in variance, while the expectation stays constant over time. It is important to figure out just how your time series is non-stationary, and how to deal with the specific type of non-stationarity involved.

With daily sales data, you usually have multiple seasonalities: one year-over-year, and one within weeks. (People do more shopping on Saturday.) Or even monthly or biweekly seasonality, based on when paychecks arrive. Your variance will usually also be higher on the weekend. Long-range trends are usually not overly important, but you may have lifecycle effects, e.g., in fashion or consumer electronics. In addition, you may have causal effects like promotions, price changes or markdowns. To be honest, once these effects are taken care of, my impression is that there are few additional autoregressive or moving average dynamics left to model.

Finally, you appear to have very fine-grained data, on store level? Your data appears to be all integers, that is, count data.

ARIMA for sales data (or not)

Now, ARIMA may not be your best choice. First, it is not built for count data series - ARIMA models expect normally distributed errors. There are a few Integer ARMA (INARMA) models around, but these are not very common.

In addition, ARIMA or INARMA cannot deal with multiple seasonalities, as you probably have in your data. You could simply work with frequency=7 to pick up the weekly seasonality, disregard the yearly seasonality, and hope for the best. (If you have less than two years of data, you won't be able to make ARIMA fit yearly seasonality, anyway.)
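A minimal sketch of that frequency=7 compromise, using simulated count data in place of the real series (auto.arima may or may not end up selecting seasonal terms here):

```r
library(forecast)

set.seed(42)
# Toy daily sales with a weekly pattern (higher weekend demand), as a
# stand-in for a zero-filled daily sales series
lambda <- rep(c(1, 1, 1, 1, 1, 2, 2), times = 20)
sales <- ts(rpois(140, lambda), frequency = 7)

fit <- auto.arima(sales)    # frequency = 7 lets auto.arima try weekly seasonal terms
fc <- forecast(fit, h = 30) # 30-day forecast
```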

(EDIT) Finally, you have a nine-month period where the product apparently was not available at all. ARIMA cannot deal with this. If you apply ARIMA to a time series such as this, you will either need to fill in the entire nine months with zeros (which would simulate true zero demand - but if the product was unavailable, we don't know demand during this period, so filling zeros is unwarranted), or you'd just disregard the hole and "squash" your data together - even though sales in January may have very little to do with those in October. Both are obviously inappropriate.

Recommendations

My recommendation would be to run a count data model, say a Poisson regression or a negative binomial regression, using day-of-week dummies and some kind of hump functions for yearly seasonality (if you have at least one year of history). Add any promotional campaigns you know of as regressors.
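A minimal sketch of such a count-data model, assuming a zero-filled data frame with date and sales columns (the data here are simulated; with real data you would also add promotion regressors and, given enough history, yearly terms):

```r
set.seed(1)
# Toy zero-filled daily sales; replace with your actual data frame
dates <- seq(as.Date("2015-10-05"), as.Date("2015-11-29"), by = "day")
dataset <- data.frame(date = dates, sales = rpois(length(dates), 1.2))

# Poisson regression with day-of-week dummies
fit <- glm(sales ~ weekdays(date), family = poisson, data = dataset)

# Expected sales for the next 30 days
future <- data.frame(date = max(dataset$date) + 1:30)
fc <- predict(fit, newdata = future, type = "response")
```

Swapping `family = poisson` for a negative binomial (e.g., `MASS::glm.nb`) handles overdispersion the same way.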

If you don't have any promotional campaigns, you could also try a very simple benchmark: for next Monday, forecast average sales over historical Mondays; for next Tuesday, forecast average sales over historical Tuesdays, and so forth. This looks very naive, but it performed extremely well in a paper I have recently had accepted in the International Journal of Forecasting (Kolassa, 2016).

Forecast evaluation

Whatever you do, make sure to evaluate your forecasts using appropriate criteria - that is, some kind of Mean Squared Error (MSE). Do not use the Mean Absolute Error (MAE) for skewed count data - if you minimize the MAE, you will get forecasts that are biased. In extreme cases (say, Poisson distributed sales with a mean below $\log 2\approx 0.69$), your MAE will be lowest for a flat zero forecast. See here and here for details.
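To see this effect in a small simulation (Poisson sales with mean 0.5, which is below log 2): the flat zero forecast beats the true mean on MAE but loses on MSE.

```r
set.seed(1)
y <- rpois(10000, lambda = 0.5)   # low-mean count data

mae <- function(f) mean(abs(y - f))   # mean absolute error of a flat forecast f
mse <- function(f) mean((y - f)^2)    # mean squared error of a flat forecast f

mae(0) < mae(0.5)   # TRUE: the biased flat zero forecast minimizes MAE
mse(0.5) < mse(0)   # TRUE: the true mean wins on MSE
```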

Further reading

Here is an older related answer of mine. And here is a free online forecasting textbook I'd recommend.

In R

Let's run this through R. Here is your data:

dataset <- structure(list(
  date = structure(c(16436, 16437, 16438, 16439, 16440, 16713, 16714,
    16719, 16722, 16725, 16726, 16729, 16732, 16733, 16734, 16736,
    16739, 16741, 16742, 16744, 16745, 16746, 16749, 16753, 16755,
    16756, 16757, 16761, 16762, 16765, 16766, 16768), class = "Date"),
  sales = c(2, 1, 3, 3, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1,
    1, 1, 1, 2, 3, 1, 1, 1, 3, 3, 3, 2, 4)),
  .Names = c("date", "sales"), row.names = c(NA, -32L), class = "data.frame")

(I read it in using scan(), then created this reproducible example using dput(). You can read your data from a file using read.table().)

Let's plot this. Always start by plotting the data:

with(dataset,plot(date,sales,type="o",pch=19,xlab="",ylab="",
  ylim=range(c(0,dataset$sales))))

sales

Now, you write in your comment that your data is seasonal. That explains the long string of no data between 2015-01-05 and 2015-10-05. However, you have positive sales on 2015-10-05, 06, 11 and 14. In my experience, this means that product was available on 2015-10-07, 08, 09, 10, 12 and 13, but that there simply was zero demand. As per above, we need to take these zero demands into account in modeling and forecasting, since zero demands will occur again in the future, and if we don't consider them historically, our forecasts will be biased upward. So I'll transform your data by filling in zeros on dates with no sales records between 2015-10-05 and 2015-11-29. (I would do the same between 2015-01-01 and 2015-01-05, but all dates there do have nonzero sales records.)

(Note incidentally, that we have a source of potential confusion here. To a retailer, a "seasonal" product is often one that is only available during certain times of the year. To the time series analyst and forecaster, "seasonal" is a time series that is defined the entire time, but exhibits seasonal peaks and troughs.)

dataset.new <- data.frame(
  date=seq(min(dataset$date),max(dataset$date),by="day"),sales=NA)
dataset.new[as.Date("2015-10-05")<=dataset.new$date,"sales"] <- 0
dataset.new[dataset.new$date %in% dataset$date,"sales"] <- dataset$sales

(Note that the last line relies on both dataset and dataset.new being sorted by date.)

with(dataset.new,plot(date,sales,type="o",pch=19,xlab="",ylab=""))

sales new

Note the zero sales inferred at the end.

Now we can start modeling. Let's first define a forecast period:

horizon <- 30
dates.forecast <- max(dataset.new$date)+(1:horizon)

We start with a simple day-of-week model, which will simply forecast the average of Monday sales for future Mondays and so forth, as proposed above. The easiest way to do this is to use lm() to create a linear model together with weekdays():

model.dow <- lm(sales~weekdays(date)-1,dataset.new)
summary(model.dow)
forecast.dow <- predict(model.dow,newdata=data.frame(date=dates.forecast))

The summary shows the inner workings of the model. Here is part of its output:

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
weekdays(date)Friday      0.4444     0.3464   1.283 0.204969    
weekdays(date)Monday      1.0000     0.3464   2.887 0.005586 ** 
weekdays(date)Saturday    1.3333     0.3464   3.849 0.000316 ***
weekdays(date)Sunday      1.5556     0.3464   4.491 3.78e-05 ***
weekdays(date)Thursday    0.6667     0.3464   1.925 0.059566 .  
weekdays(date)Tuesday     0.6250     0.3674   1.701 0.094687 .  
weekdays(date)Wednesday   0.5000     0.3674   1.361 0.179219    

Since I used -1 in my lm() call, no intercept is used, so the coefficients correspond to the weekday averages and the forecasts.

We can also add a trend, simply by adding a +date in the lm() formula, since dates are internally converted to counts of days:

model.dow.trend <- lm(sales~weekdays(date)+date-1,dataset.new)
summary(model.dow.trend)
forecast.dow.trend <- predict(model.dow.trend,newdata=data.frame(date=dates.forecast))

Here is the summary:

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)  
weekdays(date)Friday    61.722324  25.609466   2.410   0.0194 *
weekdays(date)Monday    62.266061  25.604528   2.432   0.0184 *
weekdays(date)Saturday  62.614880  25.610999   2.445   0.0178 *
weekdays(date)Sunday    62.840770  25.612531   2.454   0.0175 *
weekdays(date)Thursday  61.940878  25.607933   2.419   0.0190 *
weekdays(date)Tuesday   62.015965  25.656988   2.417   0.0191 *
weekdays(date)Wednesday 61.894633  25.658521   2.412   0.0193 *
date                    -0.003668   0.001533  -2.393   0.0203 *

The date coefficient tells us that sales decrease by 0.0037 units per day. (Remember that there were quite a few inferred zero sales dates near the end of your time series, but none at the beginning.)

Let's plot:

with(dataset.new,plot(date,sales,type="o",pch=19,xlab="",ylab="",
    xlim=range(c(dataset.new$date,dates.forecast))))
lines(dates.forecast,forecast.dow,type="o",pch=19,col="red")
lines(dates.forecast,forecast.dow.trend,type="o",pch=19,col="green")

forecasts

We can see the day-of-week pattern, as well as the downward trend in the green forecast. If you want longer-range trended forecasts, it is good practice to dampen the trend, for which you'd need to explicitly define a trend column in your data matrices.
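As a sketch of such an explicit, damped trend column (the damping factor 0.98 is purely illustrative, and the data here are simulated stand-ins for dataset.new):

```r
set.seed(1)
# Toy stand-in for the zero-filled dataset.new
dates <- seq(as.Date("2015-10-05"), as.Date("2015-11-29"), by = "day")
dataset.new <- data.frame(date = dates, sales = rpois(length(dates), 1.2))

# A damped trend: a saturating transform of elapsed days instead of a
# raw date term; phi < 1 is the (illustrative) damping factor
damped <- function(t, phi = 0.98) (1 - phi^t) / (1 - phi)
t <- as.numeric(dataset.new$date - min(dataset.new$date))
dataset.new$trend.damped <- damped(t)

model.damped <- lm(sales ~ weekdays(date) + trend.damped, dataset.new)

# Forecast: build the same column for the future dates
dates.forecast <- max(dataset.new$date) + (1:30)
newdata <- data.frame(date = dates.forecast,
  trend.damped = damped(as.numeric(dates.forecast - min(dataset.new$date))))
fc <- predict(model.damped, newdata = newdata)
```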

Now you can also start including hump functions as additional data matrix columns, in order to model seasonal patterns (in the time series sense, see above). However, with the little data you have, I'd be very careful about creating too complex a model - your forecast accuracy will likely go down as soon as you start overfitting.
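One common, parsimonious way to build such hump-shaped yearly regressors is a low-order pair of sine/cosine terms on the day of year (a Fourier-type choice; this is only a sketch on simulated data, and only sensible once you have a year or more of history):

```r
set.seed(1)
# Toy daily data spanning a full year
dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
dataset <- data.frame(date = dates, sales = rpois(length(dates), 1.5))

# First-order yearly harmonics as smooth "hump" regressors
doy <- as.numeric(format(dataset$date, "%j"))
dataset$s1 <- sin(2 * pi * doy / 365.25)
dataset$c1 <- cos(2 * pi * doy / 365.25)

model.seasonal <- lm(sales ~ weekdays(date) + s1 + c1, dataset)
```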

Stephan Kolassa
  • Thanks very much for your elaborate explanation of the data posted; the product is used seasonally, hence those gaps between the recorded sales data. As you recommended, "for next Monday, forecast average sales over historical Mondays; for next Tuesday, forecast average sales over historical Tuesdays, and so forth", I would like to try this approach. I'd appreciate it if you could guide me further on how to implement it on my data. – Time App Feb 12 '16 at 01:21
  • @TimeApp: I edited my answer extensively, adding a treatment in R. Hope it helps. – Stephan Kolassa Feb 12 '16 at 08:49
  • Thanks Stephan very much for your elaborate walkthrough of your recommendation. To continue further, can I apply this in a loop to all the product data I have collected in a .csv, even though the data shared earlier and the data in the CSVs for different products do not follow the same pattern? Please advise. – Time App Feb 15 '16 at 05:32
  • Sure. If product codes are in the `Code` column, use a `for ( cc in unique(dataset$Code) )` loop, then select the relevant rows, e.g., `subdataset – Stephan Kolassa Feb 15 '16 at 12:56
  • Does it mean that for every product code I read in the loop, I check the date difference, and if the difference is large, say more than one month, I start zero-filling from there and adjust the sales data read from the original dataset? Please correct me if I am wrong; it would be great if this could be explained further with a sample dataset using a loop. – Time App Feb 16 '16 at 06:14
0

@Stephan: I tried with another data set, below, keeping both the initial dataset and the current one in one CSV and writing a loop to forecast. Here are the details:

Date       Code    Sales
2015-01-01 32871    4
2015-01-02 32871    2
2015-01-03 32871    1
2015-01-04 32871    3
2015-01-05 32871    5
2015-01-06 32871    1
2015-01-07 32871    3
2015-01-08 32871    5
2015-01-09 32871    1
2015-01-10 32871    2
2015-01-11 32871    1
2015-01-12 32871    1
2015-01-13 32871    1
2015-01-14 32871    3
2015-01-15 32871    6
2015-01-16 32871    7
2015-01-17 32871    4
2015-01-18 32871   11
2015-01-19 32871   13
2015-01-20 32871    4
2015-01-23 32871    5
2015-01-24 32871    8
2015-01-25 32871    9
2015-01-26 32871    1
2015-01-28 32871    3
2015-01-29 32871    2
2015-01-31 32871    4
2015-02-01 32871    5
2015-02-10 32871    6
2015-02-11 32871    2
2015-02-28 32871    1
2015-03-01 32871    1
2015-03-02 32871    5
2015-03-04 32871    1
2015-03-05 32871    5
2015-03-06 32871    6

Forecast

Can you please advise on the result obtained? And, as requested, I would like you to check your approach against my last comment.

Time App