
To run a regression on daily sales data, I need to set up different dummies (e.g. day of the week, month, year, week of the year, moving holidays...)

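# Build one block of 0/1 dummies per calendar factor; dropping column 1 (the
# intercept) leaves the first level of each factor as the reference category.
dummies <- cbind(model.matrix(~template$Weekday)[,2:7],
                 model.matrix(~as.factor(template$Month))[,2:12],
                 model.matrix(~as.factor(template$Year))[,2:5],
                 model.matrix(~as.factor(template$CalendarWeek))[,2:53])
# The names must follow the factor level order (assumes Weekday levels run Mon..Sun)
colnames(dummies) <- c('Tue','Wed','Thu','Fri','Sat','Sun',
                   'Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec',
                   paste0('y',2:5),   #5 years of data in total
                   paste0('w',2:53))   #calendar weeks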

"template" is a data.table which includes all the information (e.g. Sales, Date, Weekday...)

Right now I am missing holiday dummies (including lead and lag dummies). To capture the moving holidays, I have to create one dummy per holiday.

How do I figure out how many lead and lag dummies I need for the different holidays? Is there a way to create them "on the fly", meaning I add lead/lag variables step by step (e.g. first Easter-1, then Easter-2...) and check each time whether my regression model improves (e.g. AIC goes down)?
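The step-by-step idea described above could be sketched roughly as follows. This is only an illustration on simulated data, not a tested recipe: the helper `holiday_dummy()`, the toy `sales` series and the cap of seven lead days are all assumptions made for the example. The Easter Sunday dates are hard-coded for 2011 to 2015.

```r
set.seed(1)
dates  <- seq(as.Date("2011-01-01"), as.Date("2015-12-31"), by = "day")
easter <- as.Date(c("2011-04-24", "2012-04-08", "2013-03-31",
                    "2014-04-20", "2015-04-05"))

# Hypothetical helper: 1 on days lying `offset` days after a holiday
# (negative offset = days before the holiday), 0 otherwise
holiday_dummy <- function(dates, holidays, offset = 0) {
  as.integer(dates %in% (holidays + offset))
}

# Toy sales with a dip on Easter and on the two days before it (made-up numbers)
sales <- 100 + rnorm(length(dates), sd = 5) -
  40 * holiday_dummy(dates, easter,  0) -
  30 * holiday_dummy(dates, easter, -1) -
  20 * holiday_dummy(dates, easter, -2)

# Start with the holiday itself, then add one lead dummy at a time
X <- data.frame(sales = sales, easter0 = holiday_dummy(dates, easter, 0))
best_aic <- AIC(lm(sales ~ ., data = X))
for (k in 1:7) {
  cand <- X
  cand[[paste0("easter_lead", k)]] <- holiday_dummy(dates, easter, -k)
  cand_aic <- AIC(lm(sales ~ ., data = cand))
  if (cand_aic >= best_aic) break   # AIC did not go down: stop extending
  X <- cand
  best_aic <- cand_aic
}
names(X)   # dummies that were kept
```

The same loop with positive offsets would handle lag dummies. Note that stopping at the first non-improving step is a greedy choice; it can miss an effect that only shows up several days out.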

How do I deal with non-moving (fixed-date) holidays? Simply by creating dummy variables for "day of the year", e.g. 366 dummy variables?
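One possible alternative to 366 day-of-year dummies, offered only as a sketch, is a single dummy per fixed-date holiday, built by matching the month and day of each date. The holiday list `fixed` and the column name `ChristmasEve` are assumptions chosen for illustration; the actual list would depend on the countries behind the data.

```r
dates <- seq(as.Date("2011-01-01"), as.Date("2015-12-31"), by = "day")

# Assumed example list of fixed-date holidays, given as "month-day"
fixed <- c(NewYear = "01-01", July4 = "07-04", Christmas = "12-25")

# One 0/1 column per holiday: 1 where the month-day of the date matches
md <- format(dates, "%m-%d")
fixed_dummies <- sapply(fixed, function(d) as.integer(md == d))

# A one-day lead dummy: 1 on the day whose following day is Christmas
ChristmasEve <- as.integer(format(dates + 1, "%m-%d") == "12-25")

fixed_dummies <- cbind(fixed_dummies, ChristmasEve = ChristmasEve)
colSums(fixed_dummies)   # number of flagged days per column
```

Further lead and lag columns follow the same pattern with `dates + k` or `dates - k`.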

Are any kinds of dummy variables missing so far, e.g. day of the month?

Thanks for your support!

Update:

Sample data

RandomDude
  • Post your data to dropbox.com and specify the beginning date and what country the data is from. – Tom Reilly Sep 30 '15 at 14:36
  • Hi @TomReilly, I added the data under Update. The main country is certainly the US, but other countries (and their specific holidays) may influence the data, since it comes from an international supply chain... – RandomDude Sep 30 '15 at 18:22
  • Hi... the link to the data doesn't work anymore. Can you fix it? Then I will post the whole model and results to dropbox. – Tom Reilly Oct 16 '15 at 12:30

1 Answer


This is not the full model. There are a total of 74 variables. It looks like you have already been drinking at the trough of daily data. What is there to be learned from your question here that wasn't already answered in "Decomposition of daily time series (several years) with multiple seasonal patterns"?

[Image: model output, not reproduced here]

Tom Reilly
  • Thanks for your answer. Well, I have already accepted that there is no "simple" way to forecast daily data, and there is a reason why software solutions like autobox exist. But for the sake of my thesis I have to come up with some model to forecast daily data. So I am comparing different approaches available in R (e.g. tbats, regression,...) on the different datasets that I have to forecast, in order to suggest the "least worst" model. – RandomDude Oct 01 '15 at 12:05
  • So I am trying to improve the regression approach as much as I can. I set up dummy variables and afterwards use variable selection to find a final regression model. Would you suggest any other dummy variables that I can/should implement beyond these: day of the week, month, calendar week, day of the year, moving holidays (including lead and lag)? I don't think I can implement level shifts or pulses; at least I am not aware of any method. – RandomDude Oct 01 '15 at 12:11
  • Does it make sense to create dummies for the different years? – RandomDude Oct 01 '15 at 12:13
  • As you can see above, day of the week, month of the year were important, but not all of them. Yes, certain holidays and their lead and lags are important. Why don't you download a 30 day trial from autobox.com and use that as well? Day of the year is not something I have ever seen before. What could that be? Day of the month is something we have seen. – Tom Reilly Oct 01 '15 at 12:16
  • No, there is no sense to think that a given year would be different unless there was some marketing or policy change. "Dummies for different years" is like a level shift variable or multiple ones. – Tom Reilly Oct 01 '15 at 12:18
  • Maybe I am using the wrong name for it, or it just doesn't make sense?! Day of the year: 366 different dummies to represent each date of the year. For example: January has 31 days, so I need 30 dummies, February has 29 days (sometimes 28) so I need 28 dummies..., December has 31 days so I need 30 dummies. I thought that in this way I could catch all holidays that are tied to a specific date (4th of July, Christmas) as well as their lead and lag effects. Day of the month is not month-specific, right? So there are 30 dummies for the 31 possible days of the month (1st to 31st of a month)? – RandomDude Oct 01 '15 at 13:43
  • I have already thought about using autobox. As soon as my model works I will download it, so I can compare performance and also show that it makes sense to invest in such a software solution. – RandomDude Oct 01 '15 at 13:47
  • Take a look at slides 45-55 http://www.autobox.com/cms/index.php/afs-university/intro-to-forecasting/doc_download/53-capabilities-presentation For day of the month, it is trying to measure the same impact on a given day of the month. For days of the week, you create 6 dummies. For months of the year, 11 dummies, and the holidays. – Tom Reilly Oct 01 '15 at 19:44
  • If you ignore the outliers, level shifts and changes in seasonality, then you don't get a good read on the main variables. – Tom Reilly Oct 01 '15 at 20:28
  • Any progress here? – Tom Reilly Oct 15 '15 at 15:25
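
The dummy counts mentioned in the comments above (6 for day of the week, 11 for month of the year, 30 for day of the month) can be reproduced with a single `model.matrix()` call. This is a minimal sketch on a generated date range; the data frame `cal` and its column names are assumptions for illustration and are not taken from the original `template` table.

```r
dates <- seq(as.Date("2011-01-01"), as.Date("2015-12-31"), by = "day")

# Calendar factors derived from the dates (locale-independent format codes)
cal <- data.frame(
  wday  = factor(format(dates, "%u")),   # 1 = Monday ... 7 = Sunday
  month = factor(format(dates, "%m")),   # 01 ... 12
  mday  = factor(format(dates, "%d"))    # 01 ... 31, not month-specific
)

# Treatment coding: each factor with k levels yields k - 1 dummy columns;
# [, -1] drops the intercept column
X <- model.matrix(~ wday + month + mday, data = cal)[, -1]
dim(X)
```

With the first level of each factor as the reference category, this gives 6 + 11 + 30 = 47 columns. Holiday dummies would be added as extra columns on top of these.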