
I have some data:

column 1 = dates (daily data from say 1st Jan 2014 to 1 July 2014)

column 2 = person (about 10 different people)

column 3 = sales made (daily data from say 1st Jan 2014 to 1 July 2014)

I calculated what periods the particular person was 'working' vs 'on holiday' vs 'weekend' in column 4 (unfair to say anyone will make sales in the weekend).

I want to determine whether holiday has a significant effect on revenue. Can I use a normal linear model or do I need a time series model? Is the person irrelevant in this case?

Would 'working', 'on holiday', and 'weekend' be the groups, and would I then run an ANOVA?

Thank you.

Dino Abraham

2 Answers


I challenge the notion that sales on consecutive days are independent of one another. The comment about diagnostic plots makes me think, however, that you have some evidence that they are independent, so let's proceed with that assumption.

First, ANOVA is a linear model. For $k$ groups, your model is:

$$y_i = \beta_0 + \beta_1 I_1 + \beta_2I_2 + \cdots +\beta_{k-1}I_{k-1} +\epsilon_i$$

The $I$ terms are indicator variables (0 or 1) denoting if observation $i$ belongs to group 1, 2, etc. In your case, you have 3 groups (weekday, weekend, holiday), so you have two indicator variables. The third one gets sucked into the intercept term. This has to do with keeping the model matrix of full rank, a technical point that I think is best deferred to a separate question (one that must be addressed somewhere on Cross Validated).
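To make the indicator coding concrete, here is a minimal R sketch. The six day labels are made up for illustration, and the choice of "working" as the baseline level is an assumption, not something fixed by the question.

```r
# Hypothetical day-type labels; "working" is listed first, so it becomes the baseline.
day_type <- factor(
  c("working", "working", "weekend", "holiday", "weekend", "holiday"),
  levels = c("working", "weekend", "holiday")
)

# Treatment (dummy) coding: an intercept column plus k - 1 = 2 indicator columns.
X <- model.matrix(~ day_type)
print(X)

# A third indicator duplicates information already carried by the intercept,
# so the matrix gains a column but not rank.
X_full <- cbind(X, day_typeworking = as.numeric(day_type == "working"))
qr(X)$rank       # rank of the 3-column matrix
qr(X_full)$rank  # rank of the 4-column matrix
```

The two ranks agree even though X_full has an extra column, which is the full-rank issue mentioned above: the "working" indicator equals the intercept minus the other two indicators.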

Yale has a decent description of the details of this, as do other posts on Cross Validated.

So far, you probably get what is going on and can develop a model matrix to stick in some software to get you the regression equation and some inferences on the parameters, so you can check whether the parameter corresponding to holiday is significant. (R has the lm function.) What might be worth considering is including the particular salesperson. After all, one person may just be better at making sales. Another member may come along and disagree with me on this, but I think that the person should be something called a random effect, which involves somewhat different ways of estimating the parameters and doing inference on them. R has the lme4 package to do random effects through the lmer function.

The gist of a random effect is that the levels are drawn from a larger population. Your salespeople are the ones that you happen to have out of the 7 billion people on Earth, and I suspect that you want your results to generalize, not to be specific to these particular salespeople.

The way that I would write this for lmer is L <- lmer(y ~ I1 + I2 + (1 | person), data = Sales_Data).

Sales_Data would be a data frame with the headers "y" (sales), "I1" (weekend), "I2" (holiday), and "person". I1 and I2 are indicator variables; person is a factor whose levels are the salespeople ("Katya", "Dino", "Dave", and so on), and (1 | person) gives each of them a random intercept.

(I concede that I haven't done this in years. Because person enters as a grouping factor rather than as separate indicator columns, none of the salespeople has to be dropped into the intercept term. If someone wants to put their take in a comment...)
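For reference, one conventional way to hand the salesperson to lmer is as a single factor column with a random intercept, written (1 | person). The sketch below uses simulated data only; the column names (person, day_type, sales), the effect sizes, and the three names are all assumptions for illustration. With so few people the between-person variance is estimated very imprecisely, so treat the output as a syntax demonstration rather than an analysis.

```r
set.seed(2)  # simulated data only; not the asker's sales figures

# Hypothetical long-format data: one row per person per day.
people <- c("Katya", "Dino", "Dave")
sales_data <- expand.grid(person = people, day = 1:60)
sales_data$person <- factor(sales_data$person)
sales_data$day_type <- factor(
  sample(c("working", "weekend", "holiday"), nrow(sales_data), replace = TRUE),
  levels = c("working", "weekend", "holiday")
)

# Assumed structure: a per-person shift plus a day-type mean plus noise.
person_effect <- c(Katya = 15, Dino = -5, Dave = -10)
type_effect <- c(working = 100, weekend = 0, holiday = 60)
sales_data$sales <- unname(
  person_effect[as.character(sales_data$person)] +
    type_effect[as.character(sales_data$day_type)]
) + rnorm(nrow(sales_data), sd = 20)

# Only fit the mixed model if lme4 is installed.
if (requireNamespace("lme4", quietly = TRUE)) {
  # (1 | person) requests a random intercept for each salesperson.
  fit_mixed <- lme4::lmer(sales ~ day_type + (1 | person), data = sales_data)
  print(lme4::fixef(fit_mixed))  # fixed effects: baseline and day-type contrasts
  print(lme4::ranef(fit_mixed))  # predicted per-person deviations from the baseline
}
```

Letting R build the day-type indicators from a factor is equivalent to supplying I1 and I2 by hand; it just avoids maintaining the 0/1 columns manually.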

This random effects business most likely is new to you, so I don't want to have it get in the way of you doing a decent analysis. ("Perfect is the enemy of good" or something.) If you believe that there is no time dependence, then ANOVA and linear models are both acceptable and, in fact, are equivalent.
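The equivalence can be checked directly in R. This is a small sketch on simulated data (the group sizes and means are invented), comparing the overall F statistic reported by lm with the one reported by aov for the same one-factor model.

```r
set.seed(3)  # simulated data only

day_type <- factor(
  rep(c("working", "weekend", "holiday"), times = c(60, 25, 15)),
  levels = c("working", "weekend", "holiday")
)

# Assumed group means of 100, 0 and 60 with common noise.
group_mean <- c(working = 100, weekend = 0, holiday = 60)
sales <- unname(group_mean[as.character(day_type)]) +
  rnorm(length(day_type), sd = 20)

fit_lm  <- lm(sales ~ day_type)
fit_aov <- aov(sales ~ day_type)

# Overall F statistic from each route.
f_lm  <- unname(summary(fit_lm)$fstatistic["value"])
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
print(all.equal(f_lm, f_aov))

# The regression coefficients additionally show how each group
# differs from the "working" baseline.
print(coef(fit_lm))
```

The ANOVA table answers "do the group means differ at all?", while the coefficient table from lm answers "how does holiday (or weekend) compare with working?", which is closer to the original question.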

We can discuss in the comments (perhaps eventually chat) if you want to pursue the random effects avenue.

Dave

Because sales on any particular day are probably not a function of sales on the preceding day, I'd argue that a time-series or repeated-measures model is not necessary in this case. In other words, aside from a day being a holiday or not, the date is not used in the model as you described it. Person is certainly important and can be included as a block (or a random effect).
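The no-dependence assumption can be probed rather than taken on faith. A minimal sketch, assuming a single person's working days in date order and simulated sales (independent by construction here): fit the model, then look at the autocorrelation of the residuals and run a Ljung-Box test. The holiday flags and effect sizes are invented for illustration.

```r
set.seed(4)  # simulated data only

# Hypothetical daily series for one person, in date order.
dates <- seq(as.Date("2014-01-01"), as.Date("2014-07-01"), by = "day")
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)  # 0 = Sunday, 6 = Saturday
working <- data.frame(date = dates[!is_weekend])

# Randomly flagged holidays and an assumed drop of 30 on those days.
working$holiday <- factor(
  ifelse(runif(nrow(working)) < 0.15, "holiday", "not holiday"),
  levels = c("not holiday", "holiday")
)
working$sales <- 100 - 30 * (working$holiday == "holiday") +
  rnorm(nrow(working), sd = 20)

fit <- lm(sales ~ holiday, data = working)
res <- residuals(fit)

# Sample autocorrelation of the residuals up to lag 10, printed instead of plotted.
print(acf(res, lag.max = 10, plot = FALSE))

# Ljung-Box test: a small p-value would suggest serial correlation.
bt <- Box.test(res, lag = 10, type = "Ljung-Box")
print(bt)
```

Dropping weekends makes Friday and Monday adjacent in the series, so the lags are in working days rather than calendar days. If the residuals do show autocorrelation, that is a sign a time-series or repeated-measures structure deserves a second look.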

The next step needs clarification: how are working, on holiday, and weekend mutually exclusive? Do you refer to a day as "a holiday" (your treatment) and a person being "on holiday" (equivalent to not working, or working on a holiday)? Based on your question, the levels of trt for the ANOVA will be "holiday" and "not holiday", and the only data you'd include would be days when each person is working, because zero sales when a person is not working are not true "no sale" zeroes in the context of the problem.
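As a rough sketch of that setup in R, assuming hypothetical columns person, worked, trt, and sales on simulated data: keep only the days a person actually worked, then fit the treatment with person as a block.

```r
set.seed(5)  # simulated data only

# Hypothetical long-format data: one row per person per day.
people <- c("Katya", "Dino", "Dave")
d <- expand.grid(person = people, day = 1:50)
d$worked <- rbinom(nrow(d), 1, 0.8) == 1
d$trt <- factor(
  ifelse(rbinom(nrow(d), 1, 0.15) == 1, "holiday", "not holiday"),
  levels = c("not holiday", "holiday")
)

# Assumed effect: 30 fewer sales on holidays; days not worked record zero sales.
d$sales <- ifelse(
  d$worked,
  100 - 30 * (d$trt == "holiday") + rnorm(nrow(d), sd = 20),
  0
)

# Drop the structural zeroes: only days on which the person worked are analysed.
worked_days <- subset(d, worked)

# Person enters first as a block, so the trt row is adjusted for person.
fit_block <- lm(sales ~ person + trt, data = worked_days)
print(anova(fit_block))
```

Keeping the not-worked zeroes in the data would pull the group means toward zero for reasons unrelated to the holiday effect, which is why they are filtered out before fitting.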

katya
  • Thank you for the reply. What you wrote is what I was also thinking, but here are the complications: (1) a company may book in, say, 15k, and spend 3k one day, 2k the next, and then the remaining 10k the next. Why? Because they may spend more of the total 15k when the contact is in the office to help them vs not. Hence I wasn't sure the sales were actually independent. (2) There are always 0 sales over the weekend (just sales, not spend) because no one is in the office (people still spend though). I think an ANOVA with holiday/not holiday makes sense though; I just wanted to clarify the above. – Dino Abraham Nov 13 '14 at 09:23
  • Is there a statistical way of checking independence? I have the daily rev data for days OOO and days in office split into 2 columns (and a total column). Is a regression not required here at all? – Dino Abraham Nov 20 '14 at 13:02
  • *ignore - solved in R using diagnostic plots :) – Dino Abraham Nov 20 '14 at 21:11