
Let's say my response is $Y$.

I know for a fact that $Y$ decreases during winter time, and then starts to increase around spring. So, does it make sense to use month (not as a categorical variable) as a predictor for $Y$?

For example, suppose the months are coded numerically from March = 1 to Feb = 12. Then as the value of month increases, $Y$ decreases, so there is an inverse relationship.

I already tried this, and the adjusted $R^{2}$ rose from 0.45 to 0.56. The month term also has a very small $p$-value. Note that the model includes other predictors as well.

I am trying to see if doing this can have any adverse effects on the multiple linear regression (MLR) model. Does anyone here know of any such possibilities?

Thomas Bilach
Shashank
  • Just from what you say here, I think 'month' should remain a categorical variable. The question is how many levels the categorical variable should have. It may be best to have 2 to 4 'seasons' instead of 12 individual months. // Designating months or seasons by integers would be OK, but the variable should still be treated as categorical (for example, declared in R with `as.factor`). Otherwise, `3` means three times as much of _something_ as `1`. – BruceET Nov 18 '20 at 21:49
  • What if the model suffers when I use month as 2 seasons or 4 seasons? I tried using month as a 12-level categorical variable. As expected, the results were better, but only slightly (compared to when I used it as numerical). Now I have an additional 11 variables instead of 1 to run diagnostics on. I already have 8 variables, so the total comes to 19. Is it possible to treat the model with month as numerical as an approximate model? – Shashank Nov 18 '20 at 22:35
  • This is not a place to discuss specific programming problems for particular software. However, I am not sure why treating month as a 12-level categorical variable increases the number of independent variables in your linear model. – BruceET Nov 18 '20 at 22:39
  • Because a 12-level categorical variable is coded as 11 binary variables. – Shashank Nov 18 '20 at 22:42
  • I edited your question but observed something I didn't quite understand. Are you saying *between* March and February you observe decreasing values in $Y$? Also, why start with March = 1? I will roll back my edit if I misunderstood you. – Thomas Bilach Nov 18 '20 at 23:09
  • March, April, and May have high values, and Jan and Feb have lower values. So, from March to Feb there is a decreasing trend. If I start from March = 1, April = 2, and so on to Feb = 12, then there would be a near-perfect inverse relationship between $Y$ and month. – Shashank Nov 18 '20 at 23:13
  • Sine and cosine functions of month may sometimes be a good idea as predictors. At best that might mean 2 (4, 6, ...) parameters to estimate rather than 11. Note that seasonality is periodic and only crudely captured by a linear term, and changing the origin won't help much. Much depends on what $Y$ is; there is no obvious reason for concealing that. To the extent that it is, say, climatic, seasonal variation may be smooth. To the extent that it is human or human-influenced, indicators for each month could be a better idea, as (e.g.) there could be jumps around December. – Nick Cox Nov 18 '20 at 23:28
  • Y is basically Revenue. I don't understand what you mean by using sine and cosine functions of month. Can you elaborate? – Shashank Nov 18 '20 at 23:34
  • I just used sin(Month) as a predictor, and the model improved a bit, QQ plot looks better and the residuals plot for month looks more random. – Shashank Nov 19 '20 at 00:55
  • It is rare that just sine or cosine should be used by itself. The point (and much else) is elaborated in https://www.stata-journal.com/article.html?article=st0116 (PDF accessible). You should want to use, say, $\sin(2 \pi m/12)$ and $\cos(2 \pi m/12)$, where $m$ is the month number $1, \dots, 12$. – Nick Cox Nov 19 '20 at 10:07
  • This is very helpful. Thank you. – Shashank Nov 19 '20 at 18:05
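
The sine and cosine suggestion in the comments can be sketched roughly as follows. This is only an illustration, assuming Python with NumPy: the helper name `harmonic_terms`, the synthetic data, and the chosen coefficients are all made up for the example and do not come from the thread. It builds the pair $\sin(2 \pi m/12)$, $\cos(2 \pi m/12)$ and fits them by ordinary least squares next to an intercept.

```python
import numpy as np


def harmonic_terms(month, period=12, n_harmonics=1):
    """Return sin/cos columns for month numbers 1..period.

    Hypothetical helper: each harmonic k adds two columns,
    sin(2*pi*k*m/period) and cos(2*pi*k*m/period).
    """
    m = np.asarray(month, dtype=float)
    cols = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * m / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)


# Synthetic response, purely illustrative (not the asker's revenue data):
# five years of monthly observations with a smooth seasonal cycle plus noise.
rng = np.random.default_rng(0)
month = np.tile(np.arange(1, 13), 5)
y = (
    10.0
    + 3.0 * np.sin(2.0 * np.pi * month / 12)
    + 1.5 * np.cos(2.0 * np.pi * month / 12)
    + rng.normal(0.0, 0.1, month.size)
)

# Design matrix: intercept plus one sin/cos pair (2 seasonal parameters,
# versus 11 for a full set of month indicators).
X = np.column_stack([np.ones(month.size), harmonic_terms(month)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept, sin coefficient, cos coefficient:", coef)
```

Because both columns are periodic, month 13 would map to the same values as month 1, so the choice of March = 1 or January = 1 only shifts the phase of the fitted curve and does not change the fit. Other predictors would simply be added as further columns of `X`; passing `n_harmonics=2` adds a second pair for a less smooth seasonal shape.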
