Can I use month as a numerical variable (and not as a categorical variable) in linear regression?

Question

Let's say my response is $Y$.

I know for a fact that $Y$ decreases during winter time, and then starts to increase around spring. So, does it make sense to use month (not as a categorical variable) as a predictor for $Y$?

For example, numerical values go from March = 1 to Feb = 12. So as the value of month increases, $Y$ decreases. There will be an inverse relationship.

I already did this and the adjusted $R^{2}$ went from .45 to .56. It also has a very small $p$-value. Note that I also have other predictors in the model.

I am trying to see if doing this can have any adverse effects on the multiple linear regression (MLR) model. Does anyone here know of any such possibilities?

Just from what you say here, I think 'month' should remain a categorical variable. The question is how many levels the categorical variable should have. It may be best to have 2 to 4 'seasons' instead of 12 individual months. // Designating months or seasons by integers would be OK, but the variable should still be treated as categorical (for example, in R, declared with `as.factor`). Otherwise, `3` means three times as much of _something_ as `1`. — BruceET, Nov 18 '20 at 21:49
What if the model suffers when I use month as 2 seasons or 4 seasons? I tried using month as a 12-level categorical variable. As expected, the results were better, but only slightly (compared to when I used it as numerical). Now I have an additional 11 variables instead of 1 to run diagnostics on. I already have 8 variables, so the total comes to 19. Is it possible to treat the model with month as numerical as an approximate model? — Shashank, Nov 18 '20 at 22:35
This is not a place to discuss specific programming problems for particular software. However, I am not sure why treating month as a 12-level categorical variable increases the number of independent variables in your linear model. — BruceET, Nov 18 '20 at 22:39
Because a 12-level categorical variable is coded as 11 binary (indicator) variables. — Shashank, Nov 18 '20 at 22:42
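To make the count concrete, here is a minimal sketch of indicator coding for month. It assumes the model has an intercept and that month 1 is taken as the reference level (both are assumptions for illustration, not something stated in the thread).

```python
import numpy as np

# One row per month, coded 1..12 (illustrative values only).
month = np.arange(1, 13)

# Month 1 is the reference level, so it gets no column of its own.
# Each remaining month (2..12) gets one 0/1 indicator column.
dummies = (month[:, None] == np.arange(2, 13)).astype(int)

print(dummies.shape)  # (12, 11): 12 levels become 11 indicator columns
```

The reference month is represented by a row of zeros; its effect is absorbed into the intercept, which is why 12 levels cost 11 parameters rather than 12.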
I edited your question but observed something I didn't quite understand. Are you saying *between* March and February you observe decreasing values in $Y$? Also, why start with March = 1? I will roll back my edit if I misunderstood you. — Thomas Bilach, Nov 18 '20 at 23:09
March, April, and May have high values, and Jan and Feb have lower values. So, from March to Feb there is a decreasing trend. If I start from March = 1, April = 2, ... and so on to Feb = 12, then there would be a near-perfect inverse relationship between $Y$ and month. — Shashank, Nov 18 '20 at 23:13
Sine and cosine functions of month may sometimes be a good idea as predictors. At best that might mean 2 (4, 6 ...) parameters to estimate rather than 11. Note that seasonality is periodic and only crudely captured by a linear term, and changing the origin won't help much. Much depends on what $Y$ is; there is no obvious reason for concealing that. To the extent that it is say climatic, seasonal variation may be smooth. To the extent that it is human or human-influenced, indicators for each month could be a better idea, as (e.g.) there could be jumps around December. — Nick Cox, Nov 18 '20 at 23:28
Y is basically Revenue. I don't understand what you mean by using sine and cosine functions of month. Can you elaborate? — Shashank, Nov 18 '20 at 23:34
I just used sin(Month) as a predictor, and the model improved a bit, QQ plot looks better and the residuals plot for month looks more random. — Shashank, Nov 19 '20 at 00:55
It is rare that just sine or cosine should be used by itself. The point (and much else) is elaborated in https://www.stata-journal.com/article.html?article=st0116 (pdf accessible). You would want to use, say, both $\sin(2 \pi m/12)$ and $\cos(2 \pi m/12)$, where $m$ is the month, $m = 1, \dots, 12$. — Nick Cox, Nov 19 '20 at 10:07
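As a rough illustration of the sine/cosine suggestion, here is a sketch that fits an intercept plus the pair $\sin(2 \pi m/12)$, $\cos(2 \pi m/12)$ by ordinary least squares. The data are synthetic and made up purely for the example, and the helper name `harmonic_terms` is hypothetical; this is not the poster's revenue model.

```python
import numpy as np

def harmonic_terms(month, period=12):
    """Return a two-column array [sin(2*pi*m/period), cos(2*pi*m/period)]."""
    angle = 2 * np.pi * np.asarray(month, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Synthetic data (assumption): five years of monthly observations with a
# smooth seasonal pattern plus noise.
rng = np.random.default_rng(0)
month = np.tile(np.arange(1, 13), 5)
y = (10
     + 3 * np.sin(2 * np.pi * month / 12)
     + 2 * np.cos(2 * np.pi * month / 12)
     + rng.normal(0, 0.5, month.size))

# Design matrix: intercept plus the sine/cosine pair (3 parameters in total,
# versus 12 for an intercept plus 11 monthly indicators).
X = np.column_stack([np.ones(month.size), harmonic_terms(month)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # estimated intercept, sine coefficient, cosine coefficient
```

Using both terms lets the fit choose the phase of the seasonal cycle, so the result does not depend on which month is labelled 1. The terms are also periodic: month 12 and a hypothetical month 0 get identical values, which a single linear month term cannot reproduce.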
