Dates as regressors in linear regression

Question

I recently read this post: Does it make sense to use a date variable in a regression?

The accepted answer there says that dates can be used as regressors. Here is what I have done so far: my start date is January 1, 2018, so this is day 0. January 2, 2018 is day 1, and so on, so December 31, 2018 is day 364. I also have data for 2019, so January 1, 2019 is day 365, January 2, 2019 is day 366, and so on. I also have categorical variables that I converted to dummy variables using one-hot encoding. My data basically looks like this:

Day     Feature1     Feature2     Feature3
0       0             1             0
1       1             0             0
2       0             1             0
3       0             0             1
...
400     0             0             1

The question is: Day is always increasing, so I thought that scaling it to take values between 0 and 1 (using MinMaxScaler in Python, for example) would be a good idea. However, if I want to forecast what happens on day 401, how do I enter this value into the model, which (I believe) has the following form:

$y = \beta_0+\beta_1(Day)+\beta_2(Feature1)+\beta_3(Feature2)+\beta_4(Feature3)$

If Day is scaled to lie between 0 and 1, how can day 401 also be mapped to a number between 0 and 1 when the scaling was only fitted on days 0 to 400?

Looks like sklearn's `MinMaxScaler` will transform a value of 401 to a proportional distance beyond the max. For instance, if you fit `MinMaxScaler` on `np.arange(0,101)` and then call the scaler on `101`, the result will be 1.01. Seems fine to me. If you use something like a linear model, this shouldn't be a problem, but methods like random forests will not be able to generalize beyond the data they see in the training set. — Demetri Pananos, Mar 21 '19 at 05:01
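The behaviour described in the comment above can be checked with a short sketch. The target `y` here is synthetic and purely illustrative (it is not the asker's data), and the variable names are made up for the example. It also shows why scaling is harmless for ordinary least squares: a model fitted on the raw day and one fitted on the scaled day give the same prediction for an unseen day.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Training days 0..100, as in the comment above.
days = np.arange(0, 101).reshape(-1, 1)

scaler = MinMaxScaler()
scaler.fit(days)  # learns min=0 and max=100 from the training data only

# A day past the training range is mapped slightly above 1 (about 1.01),
# because the default MinMaxScaler does not clip to the fitted range.
scaled_new_day = scaler.transform(np.array([[101]]))[0, 0]
print(scaled_new_day)

# Synthetic target with a linear trend plus noise (assumption, for illustration).
rng = np.random.default_rng(0)
y = 2.0 + 0.5 * days.ravel() + rng.normal(scale=0.1, size=days.shape[0])

# Fit once on raw days and once on scaled days.
raw_model = LinearRegression().fit(days, y)
scaled_model = LinearRegression().fit(scaler.transform(days), y)

# Both parameterisations give the same forecast for day 101.
pred_raw = raw_model.predict(np.array([[101]]))[0]
pred_scaled = scaled_model.predict(scaler.transform(np.array([[101]])))[0]
print(pred_raw, pred_scaled)
```

The key point is to reuse the scaler fitted on the training days when transforming a new day, rather than refitting it on data that includes the new day.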
I do not really agree with the answers to the other question you posted: I do not think that dates are good regressors. In fact, you address one of the problems: models always have to extrapolate (i.e. predict on ranges of inputs they have never seen before). Let's say you scale the numbers between 0 and 364 to the interval [0,1]. Then day 365 would be a value like 1.01 or so. However, there has not been a single training sample with that value for that feature. If it's about time series analysis, why don't you use lagged features (like Feature1 at the last day, ... — Fabian Werner, Mar 21 '19 at 06:04
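As a rough illustration of the lagged-feature idea mentioned in the comment above, here is a minimal sketch using pandas. The series `y` is synthetic, and the column names `y_lag1` and `y_lag2` are hypothetical; which variables to lag and by how many days depends on the actual problem.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative only, not the asker's data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=10).cumsum()})

# Values from one and two days earlier become predictors for today.
df["y_lag1"] = df["y"].shift(1)
df["y_lag2"] = df["y"].shift(2)

# The first two rows have no complete history, so they are dropped.
lagged = df.dropna().reset_index(drop=True)
print(lagged.head())
```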
The real question is whether date can be a good predictor for your data and your problem, and we can't possibly say. But the way you have defined date means that you are only looking for a linear trend in date and have no hope of picking up seasonality (dependence on time of year) if it is important, and it often is, in fields from environmental through medical to economic. Whether you scale from days = 1 to 400 to fraction of record = 0 to 1 is a trivial question of parameterisation, but clearly affects the mechanics of how you plug in numbers. — Nick Cox, Mar 21 '19 at 07:35
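One common way to let a linear model see time of year, in addition to a linear trend, is to add sine and cosine terms of the day of year. The sketch below assumes the day numbering from the question (day 0 is January 1, 2018). The column names are hypothetical, and a single annual harmonic is only one possible choice.

```python
import numpy as np
import pandas as pd

# Day index as in the question: day 0 is January 1, 2018.
day = np.arange(0, 401)
dates = pd.Timestamp("2018-01-01") + pd.to_timedelta(day, unit="D")

# Position within the year expressed as an angle, so that late December
# and early January end up close together in feature space.
doy = dates.dayofyear.to_numpy()
angle = 2 * np.pi * (doy - 1) / 365.25

X = pd.DataFrame({
    "day": day,                  # linear trend term
    "sin_year": np.sin(angle),   # annual cycle, sine component
    "cos_year": np.cos(angle),   # annual cycle, cosine component
})
print(X.head())
```

With little more than one year of data, as in the question, a trend and an annual cycle are hard to tell apart, so such terms should be interpreted cautiously.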
@FabianWerner Whether dates are "good" predictors depends sensitively on project goals and what else is available. Date is often a proxy for something else, as when (e.g.) seasonal variations in sales or unemployment at a national scale depend on a host of other variables that no one could or would want to obtain and build into a model. Similarly, even if there are (e.g.) rough linear, exponential, or logistic trends in time, allowing dates to be used empirically does not deny the principle that predictors closer to the underlying processes might be preferred, but choices have to be practical. — Nick Cox, Mar 21 '19 at 07:41
I have voted to close as unclear. Advice on what you should do depends on information you haven't given on your project and your data. It's a close call whether this is a duplicate of the post you cite, as I don't see that you are raising different issues that could be of wider interest. — Nick Cox, Mar 21 '19 at 07:54
Further, I agree with @FabianWerner that it's not clear whether your real problem calls for time series analysis. — Nick Cox, Mar 21 '19 at 08:21
