
In sequence modeling, when the data already contain a datetime column:

How should additional inputs derived from it, such as year, month, and day number, be handled?

Should they be preprocessed as regular quantitative features (for example, z-score scaling), or treated as categorical features and encoded?

For example:

# this column is already in the data
dates = ["2021-01-01", "2021-02-02", "2021-03-03"]
# separately added features
day = [1, 2, 3]
month = [1, 2, 3]
year = [2021, 2021, 2021]

I've seen both approaches and would really appreciate some reasoning.

  • Categorical or dummy encoding models step changes, with no relationship between neighboring days/months/years. This usually makes no sense - neighboring time buckets will be more closely related than time buckets far apart. Far better to use numerical encoding with periodicities for the day of the month and the month of the year, or possibly using splines. Please take a look at my answer at the proposed duplicate; if that is not helpful enough, feel free to elaborate on remaining questions. – Stephan Kolassa Jul 20 '21 at 13:56
  • This depends to some extent on what your data are and what you know about their background. A standard thing in time series modelling would be to code them consecutively daywise as 1 for the first day you have (or the first day that can occur), and then 2,3,4, etc. over the whole available time period. If for some reason there may be effects for specific months and years it may be worthwhile to keep the month and year, although seasonal effects (as well as weekday effects that may be relevant in some applications) can also be modelled based on consecutive day coding. – Christian Hennig Jul 20 '21 at 14:01
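
The two suggestions in the comments (periodic numerical encoding, and consecutive day coding) could be sketched roughly as follows. This is only an illustration under stated assumptions: the sine/cosine pair is one common way to express a periodicity and is not necessarily what either commenter had in mind, and the helper name `cyclical` and the feature names are hypothetical.

```python
import calendar
import math
from datetime import date

# The example dates from the question.
dates = [date(2021, 1, 1), date(2021, 2, 2), date(2021, 3, 3)]


def cyclical(position, period):
    """Map a 0-based position within a cycle to a (sin, cos) pair.

    Hypothetical helper: neighboring positions land close together on the
    unit circle, and the last position wraps around next to the first.
    """
    angle = 2 * math.pi * position / period
    return math.sin(angle), math.cos(angle)


first_day = min(dates)
features = []
for d in dates:
    # Length of the month this date falls in (28 to 31 days).
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    month_sin, month_cos = cyclical(d.month - 1, 12)
    dom_sin, dom_cos = cyclical(d.day - 1, days_in_month)
    features.append({
        # Consecutive day coding: 1 for the first day in the data, then counting up.
        "day_index": (d - first_day).days + 1,
        # Periodic encoding of month of year and day of month.
        "month_sin": month_sin,
        "month_cos": month_cos,
        "dom_sin": dom_sin,
        "dom_cos": dom_cos,
    })

for row in features:
    print(row)
```

With this encoding December and January end up as close to each other as January and February, which a plain integer month (12 vs. 1) or a dummy encoding would not capture. The consecutive `day_index` carries the long-run trend, so a separate year feature may be unnecessary unless year-specific effects are expected.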
