
It's common to normalize the different variables before applying some kind of supervised or unsupervised learning.

Which approach do you use with dates? Do you use the day of year (1, 200, 300) and perform the scaling/normalization on those values?

Or is there a way to preserve the circular nature of the date (because 365 is nearer to 1 than to 200)?

--------------------- Edited ---------------------------

Let me explain my problem a bit more.

I want to do segmentation based on date values. If I use, for example, k-means on the day of year (1, 20, 365), the algorithm will treat 365 as far from 1, when in truth they are very close.

I want to know how I can normalize the data so that values that are close in real time are also close after normalization. (I usually use the mean and max-difference approach, but in this case it would not work.)

kjetil b halvorsen
pianista
    Dates themselves are not circular values, but *day of year* is. Which one do you mean? And if you have a truly circular value, you should be using [methods of circular statistics](http://stats.stackexchange.com/search?q=circular) with it rather than representing it as a single number. – whuber Aug 27 '15 at 14:20
  • I've updated the explanation of my problem a bit. – pianista Aug 28 '15 at 10:09

1 Answer


This is a broad question, since what to do depends on which statistical methods will be used later in the workflow. But take the example in the edit of the question, k-means clustering. You would need to use the circular mean; see Wikipedia or the book Topics in Circular Statistics. Then you would need an implementation of the k-means algorithm that can use such a mean, which does not have all the nice properties of the usual arithmetic mean. I don't know of such an implementation; maybe see "Why does k-means clustering algorithm use only Euclidean distance metric?".
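As a minimal sketch of the circular mean for day of year (not a k-means implementation): map each day to an angle on the unit circle, average the sines and cosines, and convert the resulting direction back to a day. The function names `circular_mean_day` and `circular_distance` and the fixed period of 365 days are my own assumptions for illustration; leap years are ignored.

```python
import math

def circular_mean_day(days, period=365.0):
    """Circular mean of day-of-year values, returned in [0, period).

    Assumes a fixed period of 365 days (hypothetical simplification).
    """
    angles = [2.0 * math.pi * d / period for d in days]
    # Average the unit vectors that correspond to each day.
    s = sum(math.sin(a) for a in angles) / len(angles)
    c = sum(math.cos(a) for a in angles) / len(angles)
    # Direction of the average vector, converted back to a day number.
    mean_angle = math.atan2(s, c)
    return (mean_angle * period / (2.0 * math.pi)) % period

def circular_distance(a, b, period=365.0):
    """Shortest distance between two days when the year wraps around."""
    d = abs(a - b) % period
    return min(d, period - d)

if __name__ == "__main__":
    print(circular_mean_day([364, 1, 2]))
    print(circular_distance(1, 365))
```

With this definition, days 364, 1 and 2 average to a value just after the start of the year, whereas the arithmetic mean would put them near day 122, and day 365 ends up at distance 1 from day 1.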

Some unusual properties of the circular mean in the case of day of year: take two dates, 1 January and 1 July. What is the mean? 1 April or 1 October? Both work about equally well ... and if you perturb this example somewhat, you can see that the definition is unstable: small changes in one day number can lead to a very different mean. But this cannot occur if the dispersion is sufficiently small (again, we need a special definition of dispersion for this case). See the links above.
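The instability can be seen numerically with a small sketch. I assume a 365-day year and use a hypothetical helper `mean_and_resultant`, which returns the circular mean together with the mean resultant length R (the length of the averaged unit vector; 1 - R is one common measure of circular dispersion). Two days that are almost half a year apart give an R close to 0, and moving one of them by two days flips the mean from spring to autumn.

```python
import math

PERIOD = 365.0  # assumed year length; leap years are ignored in this sketch

def mean_and_resultant(days, period=PERIOD):
    """Return (circular mean day, mean resultant length R) for the given days."""
    angles = [2.0 * math.pi * d / period for d in days]
    s = sum(math.sin(a) for a in angles) / len(angles)
    c = sum(math.cos(a) for a in angles) / len(angles)
    mean_day = (math.atan2(s, c) * period / (2.0 * math.pi)) % period
    # R near 1: days are concentrated; R near 0: the mean is poorly defined.
    return mean_day, math.hypot(s, c)

# Two days just under half a year apart, then just over half a year apart.
before, r_before = mean_and_resultant([1.0, 182.5])
after, r_after = mean_and_resultant([1.0, 184.5])

# Two days close together: the mean is stable and R is close to 1.
tight, r_tight = mean_and_resultant([1.0, 20.0])

if __name__ == "__main__":
    print(before, r_before)
    print(after, r_after)
    print(tight, r_tight)
```

In the first two cases the mean jumps from roughly day 92 to roughly day 275 while R stays below 0.01; in the tight case the mean is 10.5 and R is close to 1, which is the situation where the circular mean behaves much like an ordinary one.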

kjetil b halvorsen