
Why is k-means not considered a good option to use for time series data? I have read answers saying it is not a good option, but none that describe why.

Let's say we have time-series data for a shop, covering just one day. On the x-axis we have a range of 24 hours; on the y-axis, the number of shirts sold. To make the example concrete, say the number of sales at 10:00 and 15:00 is really high. Why would using k-means be a bad idea here?

RPT
    Can you link to some of the answers that you refer to? Plus: how do you want to apply k-means (or any other clustering algorithm) to a single time series? What do you want to cluster? – Stephan Kolassa Oct 26 '17 at 09:54
  • Is this your real example? Clearly examples can be hypothetical, but it doesn't help if they seem utterly implausible. It's hard for me to imagine any possible motive for clustering such a time series. Conversely, saying why it is of interest or use may help to guide answers. – Nick Cox Oct 26 '17 at 10:38
  • You totally cannot use k-means. Better to use another statistical tool. – Rosbert Oct 26 '17 at 09:50

3 Answers


* Update *

As a comment mentions, Keogh and Lin have a great paper on this subject. To my mind, they conclusively show that although the technique itself is commonly applied as detailed in my answer below, its usefulness is called into question. Read the paper yourself; it is well worth it: http://www.cs.ucr.edu/~eamonn/meaningless.pdf

* Update *

I have seen k-means used successfully on time-series data.

The general concern with k-means is the number of dimensions. More specifically, when you structure a time-series problem for the k-means algorithm, you usually slide a window over the data so that each sample is a contiguous portion of the series. Depending on your use case, you will produce many such window samples, which can be non-overlapping or overlapping.

In this setup, each time-step within the window is treated as a separate dimension by the k-means algorithm: if the number of time-steps within your window is large, you introduce a lot of dimensionality. It is in this high-dimensional regime that k-means usually suffers.

Example

If you have a raw time series consisting of n time-steps, your objective might be to establish whether any arbitrary period of time within this series is similar to a previous period.

Preparation may include creating rolling windows of width m time-steps:

ts = [1,2,3,4,1,2,9,8,3,7,9,8,1,7,9,8,1,7,2,9,3,8,7,1,9,8,2,7,3]
window1 = [1,2,3]
window2 = [2,3,4]
window3 = [3,4,1]
window4 = [4,1,2]
...

Here m is 3, and the windows advance one step at a time. You now have a number of samples created from your original data, where each sample has dimension 3; the window length m and the dimensionality are the same thing. Imagine each time-step within the window as a different independent variable in your experiment. Using the k-means algorithm as normal, you could now produce clusters that are prototypical of your underlying time series.
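For concreteness, here is a minimal sketch of this windowing-plus-clustering setup in Python. It assumes numpy and scikit-learn are available; the window width and cluster count are illustrative choices, not recommendations.

import numpy as np
from sklearn.cluster import KMeans

# The raw series from the example above
ts = np.array([1, 2, 3, 4, 1, 2, 9, 8, 3, 7, 9, 8, 1, 7,
               9, 8, 1, 7, 2, 9, 3, 8, 7, 1, 9, 8, 2, 7, 3])

m = 3  # window width: each window becomes an m-dimensional sample
# All overlapping windows, advancing one step at a time: shape (n - m + 1, m)
windows = np.array([ts[i:i + m] for i in range(len(ts) - m + 1)])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(windows)
print(km.cluster_centers_)  # the "prototypical" subsequences
print(km.labels_[:5])       # cluster assignments of the first few windows

Be aware that this is exactly the subsequence-clustering setup that Keogh and Lin criticize: the cluster centers tend toward smooth, sine-like shapes almost regardless of the input, which is why the paper linked above is worth reading before trusting the output.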

You can read more on this topic here: How to understand the drawbacks of K-means

Furthermore, it is a known problem that judging when two time series are 'close together' in Euclidean space can be misleading. A good write-up on this topic, with code, is available here: http://alexminnaar.com/time-series-classification-and-clustering-with-python.html. In short, finding a good similarity measure can be challenging, and Dynamic Time Warping is one option to help address this.

jtromans
  • Thanks for your answer. But how exactly is each time step considered a dimension? I'm not sure I understand that part. Could you please explain? Thank you very much – RPT Oct 26 '17 at 10:10
  • I've added an example into the answer – jtromans Oct 26 '17 at 10:31
  • Clustering subsequences in a rolling window as you suggest has been shown to produce results that misleadingly appear structured, but are essentially random. See Keogh and Lin (2005) *Clustering of time-series subsequences is meaningless: implications for previous and future research* – user20160 May 22 '18 at 21:52
  • http://www.cs.ucr.edu/~eamonn/meaningless.pdf – jtromans Jun 11 '18 at 11:53

K-means never compares neighboring values.

So two time series that are zero everywhere except for a peak at 15:00 and at 14:59, respectively, are maximally different. That is why you want dynamic time warping: to add some tolerance in time.
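To make that concrete, here is a minimal, unoptimized dynamic-time-warping sketch (plain dynamic programming, assuming only numpy); the two toy series reproduce the 15:00 versus 14:59 situation:

import numpy as np

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming DTW
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

x = np.zeros(24); x[15] = 1.0  # peak at 15:00
y = np.zeros(24); y[14] = 1.0  # peak at 14:59 (one step earlier)

print(np.linalg.norm(x - y))  # Euclidean: ~1.41, the shift is fully penalized
print(dtw_distance(x, y))     # DTW: 0.0, the warp absorbs the one-step shift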

Furthermore, k-means cannot use time series of different duration or resolution. If your data has 24 dimensions for exactly 24 hours, you don't notice, but many time series aren't aggregated this way. K-means also can't handle missing data.

Has QUIT--Anony-Mousse

It is possible to use k-means for several (many) time series. I did it, and it very slightly improved performance. Actually, some other methods were far better.

In a many-time-series context, using k-means would naturally mean splitting the series into clusters and training a different model $y_{t+1}=f(y_1,y_2,\ldots,y_t)$ (a linear regression, for example) on each cluster.
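Here is a rough sketch of that idea on toy data, assuming numpy and scikit-learn; the cluster count, lag order, and choice of linear model are placeholders:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = rng.normal(size=(100, 24))  # 100 toy series of length 24

# Step 1: cluster whole series (plain L2 here; see the caveat below)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(series)

# Step 2: per cluster, fit one lagged regression predicting the next value
p = 3  # number of lags fed to the model
models = {}
for c in range(3):
    members = series[labels == c]
    X = np.array([s[i:i + p] for s in members for i in range(len(s) - p)])
    y = np.concatenate([s[p:] for s in members])
    models[c] = LinearRegression().fit(X, y)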

What is essential is the distance you choose. If you just use the rough, basic $L_2$ distance, it won't work. The distance should reflect the idea that two curves are "similar". That is a lot of work: scaling, affine transformation, time translation, maybe even time warping... Ideally, a small distance between two curves would reflect the idea that "something likely to have the same consequences on the future has happened in the past".
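As one small example of the work involved, here is a numpy-only sketch of z-normalizing curves so that scale and offset stop dominating the distance; translation and warping invariances need their own machinery:

import numpy as np

def znorm(curve, eps=1e-8):
    # Remove each curve's mean and scale; eps guards against constant curves
    curve = np.asarray(curve, dtype=float)
    return (curve - curve.mean()) / (curve.std() + eps)

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = 10 * a + 5  # same shape, different scale and offset

print(np.linalg.norm(a - b))                # large raw L2 distance
print(np.linalg.norm(znorm(a) - znorm(b)))  # ~0 once both are normalized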

And still, the result may be very disappointing. If you have a clever distance between curves, then nearest neighbours or kernel smoothing might be much better. For me, they worked much better.

Benoit Sanchez