50

What would be the approach to use Dynamic Time Warping (DTW) to perform clustering of time series?

I have read about DTW as a way to measure the similarity between two time series even when they are shifted in time. Can I use this method as a similarity measure for a clustering algorithm like k-means?

Kobe-Wan Kenobi
  • Yes, you could use a similarity measure as an input to k-means clustering and then determine groups in your data. – forecaster Jan 05 '15 at 15:40
  • Thank you for your answer, Sir. I'm guessing that for each iteration I would need to form the distance matrix for each (centroid, data point) pair, and recalculate the centroids in the standard fashion, as the mean of all series that belong to the cluster? – Kobe-Wan Kenobi Jan 05 '15 at 15:57
  • No, you basically calculate the similarity matrix and then cluster using k-means. I have used SAS very successfully in the past. Here is an [example](http://support.sas.com/documentation/onlinedoc/ets/ex_code/131/smyex05.html). – forecaster Jan 05 '15 at 16:00
  • Aleksandr Blekh in the answer below has a blog post that provides a detailed example on how to do this in R. – forecaster Jan 05 '15 at 16:12
  • @forecaster do *not* use k-means with DTW. k-means minimizes variance, not distances. Variance is squared Euclidean, but that does not mean k-means could optimize other distances. The mean doesn't, and in DTW it should be rather easy to construct counterexamples, like a sine wave offset by $\pi$: both are very similar by DTW, but their mean is constant zero - very dissimilar to both. – Has QUIT--Anony-Mousse Jan 05 '15 at 22:30
  • @Anony-Mousse makes sense, I used the Ward clustering method. – forecaster Jan 05 '15 at 22:43
  • K-means is not an appropriate algorithm for time series clustering. Hidden Markov models for discrete, longitudinal data are appropriate. There are several books out now on this topic, as well as key contributions from Oded Netzer (Columbia) and Steve Scott (Google). Another approach would be the information-theoretic method developed by Andreas Brandmaier at Max Planck called permutation distribution clustering. He has also written an R module. Comparison of cluster solutions is a different issue; Marina Meila's paper, Comparing Clusterings, U of Washington Statistics Tech Report 418, is best. – Mike Hunter Jun 05 '15 at 20:26
  • https://github.com/neqkir/2d-correlation-time-warping – kiriloff Oct 03 '21 at 19:07

5 Answers

63

Yes, you can use the DTW approach for classification and clustering of time series. I've compiled the following resources, which are focused on this very topic (I've recently answered a similar question, but not on this site, so I'm copying the contents here for everybody's convenience):

Aleksandr Blekh
  • +1 excellent collection of articles and blogs. Very good references. – forecaster Jan 05 '15 at 16:02
  • @forecaster: Thank you for the upvote and kind words! Glad you like the collection. It's a shame that I currently don't have time to learn forecasting and many other areas of statistics and data science more seriously, but I use every opportunity to learn something new. – Aleksandr Blekh Jan 05 '15 at 16:13
  • @AleksandrBlekh Thank you very much for your answer. I've been discussing this approach with Anony-Mousse, since I'm particularly interested in DTW as a similarity measure for k-means, so I could get centroids as output. What is your opinion of and experience with it? As you can see, Anony-Mousse gave some arguments that the results may not be so good in this case... Maybe some personal experience in a practical matter? – Kobe-Wan Kenobi Jan 08 '15 at 10:30
  • @pera: You're very welcome. I'm glad you liked it. As for more detailed feedback or guidance, unfortunately, I don't have enough expertise on the subject to offer you more advice. At least, at the present time. I just became interested in the topic and performed brief research. I would listen to Anony-Mousse's advice - I'm sure that he knows what he is talking about. – Aleksandr Blekh Jan 08 '15 at 11:10
  • Ok, thanks again. You have +1 from me, and he gets the answer accepted, since my question is more oriented towards k-means and DTW. – Kobe-Wan Kenobi Jan 08 '15 at 15:08
  • @pera: My pleasure. Thanks for upvoting. Totally understand and agree about acceptance, no problem at all. – Aleksandr Blekh Jan 08 '15 at 16:52
  • An earlier answer said that it was a bad idea to use DTW as a distance measure for k-means clustering, yet two articles linked here (Time Series Classification and Clustering: ipython notebook, Time Series Classification and Clustering with Python: a blog post) use k-means. Is this an error on the author's part? – RECURSIVE FARTS May 19 '20 at 21:37
  • @RECURSIVEFARTS The information in the mentioned blog post and corresponding notebook does not contradict the accepted answer. In the post and notebook, DTW is used only to calculate Euclidean distance, which is then followed by application of the k-NN algorithm. Note the phrase: _"Now that we have a reliable method to determine the similarity between two time series, we can use the k-NN algorithm for classification"_. – Aleksandr Blekh May 20 '20 at 20:40
44

Do not use k-means for time series.

DTW is not minimized by the mean; k-means may not converge, and even if it converges, it will not yield a very good result. The mean is a least-squares estimator on the coordinates: it minimizes variance, not arbitrary distances, and k-means is likewise designed to minimize variance, not arbitrary distances.

Assume you have two time series: two sine waves of the same frequency and a rather long sampling period, but offset by $\pi$. Since DTW does time warping, it can align them so they match perfectly, except at the beginning and end. DTW will therefore assign a rather small distance to these two series. However, if you compute the mean of the two series, it will be a flat 0: they cancel out. The mean does not do dynamic time warping, and loses all the value that DTW got. On such data, k-means may fail to converge, and the results will be meaningless. K-means really should only be used with variance (= squared Euclidean), or some cases that are equivalent (like cosine on L2-normalized data, where the squared Euclidean distance equals $2 - 2\cos$).
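A quick numeric check of this counterexample (a sketch, assuming numpy and the tslearn package for the DTW distance):

```python
# Two sine waves offset by pi: very close under DTW, but their
# pointwise mean is flat zero, which is far from both of them.
# Assumes the tslearn package for the DTW distance.
import numpy as np
from tslearn.metrics import dtw

t = np.linspace(0, 8 * np.pi, 400)
s1 = np.sin(t)
s2 = np.sin(t + np.pi)        # the same wave, offset by pi

mean = (s1 + s2) / 2          # what a k-means centroid would compute
print(np.abs(mean).max())     # ~0: the two waves cancel out

print(dtw(s1, s2))            # small: DTW aligns the shifted waves
print(dtw(s1, mean))          # much larger: the flat mean fits neither wave
```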

Instead, compute a distance matrix using DTW, then run a hierarchical clustering such as single-link. In contrast to k-means, the series may even have different lengths.
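A minimal sketch of this pipeline (assuming the tslearn package for the DTW distance; scipy provides the single-link clustering):

```python
# Pairwise DTW distance matrix + single-link hierarchical clustering.
# Assumes tslearn for the DTW distance; scipy does the linkage.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from tslearn.metrics import dtw

# Toy data: noisy sine waves vs. noisy square waves. Note the series
# have different lengths, which DTW handles and k-means cannot.
rng = np.random.default_rng(42)
series = [np.sin(np.linspace(0, 6 * np.pi, n)) + 0.1 * rng.normal(size=n)
          for n in (80, 90, 100)]
series += [np.sign(np.sin(np.linspace(0, 6 * np.pi, n))) + 0.1 * rng.normal(size=n)
           for n in (85, 95, 105)]

# Condensed upper-triangular distance vector, as scipy's linkage expects.
n = len(series)
condensed = np.array([dtw(series[i], series[j])
                      for i in range(n) for j in range(i + 1, n)])

Z = linkage(condensed, method="single")        # single-link on DTW distances
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                                  # e.g. [1 1 1 2 2 2]
```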

Has QUIT--Anony-Mousse
  • Thank you for your answer, Sir. I understand your point of view, but I would like to know if there is some way to make DTW work with clustering algorithms that produce centroids, such as K-means? – Kobe-Wan Kenobi Jan 06 '15 at 18:20
  • Well, there is of course PAM (k-medoids), which works with arbitrary distances; it is one of many algorithms that support arbitrary distances - k-means does not. Other choices are DBSCAN, OPTICS, CLARANS, HAC, ... – Has QUIT--Anony-Mousse Jan 06 '15 at 19:29
  • Hmm, thank you. Correct me if I'm wrong, but only k-medoids produces a centroid-like result? I'm familiar with k-medoids, but the thing is that I also need a scalable solution that I could use on top of Hadoop. Do you have any idea whether this algorithm could be implemented as MapReduce? – Kobe-Wan Kenobi Jan 06 '15 at 20:53
  • One more thing, would k-medoids give better results than k-means when used with DTW? What is the best-known way of clustering time series? – Kobe-Wan Kenobi Jan 06 '15 at 21:10
  • Probably. Because k-medoids uses the DTW-medoid for finding the cluster center, not the L2 mean. I don't know of any real-world successful clustering of time series. I believe I've seen papers, but none that really *used* the result. Only proof-of-concepts. – Has QUIT--Anony-Mousse Jan 07 '15 at 20:54
  • @Aleksandr Blekh gave this as one of his examples: http://nbviewer.ipython.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb What is your opinion about it? – Kobe-Wan Kenobi Jan 08 '15 at 10:26
  • Toy problems. Useless in the real world. Real data has plenty of noise, which will hurt much more than the smooth sine curves and patterns presented in this data. – Has QUIT--Anony-Mousse Jan 08 '15 at 12:43
  • Ok, I have accepted your answer, and I'm keeping in mind the problems you have presented. Just as a summary: if I'm working with time series and using DTW, would you recommend hierarchical clustering, or k-medoids if I need something that will represent my clusters? – Kobe-Wan Kenobi Jan 08 '15 at 15:14
  • I think hierarchical clustering is the better choice. You won't be able to process a huge number of series anyway. – Has QUIT--Anony-Mousse Jan 08 '15 at 15:31
  • Why not? What is the problem with a large number of series? – Kobe-Wan Kenobi Jan 08 '15 at 15:44
  • A good single-link implementation is $O(n^2)$; naive implementations, other linkages, PAM etc. are more like $O(n^3)$. But you don't want to compute $O(n^2)$ DTW distances either. – Has QUIT--Anony-Mousse Jan 08 '15 at 19:58
  • What about clustering SAX representations of time series using jmotif or a custom implementation? – Qbik Feb 17 '15 at 22:27
  • SAX are symbolic approximations. Good for pruning in retrieval, but how would you use them in clustering? It doesn't make sense to run k-means on them. – Has QUIT--Anony-Mousse Feb 17 '15 at 22:38
  • @Anony-Mousse this is a great resource for us, thanks! I am curious about your understanding of presenting a prototype time series for a cluster, after clustering. Specifically, I mean that after clustering is done (either PAM or hierarchical), I would like to understand the prototypical series within each cluster. The simple solution seems to be calculating the mean for each time point per cluster. Is that meaningful after clustering with DTW distance? – B_Miner May 24 '16 at 14:15
  • @B_Miner: Computing **the mean does not take time warping into account**. So assuming you have two long sine signals, shifted by half their phase (and you allow shifts of this length in DTW), they would be very similar by DTW, but would average to 0. So no, the mean may be meaningless on such data. I'd go with the medoid if you use PAM. – Has QUIT--Anony-Mousse May 24 '16 at 17:38
  • @Anony-Mousse what about using DTW and hierarchical clustering with a method requiring Euclidean distances, such as 'centroid', 'median' or 'ward'? I have seen [other questions](https://stats.stackexchange.com/questions/137291/hierarchical-clustering-linkage-methods-and-dynamic-time-warping) discussing this, but it is unclear whether it can result in a valid solution. – DimP Apr 14 '17 at 19:20
  • What's the explanation/motivation for preferring linkage='single'? – yaseco Oct 22 '19 at 13:16
  • Easy to understand what it does, hence a good starting point. – Has QUIT--Anony-Mousse Oct 22 '19 at 23:06
1

A recent method, DTW Barycenter Averaging (DBA), has been proposed by Petitjean et al. to average time series. In another paper, they showed empirically and theoretically how it can be used to cluster time series with k-means. An implementation is provided on GitHub by the authors (link to code).
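For illustration, here is a sketch using the tslearn library, which ships an implementation of DBA and a DTW k-means built on it (tslearn's API is an assumption here, distinct from the authors' own GitHub code):

```python
# k-means with DTW distances and DBA centroids, via tslearn's
# implementation of DTW Barycenter Averaging.
from tslearn.clustering import TimeSeriesKMeans
from tslearn.generators import random_walks

X = random_walks(n_ts=50, sz=32, d=1)    # 50 toy univariate series
km = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=10,
                      random_state=0)    # centroids are DBA averages
labels = km.fit_predict(X)
print(labels[:10])
print(km.cluster_centers_.shape)         # (3, 32, 1): one DBA centroid per cluster
```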

[1] F. Petitjean, G. Forestier, G. I. Webb, A. E. Nicholson, Y. Chen and E. Keogh, "Dynamic Time Warping Averaging of Time Series Allows Faster and More Accurate Classification," 2014 IEEE International Conference on Data Mining, Shenzhen, 2014.

[2] F. Petitjean, P. Gançarski, "Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment," Theoretical Computer Science, Volume 414, Issue 1, 2012.

1

Dynamic Time Warping compares the realized data points, which may or may not work well. A more rigorous approach is to compare the distributions of the time series by way of a metric called the telescope distance.

The cool thing about this metric is that the empirical calculation is done by fitting a series of binary classifiers, such as SVMs.

For a brief explanation, see this.

For clustering time series, it has been shown to outperform DTW; see Table 1 in the original paper [1].
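To make the idea concrete, here is a simplified sketch of the classifier-based construction. This only illustrates the idea, not the full telescope distance (which involves a supremum over families of classifiers); the window width, step, and SVM choice are arbitrary assumptions:

```python
# Simplified sketch of the classifier-based idea (NOT the full telescope
# distance of Ryabko & Mary 2013): if a binary classifier can reliably
# separate windows of series A from windows of series B, their
# distributions are far apart; chance-level accuracy means they are close.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def windows(series, width=20, step=5):
    """Slice a 1-D series into overlapping windows (one row each)."""
    return np.array([series[i:i + width]
                     for i in range(0, len(series) - width + 1, step)])

def classifier_distance(a, b, width=20):
    """Proxy distance in [0, 1] from cross-validated SVM accuracy."""
    Xa, Xb = windows(a, width), windows(b, width)
    X = np.vstack([Xa, Xb])
    y = np.r_[np.zeros(len(Xa)), np.ones(len(Xb))]
    acc = cross_val_score(SVC(), X, y, cv=5).mean()
    return max(0.0, 2.0 * acc - 1.0)   # accuracy 0.5 (chance) -> distance 0

rng = np.random.default_rng(0)
t = np.linspace(0, 40, 800)
a = np.sin(t) + 0.2 * rng.normal(size=t.size)
b = np.sin(t) + 0.2 * rng.normal(size=t.size)        # same process as a
c = np.sign(np.sin(t)) + 0.2 * rng.normal(size=t.size)  # square wave

print(classifier_distance(a, b))   # expected to be small (same distribution)
print(classifier_distance(a, c))   # expected to be clearly larger
```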

[1] Ryabko, D., & Mary, J. (2013). A binary-classification-based metric between time-series distributions and its use in statistical and learning problems. The Journal of Machine Learning Research, 14(1), 2837-2856.

horaceT
0

Yes. A naive and potentially slow approach might be:

  1. Create all candidate center combinations, where $k$ is the cluster count and $n$ is the number of series. The number of combinations is $n!/(k!(n-k)!)$. These are the potential centers.
  2. For each series, calculate the DTW distance to each candidate center and assign the series to the nearest one.
  3. For each candidate grouping, calculate the total within-cluster distance.
  4. Choose the grouping with the minimum total distance.

I used this for a small project. Here is my repository about Time Series Clustering, and my other answer about this.
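A minimal sketch of this brute-force search (assuming tslearn for the DTW distance; the function and variable names here are made up for illustration):

```python
# Brute-force sketch of the procedure above: try every way to pick k of
# the n series as centers, assign each series to its nearest center by
# DTW, and keep the grouping with the smallest total distance.
# Cost grows as C(n, k), so this is only feasible for tiny n.
from itertools import combinations

import numpy as np
from tslearn.metrics import dtw

def brute_force_dtw_clusters(series, k):
    best_cost, best_assignment = np.inf, None
    for centers in combinations(range(len(series)), k):  # n!/(k!(n-k)!) sets
        # Assign every series to the nearest candidate center by DTW.
        assignment = [min(centers, key=lambda c: dtw(s, series[c]))
                      for s in series]
        cost = sum(dtw(s, series[c]) for s, c in zip(series, assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost

rng = np.random.default_rng(1)
t = np.linspace(0, 6 * np.pi, 60)
series = [np.sin(t) + 0.1 * rng.normal(size=t.size) for _ in range(3)]
series += [np.sign(np.sin(t)) + 0.1 * rng.normal(size=t.size) for _ in range(3)]
labels, cost = brute_force_dtw_clusters(series, k=2)
print(labels, cost)   # each entry is the index of that series' center
```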

Dogan Askan