
Having finished Coursera's Machine Learning course, I would like to put the theory into practice. Thanks in advance for guiding a newbie!

In particular, I am looking for guidance on:

  1. Some sample longitudinal data that would illustrate k-means clustering

  2. How to include the time dimension in the analysis. Say I collected 10 days' worth of data, capturing long/lat every 5 minutes; I would expect a pattern at hour x every day.

simonso

2 Answers


There are a number of very good references on this matter. Three I can immediately think of are:

  1. Functional clustering and identifying substructures of longitudinal data by Chiou and Li (2007)
  2. Clustering for Sparsely Sampled Functional Data by James and Sugar (2003) and
  3. Distance-based clustering of sparsely observed stochastic processes by Peng and Mueller (2008)

For your particular problem, I would argue (very briefly) that instead of running $k$-means on the data matrix itself, you calculate the principal components of your data (clearly after smoothing and interpolating your data onto a common grid). You would then perform the $k$-means clustering on the principal components' scores. This two-step approach will almost certainly allow you to visualize your data clustering more effectively.
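A minimal sketch of this two-step approach in Python with scikit-learn; the toy curves, the irregular sampling, and the common grid are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy longitudinal data: 30 subjects, each observed at irregular times.
# Two latent groups with opposite mean trajectories (+sin vs. -sin).
common_grid = np.linspace(0, 1, 50)
curves = []
for i in range(30):
    t = np.sort(rng.uniform(0, 1, 20))      # irregular observation times
    sign = 1.0 if i % 2 == 0 else -1.0
    y = sign * np.sin(2 * np.pi * t) + rng.normal(0, 0.1, t.size)
    # Interpolate each subject's curve onto the common grid
    curves.append(np.interp(common_grid, t, y))
X = np.vstack(curves)                       # (subjects x grid points)

# Step 1: PCA on the interpolated curves
scores = PCA(n_components=2).fit_transform(X)

# Step 2: k-means on the PC scores rather than the raw data matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(labels)
```

Because the scores live in two dimensions, a simple scatter plot of the scores colored by cluster label makes the grouping easy to inspect visually.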

Other approaches (mostly non-parametric clustering) also exist, but I think they are overkill at this point. Jacques and Preda (2013) have recently provided an excellent survey on the matter: Functional data clustering: a survey (I tried to link to author-provided reprints where possible).

usεr11852
  • I think that, since the data set could also be considered as _time series_, _dynamic time warping (DTW)_ approach is applicable as well: http://stats.stackexchange.com/a/131284/31372. – Aleksandr Blekh Jan 19 '15 at 09:48
  1. Google's My Tracks Android app allows output of long/lat. However, I wrote my own client to capture the data every 5 minutes.

  2. Time dimension - depending on how you want to do it... I "normalize" the data upfront into hourly buckets, so the grouping makes more sense. For example:


lat      long       hour
37.88    -122.22    11
37.88    -122.22    11
37.88    -122.22    11
37.88    -122.22    11
37.33    -122.50    12
37.33    -122.51    12
37.33    -122.52    12
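A sketch of that hourly normalization step; the 5-minute samples below are hypothetical (timestamps in minutes since midnight), and the fixes within each hour are averaged into one row:

```python
from collections import defaultdict

# Hypothetical 5-minute GPS samples: (lat, long, timestamp in minutes)
samples = [
    (37.88, -122.22, 660), (37.88, -122.23, 665), (37.87, -122.22, 670),
    (37.33, -122.50, 720), (37.33, -122.50, 725),
]

# Bucket the 5-minute fixes by hour
by_hour = defaultdict(list)
for lat, lon, minutes in samples:
    by_hour[minutes // 60].append((lat, lon))

# Average each hour's fixes into a single (lat, long, hour) row
rows = []
for hour, fixes in sorted(by_hour.items()):
    lat = sum(f[0] for f in fixes) / len(fixes)
    lon = sum(f[1] for f in fixes) / len(fixes)
    rows.append((round(lat, 2), round(lon, 2), hour))

print(rows)  # [(37.88, -122.22, 11), (37.33, -122.5, 12)]
```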

The k-means algorithm, if implemented properly, can handle a matrix. Using Coursera's Machine Learning exercise #8, I modified it to handle and visualize 3-dimensional data. Not too bad.

I don't think more than 3 dimensions can be visualized directly, though a vectorized implementation will still work for any number of features.
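A sketch of clustering such (lat, long, hour) rows with scikit-learn's k-means; the two "anchor" locations and their spreads are made up, and note the column standardization, since degrees and hours are not comparable units:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical (lat, long, hour) rows around two daily anchor locations
home = np.array([37.88, -122.22, 8.0])
work = np.array([37.33, -122.50, 13.0])
X = np.vstack([
    home + rng.normal(0, [0.01, 0.01, 0.5], (50, 3)),
    work + rng.normal(0, [0.01, 0.01, 0.5], (50, 3)),
])

# Standardize each column: raw degrees and hours are on different scales,
# and k-means uses Euclidean distance across all features.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xs)
print(km.labels_)

# With 3 features, a 3-D scatter colored by km.labels_ (e.g. via
# matplotlib's mplot3d) is about the limit of direct visualization.
```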

Cheers, Simon

simonso