
Having finished Coursera's Machine Learning course, I would like to put the theory into practice. Thanks in advance for guiding a newbie!

In particular, I am looking for guidance on:

  1. Some sample longitudinal data that would illustrate k-means clustering

  2. How to include the time dimension in the analysis. Say I collected 10 days' worth of data, capturing long/lat every 5 minutes; I would expect a pattern at hour x every day.

simonso

2 Answers


There are a number of very good references on this matter. Three I can immediately think of are:

  1. Functional clustering and identifying substructures of longitudinal data by Chiou and Li (2007)
  2. Clustering for Sparsely Sampled Functional Data by James and Sugar (2003) and
  3. Distance-based clustering of sparsely observed stochastic processes by Peng and Mueller (2008)

For your particular problem, I would argue (very briefly) that instead of running $k$-means on the data matrix itself, you calculate the principal components of your data (clearly after smoothing and interpolating your data onto a common grid). You would then perform the $k$-means clustering on the principal components' scores. This two-step approach will almost certainly allow you to visualize your data clustering more effectively.
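A minimal sketch of this two-step approach in Python with scikit-learn; the toy curves, the irregular sampling, and the common grid are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy longitudinal data: 30 subjects, each observed at irregular times.
# Two latent groups with opposite mean trajectories (+sin vs. -sin).
common_grid = np.linspace(0, 1, 50)
curves = []
for i in range(30):
    t = np.sort(rng.uniform(0, 1, 20))      # irregular observation times
    sign = 1.0 if i % 2 == 0 else -1.0
    y = sign * np.sin(2 * np.pi * t) + rng.normal(0, 0.1, t.size)
    # Interpolate each subject's curve onto the common grid
    curves.append(np.interp(common_grid, t, y))
X = np.vstack(curves)                       # (subjects x grid points)

# Step 1: PCA on the interpolated curves
scores = PCA(n_components=2).fit_transform(X)

# Step 2: k-means on the PC scores rather than the raw data matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(labels)
```

Because the scores live in two dimensions, a simple scatter plot of the scores colored by cluster label makes the grouping easy to inspect visually.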

Other approaches (mostly non-parametric clustering) also exist, but I think they are overkill at this point. Jacques and Preda (2013) have recently provided an excellent survey on the matter: Functional data clustering: a survey (I tried to link to author-provided reprints where possible).

usεr11852
  • I think that, since the data set could also be considered as _time series_, _dynamic time warping (DTW)_ approach is applicable as well: http://stats.stackexchange.com/a/131284/31372. – Aleksandr Blekh Jan 19 '15 at 09:48
  1. Google's My Tracks Android app allows output of long/lat. However, I wrote my own client to capture the data every 5 minutes.

  2. Time dimension - depending on how you want to do it... I "normalize" the data upfront into hourly buckets, so the grouping makes more sense. For example:


lat      long       hour
37.88    -122.22    11
37.88    -122.22    11
37.88    -122.22    11
37.88    -122.22    11
37.33    -122.50    12
37.33    -122.51    12
37.33    -122.52    12
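A sketch of that hourly normalization step; the 5-minute samples below are hypothetical (timestamps in minutes since midnight), and the fixes within each hour are averaged into one row:

```python
from collections import defaultdict

# Hypothetical 5-minute GPS samples: (lat, long, timestamp in minutes)
samples = [
    (37.88, -122.22, 660), (37.88, -122.23, 665), (37.87, -122.22, 670),
    (37.33, -122.50, 720), (37.33, -122.50, 725),
]

# Bucket the 5-minute fixes by hour
by_hour = defaultdict(list)
for lat, lon, minutes in samples:
    by_hour[minutes // 60].append((lat, lon))

# Average each hour's fixes into a single (lat, long, hour) row
rows = []
for hour, fixes in sorted(by_hour.items()):
    lat = sum(f[0] for f in fixes) / len(fixes)
    lon = sum(f[1] for f in fixes) / len(fixes)
    rows.append((round(lat, 2), round(lon, 2), hour))

print(rows)  # [(37.88, -122.22, 11), (37.33, -122.5, 12)]
```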

The k-means algorithm, if implemented properly, can handle a matrix. Using Coursera's Machine Learning exercise #8, I modified it to handle and visualize 3-dimensional data. Not too bad.

I don't think more than 3 dimensions can be visualized directly, though a vectorized implementation will still work for any number of features.
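A sketch of clustering such (lat, long, hour) rows with scikit-learn's k-means; the two "anchor" locations and their spreads are made up, and note the column standardization, since degrees and hours are not comparable units:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical (lat, long, hour) rows around two daily anchor locations
home = np.array([37.88, -122.22, 8.0])
work = np.array([37.33, -122.50, 13.0])
X = np.vstack([
    home + rng.normal(0, [0.01, 0.01, 0.5], (50, 3)),
    work + rng.normal(0, [0.01, 0.01, 0.5], (50, 3)),
])

# Standardize each column: raw degrees and hours are on different scales,
# and k-means uses Euclidean distance across all features.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xs)
print(km.labels_)

# With 3 features, a 3-D scatter colored by km.labels_ (e.g. via
# matplotlib's mplot3d) is about the limit of direct visualization.
```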

Cheers, Simon

simonso