
I have the following data:

      type    distance
0      X      12572
1      X      11229
2      Y      14144
3      A      15781
4      A      15486
5      B      461
6      X      328
7      X      23
8      X      50
9      A      45
10     A      231
11     A      10779
12     X      11433
...      .....

`type` refers to the data point's category. `distance` is the distance between consecutive data points: the distance between the point at index 0 and the point at index 1 is 12572, the distance between the second and third points is 11229, and so on.

One can think of this set of data points as lying along one dimension. The identity (i.e. `type`) of each data point is irrelevant to this problem. I am interested in somehow inferring the "clusters" that occur when data points are spaced closely together. In this case, it looks clear that the data points at indices 5-11 form one grouping.

One-dimensional clustering algorithms come to mind. However, there is a natural structure to this dataset: if the distances are less than 10,000, there is normally a cluster. Simply binning by hand might be sufficient (a minimal sketch follows below).

Is there a method for this problem based on probabilistic inference? Either there could be a way to infer the "natural" clustering within a given dataset (though that is ill-defined), or perhaps part of the dataset could be used as a training set?
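For concreteness, here is a minimal sketch of the hand-binning baseline mentioned above. It assumes Python/pandas and treats the 10,000 figure as a hard gap threshold; both the library choice and the threshold handling are illustrative assumptions, not part of the question.

```python
# A minimal sketch of the hand-binning baseline, assuming Python/pandas
# and a fixed gap threshold of 10,000 (an assumption, not a given).
import pandas as pd

df = pd.DataFrame({
    "type":     list("XXYAABXXXAAAX"),
    "distance": [12572, 11229, 14144, 15781, 15486, 461,
                 328, 23, 50, 45, 231, 10779, 11433],
})

THRESHOLD = 10_000  # gaps at least this large separate clusters

# distance[i] is the gap from point i to point i+1, so the gap *preceding*
# point i is distance[i-1]; a new cluster starts wherever that gap is large.
preceding_gap = df["distance"].shift(1, fill_value=THRESHOLD)
df["cluster"] = (preceding_gap >= THRESHOLD).cumsum()
print(df)  # points 5-11 share one cluster label, matching the intuition above
```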

ShanZhengYang
  • The twist in your question is the existence of this prior criterion of 10,000. However, by rephrasing it as "any gap of distance larger than 10,000 must separate the clusters," you can perform a preliminary partition of the data by identifying all such gaps and then separately cluster the data within each component of that partition using the techniques described in the duplicate. – whuber May 30 '17 at 13:47
  • @whuber "any gap of distance larger than 10,000 must separate the clusters" This is somewhat tautological/circular I feel, and is equivalent to binning. If there are distances less than 10,000, then this is a cluster of points. Once partitioning the data to identify all such gaps as you suggest, isn't the problem then solved? – ShanZhengYang May 30 '17 at 13:50
  • It's also not clear to me how k-means helps here. k=2? That is, there are two groupings to this data, and I should then mark each data point as belonging to either cluster 1 or cluster 2? – ShanZhengYang May 30 '17 at 13:53
  • The problem, as you seem to have expressed it, is not solved after splitting across all gaps of 10,000 or larger: it remains to cluster the data that have not been split. For instance, in the series 1, 2, 3, 20000, 20001, 20002, 29000, 38000, 38001, you would initially split it into just two series 1,2,3 and 20000, ..., 38001, because there is only one gap larger than 10,000. It remains to cluster each of those subseries (a code sketch of this follows the comment thread). There's nothing circular about this approach. But if this is not what you intended to describe, then please feel free to change your post to make the question clearer. – whuber May 30 '17 at 14:16
  • @whuber I remain confused---I think there's something unclear above. Can you explain what you mean by "subseries", i.e. "It remains to cluster each of those subseries."? In the actual dataset, distances are either around ~10-15K or not. There are no 20K, 38001, etc. – ShanZhengYang May 30 '17 at 14:50
  • Could you explain in more detail what your data mean? What exactly is a "point"; what does the `distance` variable represent; what does the "difference" between two points mean; and how do you measure the condition of being "spaced closely"? – whuber May 30 '17 at 14:52
  • @whuber Distance is the distance from one point to the next. We're trying to identify which points are grouped together, essentially on a line. A point is really just a point---you can imagine this as modeling people queueing along a line. The "background" distances are around ~10-15K. Distances less than this, e.g. 20-3000, are the "signal", i.e. the clusters. – ShanZhengYang May 30 '17 at 15:38
  • Okay: it appears that your references to 10,000 make no difference: you just want to cluster points in one dimension. Accumulate those distances into actual coordinates and apply the methods in the duplicate. – whuber May 30 '17 at 15:59
  • @whuber So....k-means with k=2? – ShanZhengYang May 30 '17 at 16:02
  • Or any other value of $k$ you like. Or hierarchical cluster analysis. Or anything else: literally any higher-dimensional clustering algorithm can be applied, without change, to your data. – whuber May 30 '17 at 16:06
  • @whuber I don't fully understand. If there are two categories, I would think that I might end up with several clusters for "background", etc. Anyways, last (and separate) question: Does an HMM approach make sense here at all? Do you see any relevance? – ShanZhengYang May 30 '17 at 16:10
  • With two categories, you will not end up with separated clusters: one will consist of all points at locations less than a threshold $T$ and the other will consist of the remaining points. I don't know how you would apply HMM and, because I know nothing about these data, what they mean, or what you're trying to learn about them, I am in no position to offer an opinion on the relevance of any particular technique. The only justifiable solutions to an abstractly described problem must themselves be abstract. – whuber May 30 '17 at 16:15
  • @whuber Apologies, typo: if there are more than 2 categories (i.e. k>2 or perhaps a higher-dim. approach). Fair enough---thank you for the help – ShanZhengYang May 30 '17 at 16:19
  • With *any* number $k$ of categories, you won't have clusters punctuated by elements of other clusters. The solution will always be given by a set of $k-1$ thresholds $T_i$ that partition the data. You can prove this by supposing it's false and then showing there is a way to improve the clustering solution. – whuber May 30 '17 at 16:24
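The following is a minimal sketch of the recipe outlined in the comments, run on whuber's example series: accumulate the distances into 1-D coordinates, make a preliminary split at every gap of 10,000 or more, then cluster within each subseries. It assumes Python with NumPy/SciPy; the use of single-linkage hierarchical clustering and the 5,000 cutoff are illustrative choices of mine (the comments equally allow k-means or any other method).

```python
# Sketch of the comment-thread recipe: coordinates -> preliminary split at
# big gaps -> ordinary clustering within each subseries.
# NumPy/SciPy, single linkage, and the 5,000 cutoff are my assumptions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# whuber's example positions; with real data these would come from
# accumulating the `distance` column into coordinates, e.g.
# np.concatenate(([0], distances)).cumsum()
coords = np.array([1, 2, 3, 20000, 20001, 20002, 29000, 38000, 38001],
                  dtype=float)

GAP = 10_000
# Preliminary partition: split wherever consecutive points are >= GAP apart.
split_points = np.where(np.diff(coords) >= GAP)[0] + 1
subseries = np.split(coords, split_points)  # -> [1, 2, 3] and [20000, ..., 38001]

for sub in subseries:
    if len(sub) < 2:
        print(sub, "-> singleton cluster")
        continue
    # Single-linkage hierarchical clustering of the 1-D coordinates,
    # cut at an (assumed) within-subseries distance of 5,000.
    labels = fcluster(linkage(sub.reshape(-1, 1), method="single"),
                      t=5_000, criterion="distance")
    print(sub, "->", labels)
# The second subseries splits further into {20000, 20001, 20002}, {29000},
# and {38000, 38001}: exactly the "it remains to cluster each subseries" step.
```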
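To fill in the exchange argument from the last comment (a sketch of my own, written for the $k$-means objective specifically, not something stated in the thread): suppose an optimal assignment had clusters $A$ and $B$ with means $\mu_A < \mu_B$, yet contained points $a \in A$ and $b \in B$ with $a > b$. Swapping the two assignments changes the total within-cluster sum of squares by

$$\left[(b-\mu_A)^2 + (a-\mu_B)^2\right] - \left[(a-\mu_A)^2 + (b-\mu_B)^2\right] = -2\,(a-b)(\mu_B-\mu_A) < 0,$$

so the swap strictly improves the solution (updating the means afterwards improves it further), contradicting optimality. Every optimal solution is therefore contiguous and can be written as $k-1$ thresholds $T_1 < \dots < T_{k-1}$.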

0 Answers