I have the following data:
    type  distance
0      X     12572
1      X     11229
2      Y     14144
3      A     15781
4      A     15486
5      B       461
6      X       328
7      X        23
8      X        50
9      A        45
10     A       231
11     A     10779
12     X     11433
...  ...       ...
The type column refers to the data point's category. The distance column is the distance between consecutive data points. That is, the distance between the data points at index 0 and index 1 is 12572, the distance between the second and third data points is 11229, etc.
One can think of this set of data points as lying along one dimension. The identity (i.e. type) of a data point is irrelevant to this problem. I am interested in somehow inferring the "clusters" of data points that occur when data points are spaced closely together. In this case, it looks clear that the data points at indices 5-11 form one grouping.
One-dimensional clustering algorithms come to mind. However, there is a natural structure to this dataset: when consecutive distances are less than 10,000, there is normally a cluster. Simply binning by hand might be more appropriate.
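To make the by-hand rule concrete, here is a minimal sketch of what I mean, using only the 13 rows shown above (the rest of the data is elided). The 10,000 cutoff is my rule of thumb rather than anything inferred from the data, and the names `THRESHOLD` and `split_by_gap` are just illustrative.

```python
# Gaps from the table above: distances[i] is the gap between point i and point i + 1.
distances = [12572, 11229, 14144, 15781, 15486, 461, 328, 23, 50, 45, 231, 10779, 11433]

THRESHOLD = 10_000  # assumed cutoff, taken from the rule of thumb above


def split_by_gap(gaps, threshold):
    """Group consecutive point indices, starting a new group at every gap >= threshold."""
    groups = [[0]]  # point 0 always opens the first group
    for i, gap in enumerate(gaps):
        if gap < threshold:
            groups[-1].append(i + 1)  # point i + 1 is close to point i: same group
        else:
            groups.append([i + 1])    # large gap: point i + 1 starts a new group
    return groups


groups = split_by_gap(distances, THRESHOLD)
clusters = [g for g in groups if len(g) > 1]  # ignore isolated points
print(clusters)  # [[5, 6, 7, 8, 9, 10, 11]]
```

This recovers the grouping at indices 5-11 that I picked out by eye, but it depends entirely on a hand-chosen threshold, which is what I would like to replace with something more principled.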
Is there a method for this problem based on probabilistic inference? Either there could be a way to infer the "natural" clustering within a given dataset (though that's ill-defined), or perhaps a way to use part of the dataset as a training set?