
According to the explanation here, the DBSCAN eps value is just the step size, but the distances within a resulting cluster can be much larger. I take this to mean that all elements of other clusters may be separated from the first cluster by a distance greater than eps, while any two elements within a cluster can be connected by a chain of steps, each no longer than eps. However, when I started to use DBSCAN in scikit-learn, I found that the eps I need is surprisingly big:
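To illustrate the chaining behaviour described above on toy data of my own (not the matrix below): four 1-D points spaced 0.5 apart form a single chain, so eps = 0.6 links them all into one cluster even though the cluster's diameter (1.5) is much larger than eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four collinear points, each 0.5 from its neighbor.
X = np.array([[0.0], [0.5], [1.0], [1.5]])

# eps = 0.6 > 0.5, so every point reaches its neighbor and the chain
# connects all four, despite the end points being 1.5 apart.
labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)
print(labels)  # [0 0 0 0] -- one cluster spanning a diameter of 1.5
```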

I have the following distance matrix (apologies that it does not fit neatly on the page):

[[-0.    0.444  0.187  0.421  0.215  0.412  0.249  0.466  0.249  0.47   0.269  0.437  0.299  0.438]
 [ 0.444 0.     0.411  0.193  0.416  0.266  0.37   0.322  0.46   0.28   0.446  0.322  0.464  0.356]
 [ 0.187 0.411 -0.     0.437  0.185  0.418  0.245  0.483  0.263  0.472  0.244  0.419  0.306  0.411]
 [ 0.421 0.193  0.437 -0.     0.423  0.24   0.396  0.326  0.473  0.271  0.424  0.342  0.448  0.371]
 [ 0.215 0.416  0.185  0.423 -0.     0.414  0.242  0.465  0.269  0.464  0.211  0.399  0.271  0.406]
 [ 0.412 0.266  0.418  0.24   0.414 -0.     0.362  0.276  0.436  0.168  0.378  0.286  0.463  0.301]
 [ 0.249 0.37   0.245  0.396  0.242  0.362 -0.     0.426  0.262  0.422  0.197  0.341  0.281  0.428]
 [ 0.466 0.322  0.483  0.326  0.465  0.276  0.426  0.     0.415  0.222  0.463  0.271  0.448  0.286]
 [ 0.249 0.46   0.263  0.473  0.269  0.436  0.262  0.415  0.     0.441  0.247  0.531  0.268  0.421]
 [ 0.47  0.28   0.472  0.271  0.464  0.168  0.422  0.222  0.441 -0.     0.446  0.276  0.473  0.273]
 [ 0.269 0.446  0.244  0.424  0.211  0.378  0.197  0.463  0.247  0.446 -0.     0.428  0.173  0.416]
 [ 0.437 0.322  0.419  0.342  0.399  0.286  0.341  0.271  0.531  0.276  0.428 -0.     0.459  0.328]
 [ 0.299 0.464  0.306  0.448  0.271  0.463  0.281  0.448  0.268  0.473  0.173  0.459 -0.     0.464]
 [ 0.438 0.356  0.411  0.371  0.406  0.301  0.428  0.286  0.421  0.273  0.416  0.328  0.464 -0.   ]]

You can see that the elements of my vector split into two classes.

I decided to find the maximum distance to the 3rd nearest neighbor across this matrix, which I thought should guarantee finding both clusters correctly (since the algorithm follows chains):

import numpy as np

l2r = L2.ravel()

def third_nearest_neighbor_distance(i, l2r):
    dists = [abs(l2r[i] - l2r[j]) for j in range(len(l2r)) if i != j]
    dists = np.sort(dists)
    return dists[6]  # we have values doubled

k3n_d = []
for i in range(len(l2r)):
    k3n_d.append(third_nearest_neighbor_distance(i, l2r))

eps = np.max(k3n_d)

which gave me eps = 0.058 and no detection of clusters:

from sklearn.cluster import DBSCAN

def cluster(distance_matrix, eps=0.28):
    db = DBSCAN(eps=eps, min_samples=2, metric="precomputed")
    return db.fit_predict(distance_matrix)

clusters = cluster(L2, eps)

Output:

clusters:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

Taking just half of the biggest difference in the matrix yields a much better result:

diffs = [abs(i - j) for i in l2r for j in l2r if i != j]  # pairwise differences
eps = np.max(diffs) / 2


supposed eps: 0.265617753868
clusters:
[ 0  1  0  1  0  1  0  1  0  1  0 -1  0 -1]

Finally, as you may have guessed from the default in my cluster function header, the eps that actually works is 0.28-0.29, which seems huge.

What am I doing wrong, and how should I derive eps for my case?

Slowpoke
  • Can you please plot your data? When it comes to clustering, plotting your data is probably even more important than with standard regression analysis cases. You may want to consider looking into [OPTICS](https://en.wikipedia.org/wiki/OPTICS_algorithm) if you have problems picking your `eps`. [ELKI](https://elki-project.github.io) has some good implementations. – usεr11852 Oct 01 '17 at 20:46
  • you can check this answer for the correct eps value https://stats.stackexchange.com/a/541340/333125 – Luis Felipe Oct 01 '21 at 21:15

1 Answer


The label -1 is not a cluster.

It's noise.

Also, note that epsilon applies to core points only.
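A minimal sketch of both points on toy 1-D data (my own, not the question's matrix): -1 marks noise, and eps is tested against core points, so an isolated point with no neighbor within eps can never become a core point and is labelled -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three close points plus one isolated outlier at 10.0.
X = np.array([[0.0], [0.2], [0.4], [10.0]])

# The first three points each have a neighbor within eps=0.5 and become
# core points of cluster 0; the outlier has none and is labelled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # [ 0  0  0 -1]
```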

Has QUIT--Anony-Mousse
  • Yes, I know that label -1 is noise. The result is better in the sense that only two points were not assigned. When I set eps to 0.28, the assignment is complete. Also, I am actually trying to make my points core points: "A point p is a core point if at least minPts points are within distance ε". I am taking the maximum distance to the k-th nearest neighbor as eps. – Slowpoke Oct 04 '17 at 08:08
  • For example, [sklearn's default core points number(min_samples) is 5](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). So I should get clustering working when I set `dists[20]` in `third_nearest_neighbor_distance()` (i.e. fifth element computed four times as I compute all the distances in the symmetric matrix). But thus I get still insufficient eps. – Slowpoke Oct 04 '17 at 08:28
  • Define "insufficient", and what's the logic of your approach above? – Has QUIT--Anony-Mousse Oct 04 '17 at 20:54
  • Why would `ravel` be correct? `L2` is supposed to be a distance matrix. Why flatten it, and then use only the first row? – Has QUIT--Anony-Mousse Oct 04 '17 at 21:37
  • At this step my input matrix (L2 distance matrix...) is just some abstract matrix with some elements. I want to provide the clustering algorithm an eps = the radius which contains 3-5 nearest points. I am interested only in the value of this distance, so I might forget about the matrix structure, compute pairwise distances between every two elements of the matrix, and collect the value of the k-th. – Slowpoke Oct 04 '17 at 22:42
  • As long as each distance is actually computed _four_ times in my inefficient approach (a->b, b->a, and once more because the matrix is symmetric), I multiply the index by four, so I am interested in the 4*k-th distance. (I have an error in the code above, where I take only 2*k.) – Slowpoke Oct 04 '17 at 22:42
  • Regarding your question on "insufficient" eps, I will update the post little bit later to show my computation attempt on this matrix and similar ones. – Slowpoke Oct 04 '17 at 22:44
  • The `l2r=L2.ravel()` still does not make any sense. You throw O(n²) values together. Work on rows, not a serialized matrix. – Has QUIT--Anony-Mousse Oct 05 '17 at 05:38
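A sketch of the row-based k-distance computation suggested in the comment above (my own toy matrix, not the question's): for a precomputed distance matrix D, sort each row and take the distance to the k-th nearest neighbor; column 0 of each sorted row is the self-distance. The maximum (or a knee in the sorted curve) of these per-point values is the usual eps heuristic.

```python
import numpy as np

def kth_nearest_distance(D, k):
    """Distance from each point to its k-th nearest neighbor (excluding itself)."""
    D = np.asarray(D)
    sorted_rows = np.sort(D, axis=1)  # each row ascending; column 0 is the 0 self-distance
    return sorted_rows[:, k]          # k-th neighbor distance, per point

# Toy symmetric distance matrix for three points on a line at 0, 1, 3.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
print(kth_nearest_distance(D, 1))  # nearest-neighbor distances: [1. 1. 2.]
```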