According to the explanation here, DBSCAN eps value is just the step size, but the resulting distances in the cluster can be much bigger. I think it means, that all the elements of other clusters are possibly separated from the first one by the distance, higher than eps - but we can draw the chain with chunk size of eps inside any two elements of the cluster. However, when I started to use DBSCAN in scipy, I found that eps I need is surprisingly big:
I have the following (distance) matrix (please, excuse me to not being able to fit it into page):
[[-0. 0.444 0.187 0.421 0.215 0.412 0.249 0.466 0.249 0.47 0.269 0.437 0.299 0.438]
[ 0.444 0. 0.411 0.193 0.416 0.266 0.37 0.322 0.46 0.28 0.446 0.322 0.464 0.356]
[ 0.187 0.411 -0. 0.437 0.185 0.418 0.245 0.483 0.263 0.472 0.244 0.419 0.306 0.411]
[ 0.421 0.193 0.437 -0. 0.423 0.24 0.396 0.326 0.473 0.271 0.424 0.342 0.448 0.371]
[ 0.215 0.416 0.185 0.423 -0. 0.414 0.242 0.465 0.269 0.464 0.211 0.399 0.271 0.406]
[ 0.412 0.266 0.418 0.24 0.414 -0. 0.362 0.276 0.436 0.168 0.378 0.286 0.463 0.301]
[ 0.249 0.37 0.245 0.396 0.242 0.362 -0. 0.426 0.262 0.422 0.197 0.341 0.281 0.428]
[ 0.466 0.322 0.483 0.326 0.465 0.276 0.426 0. 0.415 0.222 0.463 0.271 0.448 0.286]
[ 0.249 0.46 0.263 0.473 0.269 0.436 0.262 0.415 0. 0.441 0.247 0.531 0.268 0.421]
[ 0.47 0.28 0.472 0.271 0.464 0.168 0.422 0.222 0.441 -0. 0.446 0.276 0.473 0.273]
[ 0.269 0.446 0.244 0.424 0.211 0.378 0.197 0.463 0.247 0.446 -0. 0.428 0.173 0.416]
[ 0.437 0.322 0.419 0.342 0.399 0.286 0.341 0.271 0.531 0.276 0.428 -0. 0.459 0.328]
[ 0.299 0.464 0.306 0.448 0.271 0.463 0.281 0.448 0.268 0.473 0.173 0.459 -0. 0.464]
[ 0.438 0.356 0.411 0.371 0.406 0.301 0.428 0.286 0.421 0.273 0.416 0.328 0.464 -0. ]]
You can see, that elements of my vector split into two classes.
I decided to find the maximum distance to the 3rd nearest neighbor across this matrix, which I think should guarantee finding both clusters correctly (because algorithm follows chains):
l2r = L2.ravel()
def third_nearest_neighbor_distance(i, l2r):
dists = [abs(l2r[i]-l2r[j]) for j in range(len(l2r)) if i != j]
dists = np.sort(dists)
return dists[6] # we have values doubled
k3n_d = []
for i in range(0,len(l2r)):
k3n_d.append(third_nearest_neighbor_distance(i, l2r))
eps = np.max(k3n_d)
which gave me eps = 0.058
and no detection of clusters:
from sklearn.cluster import DBSCAN
def cluster(distance_matrix, eps=0.28):
db = DBSCAN(eps=eps,min_samples=2,metric="precomputed")
return db.fit_predict(distance_matrix)
clusters = cluster(L2,eps)
Output:
clusters:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
Taking just the half of the biggest difference in the matrix yields much better result:
diffs = [abs(i-j) for i in l2r for j in l2r if i != j] #differences
eps = np.max(diffs)/2
supposed eps: 0.265617753868
clusters:
[ 0 1 0 1 0 1 0 1 0 1 0 -1 0 -1]
At last, you may have already guessed from my cluster function header that the right eps
is 0.28-0.29 which is huge.
What am I doing wrong and how to derive eps for my case?