I’m looking to implement density-based clustering with R or Mathematica on a giant file (600,000 points on a 3 billion x 3 billion plane). Is DBSCAN the right method for data that is this sparse? I am also anticipating a huge amount of noise. Which parameters would I tweak to account for this?

nhwalnut
  • Why the close votes? OK, so the user ultimately wants to use R or Mathematica, but this is not a code-request question. There are some clear statistical aspects in the OP's question (i.e. the use of DBSCAN with a large, sparse, noisy dataset). Maybe one could suggest the question [here](http://stats.stackexchange.com/questions/11418/density-based-spatial-clustering-of-applications-with-noise-dbscan-clustering) as a possible duplicate (I think it is not), but that is a totally different criticism. – usεr11852 Jul 20 '16 at 22:37

2 Answers

Given that you have a large as well as noisy dataset, I strongly recommend performing a dimensionality-reduction step prior to clustering. This should filter out potentially irrelevant variation and let the clustering algorithm work in a lower-dimensional space. Principal Component Analysis (PCA) and locality-sensitive hashing (LSH) are two standard approaches.
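
For concreteness, here is a minimal sketch of such a step in base R using prcomp (the matrix X, its dimensions and the choice of two retained components below are made-up placeholders for illustration, not anything from the question):

# Illustrative PCA step with base R's prcomp (X is a stand-in for your data)
X  <- matrix(rnorm(1000 * 10), nrow = 1000)   # simulated 1000 x 10 matrix
pc <- prcomp(X, center = TRUE, scale. = TRUE)
X_reduced <- pc$x[, 1:2]                      # keep the first two components
# X_reduced would then be passed on to the clustering step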

Detecting density-based clusters in high-dimensional spaces can be very demanding even with noiseless data. High-dimensional density estimation is a typical scenario where the curse of dimensionality manifests itself. DBSCAN ultimately relies on finding fixed-radius nearest neighbours for each point, and as the dimensionality of the data increases this nearest-neighbour (NN) finding task becomes harder and harder. In addition (and most importantly), a standard distance metric such as Euclidean distance becomes increasingly uninformative, because pairwise distances concentrate: the nearest and farthest neighbours of a point end up almost equally far away. Therefore, even if we have a distance and a neighbourhood to work with, that information is not very useful. This association between the curse of dimensionality and NN-related tasks has been touched upon many times on CV, e.g. see 1, 2, 3, 4.
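
To see this concentration effect directly, here is a toy simulation of my own (uniform data, arbitrary sizes, nothing from the question): as the dimension p grows, the gap between a point's nearest and farthest neighbour shrinks relative to the distances themselves.

# Toy simulation: relative contrast between farthest and nearest neighbour
set.seed(42)
for (p in c(2, 10, 100, 1000)) {
  X <- matrix(runif(500 * p), nrow = 500)
  d <- as.matrix(dist(X))[1, -1]   # distances from point 1 to all other points
  cat(sprintf("p = %4d: (max - min) / min = %.3f\n",
              p, (max(d) - min(d)) / min(d)))
}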

By the way, something "simple" like the following script, where $N$ is considerably larger than the 600$k$ points in your case, runs on my laptop (Intel i5 U-series) in under 5 minutes using ~10 GB of RAM. This is because the fixed-radius nearest-neighbour problem mentioned above is solved within the dbscan library using $k$-d trees; most NN-finding routines use some indexing or approximation approach, since otherwise even cases with just a few tens of thousands of points would become prohibitively expensive given the $O(n^2)$ requirements of a brute-force search. So while not "instant", a use case with 600$k$ points is definitely doable in R given a standard workstation and some appropriate dimensionality reduction.

library(dbscan)

N <- 10^7; p <- 50
# N points in p dimensions drawn from a heavy-tailed t distribution (df = 2)
Q <- matrix(rt(N * p, df = 2), nrow = N)
W <- dbscan(Q, eps = 0.01, minPts = 100)  # runs in about 4.5 minutes
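
To sanity-check the output: the object returned by dbscan stores the assignments in W$cluster, with cluster id 0 reserved for noise points, so one can quickly gauge how much of the data was discarded as noise.

table(W$cluster)        # cluster sizes; cluster 0 collects the noise points
mean(W$cluster == 0)    # fraction of points labelled as noise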
usεr11852

DBSCAN will work, but you need a good index structure. Without one, the neighbourhood queries degenerate to $O(n^2)$ pairwise comparisons, which is too expensive at this scale.
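
For instance, R's dbscan package answers the fixed-radius neighbourhood queries through a k-d tree index by default; below is a rough sketch of my own (simulated data, arbitrary parameters) contrasting it with the brute-force linear search via the package's search argument:

library(dbscan)

set.seed(1)
X <- matrix(rnorm(2 * 20000), ncol = 2)   # simulated 2-d data
# k-d tree index (the default): fast in low dimensions
system.time(dbscan(X, eps = 0.1, minPts = 10, search = "kdtree"))
# Brute-force linear search: O(n^2) distance computations
system.time(dbscan(X, eps = 0.1, minPts = 10, search = "linear"))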

If you have a lot of noise, increase minPts.
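
A toy example of my own (simulated blobs plus uniform background, not the asker's data) illustrating the effect: with eps held fixed, a larger minPts pushes more of the diffuse background into the noise label (cluster 0).

library(dbscan)

set.seed(7)
# Two tight Gaussian clusters plus a diffuse uniform background
X <- rbind(matrix(rnorm(2000, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(2000, mean = 3, sd = 0.2), ncol = 2),
           matrix(runif(1000, min = -2, max = 5), ncol = 2))
# Small minPts: stray background points can seed spurious micro-clusters
sum(dbscan(X, eps = 0.25, minPts = 4)$cluster == 0)
# Larger minPts: more of the background is correctly labelled as noise
sum(dbscan(X, eps = 0.25, minPts = 25)$cluster == 0)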

Has QUIT--Anony-Mousse
  • Can you kind of expand a bit? I always enjoy reading your insightful answers/comments on clustering issues, but this post needs some extra context. How does `minPts` relate to what the user might experience noise-wise? How would an index help the time complexity? How much? Why would it be $O(n^2)$ to begin with? – usεr11852 Jul 20 '16 at 22:32
  • See the Wikipedia article for DBSCAN's complexity. If you have a lot of noise, a larger `minPts` (i.e. a larger neighbourhood sample) provides a more reliable density estimate. – Has QUIT--Anony-Mousse Jul 21 '16 at 07:41
  • That section of the Wikipedia article, while relating to your answer on the use of index structures, has nothing to do with the relation between *noise* and `minPts`. You might have wanted to comment that with a larger `minPts` it is more probable that outliers/noisy points are not erroneously incorporated into a cluster/neighbourhood, but that is not explained in your post or your comment... – usεr11852 Apr 01 '17 at 10:14