I’ve got several thousand observations in 350-dimensional space, in a relatively sparse matrix (median observation has 11 non-zero dimensions). I'm using a density-based clustering algorithm, DBSCAN, to identify clusters and noise points (points that do not fall into clusters).

DBSCAN requires setting two parameters. The first is the minimum threshold number of points that defines a cluster. The guidance in the original paper (Ester et al. 1996) is to use the number of dimensions in the space. In a prior post (where I asked way too many questions) a helpful forum user pointed out that because the value of most dimensions is zero for most observations, the intrinsic dimensionality of my dataset is much lower than 350, and is more like the median or mean number of nonzero dimensions.
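For concreteness, here is a minimal sketch of how the "median number of nonzero dimensions" figure can be computed. The matrix `X` below is synthetic stand-in data (an assumption, not my real dataset), built so that roughly 11 of 350 columns are nonzero per row; the resulting median is the kind of value one might then try as DBSCAN's minimum-points parameter (e.g. `min_samples` in scikit-learn).

```python
import numpy as np

# Synthetic stand-in for the real data (assumption): 1000 rows, 350 columns,
# each entry nonzero with probability 11/350.
rng = np.random.default_rng(0)
n_rows, n_dims = 1000, 350
mask = rng.random((n_rows, n_dims)) < 11 / n_dims
X = mask * rng.uniform(0.5, 1.5, size=(n_rows, n_dims))

# Number of nonzero dimensions for each observation.
nnz_per_row = np.count_nonzero(X, axis=1)

# Summary statistics that could serve as a rough proxy for dimensionality.
median_nnz = float(np.median(nnz_per_row))
mean_nnz = float(nnz_per_row.mean())
print(median_nnz, mean_nnz)
```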

I've been reading about intrinsic dimensionality, but it is well outside my expertise. I am looking for a paper or two that I can cite (and learn more from) that documents why the intrinsic dimensionality is closer to the median number of nonzero dimensions. Any help is much appreciated!

herfa
  • It may help to look at my answer here: [Why are there only $n−1$ principal components for n data if the number of dimensions is $≥n$?](https://stats.stackexchange.com/a/123349/7290). It should be pretty easy to see that the literal dimensionality is $3$, but the intrinsic dimensionality is $1$. You might also find the discussion & link helpful in my answer here: [How do I know my k-means clustering algorithm is suffering from the curse of dimensionality?](https://stats.stackexchange.com/q/232500/7290) – gung - Reinstate Monica Jun 27 '18 at 19:59

1 Answer

First of all, you can estimate the actual intrinsic dimensionality at a sample of your points and simply report those numbers.
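One way to do that, as a rough sketch: the Levina-Bickel maximum-likelihood estimator, which infers a local dimensionality from the distances to each point's `k` nearest neighbours. This is one estimator among several, chosen here only for illustration; the function name `local_intrinsic_dim`, the choice `k=10`, and the synthetic data (3 informative dimensions padded with 47 all-zero columns) are assumptions of the example.

```python
import numpy as np

def local_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimensionality at each row of X.

    Assumes no duplicate rows: a zero neighbour distance would make
    the log-ratio below infinite.
    """
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distances via the Gram matrix (O(n^2) memory).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist = np.sqrt(np.maximum(d2, 0.0))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    # Sorted distances to the k nearest neighbours of every point.
    knn = np.sort(dist, axis=1)[:, :k]
    # Estimate: inverse of the mean of log(T_k / T_j) over j = 1..k-1.
    log_ratio = np.log(knn[:, -1:] / knn[:, :-1])
    return (k - 1) / np.sum(log_ratio, axis=1)

# Synthetic check: 3 informative dimensions embedded in 50.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 3))
X = np.hstack([Z, np.zeros((1000, 47))])

estimates = local_intrinsic_dim(X, k=10)
print(np.median(estimates))  # should land near 3, far below 50
```

The estimate depends on `k` and is biased for small `k`, so it is worth reporting the value of `k` alongside the numbers.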

Secondly, it should be obvious that adding a constant dimension does not affect the intrinsic dimensionality. Thus, if for most points most dimensions are 0, the intrinsic dimensionality must be lower.
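A small numerical illustration of this, under assumed synthetic data: padding 2-dimensional points with 348 constant columns leaves every pairwise distance unchanged, and the centred data still has rank 2. Since DBSCAN only sees distances, the padded columns cannot change what it does.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

# Synthetic data (assumption): 50 points that really live in 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Pad with 348 constant columns so the literal dimensionality is 350.
X_padded = np.hstack([X, np.full((50, 348), 5.0)])

# Constant columns contribute zero to every pairwise difference.
same_distances = np.allclose(pairwise_dist(X), pairwise_dist(X_padded))

# After centring, the padded data still spans only 2 directions.
rank_centered = np.linalg.matrix_rank(X_padded - X_padded.mean(axis=0))
print(same_distances, rank_centered)
```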

You don't need a citation for that.

Has QUIT--Anony-Mousse