
I have a simple problem, which I think must have an easy solution.

I have a vector space, say with 1000 dimensions for each vector.

Now, I have a large number of sample vectors from this vector space.

I also have a small subset of these vectors, and I have to tell which subset of the vectors lies in the densest region of the vector space.

Now, I am confused as to how to find this: when I take the subspace of the smaller subset, I will still get 1000-dimensional vectors, right? Those vectors will still span the entire vector space?

How do I sort these subsets according to the density of the vector space spanned by each of these subsets?

  • I could not understand this `I also have a small subset of these vectors, I have to tell which subset of the vectors`. – ttnphns Nov 01 '18 at 17:40
  • @ttnphns Yes, I mean I have a large number of vectors which occupy an n-dimensional space. I also have a bunch of vectors, not from that collection. I want to answer: which of the given bunch of vectors has, in some sense, the highest density in the space encompassed by the original large set of vectors? Do I measure the projection of each of the vectors onto the new vector to answer that, or is there some more efficient way? – Rafael Nov 03 '18 at 02:12
  • If cluster analysis is not the solution could you explain what limitations it has/how it is inappropriate for your needs? – ReneBt Nov 07 '18 at 01:09
  • @ReneBt How could I use cluster analysis as the solution? Saying that a new point belongs to a cluster != saying that it is in the densest region of the vector space, right? – Rafael Nov 07 '18 at 17:13
  • Yes, cluster analysis identifies regions of higher density within the vector space, with the group centroid corresponding to the vector shape associated with that group. Cluster analysis would identify these subset hotspots; then you can identify which of them has the highest density. – ReneBt Nov 08 '18 at 06:05

1 Answer


There are probably many ways to do this, depending on what you use to measure the "density" of each point in the vector space. However, one obvious way to make this comparison would be to use your original vectors to form a kernel density estimator (KDE) over the vector space and use this estimated density to measure the average "density" of a new set of values in that vector space.

Details: Consider the case where you have an original set of $n$ vectors $\mathbf{x}_1,...,\mathbf{x}_n \in \mathbb{R}^m$ that you are using to determine the density in the vector space. Define the KDE as:

$$p(\mathbf{x}|\lambda) \equiv \frac{1}{n \lambda^m} \sum_{i=1}^n K \Big( \frac{\mathbf{x} - \mathbf{x}_i}{\lambda} \Big) \quad \quad \quad \text{for all } \mathbf{x} \in \mathbb{R}^m.$$

The function $K$ is an appropriate density on $\mathbb{R}^m$ with zero mean, and $\lambda$ is a bandwidth parameter. You should be able to estimate the bandwidth parameter from the observed data (via maximum likelihood estimation or using some other method) to get the estimated density function:

$$\hat{p}(\mathbf{x}) = p(\mathbf{x}|\hat{\lambda}).$$
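For concreteness, here is a minimal sketch of that fitting step in Python, using scikit-learn's `KernelDensity` with the bandwidth chosen by cross-validated log-likelihood (a practical stand-in for maximum likelihood estimation of $\lambda$). The array `X`, the Gaussian kernel and the bandwidth grid are illustrative assumptions rather than anything prescribed by the question:

```python
# Minimal sketch: fit a KDE to the original sample vectors (rows of X) and
# choose the bandwidth by cross-validated log-likelihood.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))            # stand-in for the n original vectors in R^m

grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-1, 2, 20)},  # candidate values of lambda
    cv=5,
)
grid.fit(X)
kde = grid.best_estimator_                  # plays the role of the estimated density
```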

You now have an estimated density on the vector space $\mathbb{R}^m$. For any new set of values in this vector space, you can measure their mean-log-density as:

$$\begin{equation} \begin{aligned} \hat{\bar{\ell}}(\mathbf{y}_1,...,\mathbf{y}_N) &\equiv \frac{1}{N} \sum_{k=1}^N \ln \hat{p}(\mathbf{y}_k) \\[6pt] &= \frac{1}{N} \sum_{k=1}^N \Bigg[ \ln \Bigg( \sum_{i=1}^n K \Big( \frac{\mathbf{y}_k - \mathbf{x}_i}{\hat{\lambda}} \Big) \Bigg) - \ln n - m \ln \hat{\lambda} \Bigg]. \end{aligned} \end{equation}$$

This measure is essentially giving you an estimate of the mean-log-density of the new values, where the density function is estimated from the original observations. For a given set of new values in the vector space, this ought to give you a reasonable measure of how "dense" a region these values occupy.
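Continuing the sketch above, ranking candidate subsets by this mean-log-density is straightforward with `score_samples`, which returns $\ln \hat{p}(\mathbf{y}_k)$ for each row; the two subsets below are invented purely for illustration:

```python
# Score each candidate subset by its mean log-density under the fitted KDE,
# then sort from densest to least dense.
subsets = {
    "A": rng.normal(loc=0.0, size=(20, 1000)),  # near the bulk of X
    "B": rng.normal(loc=3.0, size=(20, 1000)),  # far from the bulk of X
}

mean_log_density = {
    name: kde.score_samples(Y).mean()           # (1/N) * sum_k ln p-hat(y_k)
    for name, Y in subsets.items()
}

for name, score in sorted(mean_log_density.items(), key=lambda kv: -kv[1]):
    print(f"subset {name}: mean log-density = {score:.2f}")
```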

As pointed out in the comments to this answer, there are some difficult issues that arise when you measure distances between vectors in high dimensional spaces. In particular, in high dimensional spaces, most of the "density" of a distribution is not near its mean. The present method is general enough to allow you to choose a density kernel that you think gives an appropriate measure of distance between vectors. When working in a high-dimensional space you will probably want to choose something that has "fat tails" so that it does not heavily penalise distances in a single dimension. In any case, the method allows a choice here.
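If you want a fatter-tailed kernel than the ones scikit-learn ships with, one option is to evaluate a Student-t style kernel directly on the log scale, as in the rough sketch below. It is computed only up to the normalising constant, which cancels when comparing subsets against the same data, bandwidth and degrees of freedom; the function name and the default of three degrees of freedom are assumptions for illustration:

```python
# Rough sketch of a fat-tailed (multivariate Student-t style) kernel KDE,
# evaluated up to an additive constant on the log scale.  Suitable for
# ranking subsets against the same X, bandwidth and df, but not a
# normalised density.
import numpy as np
from scipy.special import logsumexp

def mean_log_density_t(Y, X, bandwidth, df=3.0):
    """Mean (unnormalised) log-density of the rows of Y under a Student-t KDE built on the rows of X."""
    m = X.shape[1]
    # pairwise squared distances between new points and sample points, scaled by the bandwidth
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / bandwidth**2
    # log of the unnormalised multivariate-t kernel, one row per new point
    log_k = -(df + m) / 2.0 * np.log1p(d2 / df)
    return logsumexp(log_k, axis=1).mean()      # normalising constants omitted

# e.g. mean_log_density_t(subsets["A"], X, bandwidth=kde.bandwidth)
```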

Ben
  • In such a high dimensional space, wouldn't a distance such as $||x-x_i||$ fail to distinguish between points [as explained here](https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions)? I understand that here, the notion of "dense" hasn't been defined by OP, but I'm not sure if such a notion can exist given the dimension in the question. – adityar Nov 05 '18 at 12:07
  • @InfProbSciX: I have not used that distance metric in this answer, and the method is broad enough to accommodate a range of possible measures of distance between the vectors. Following your comment, I have added an additional paragraph in the answer to note this issue, and note the generality of the proposed method. – Ben Nov 07 '18 at 00:23
  • I had just assumed that $K((x - x_i)/\lambda)$ would depend on the distance in some fashion (which is the case with the normal density/squared exponential), but regardless, this is an interesting answer. I'd love to do some simulations to see its effectiveness. – adityar Nov 07 '18 at 00:30
  • @InfProbSciX: Yeah, it is an interesting problem. It would be interesting to see the computational requirements of this method, when you have a large number of points in high dimensions. – Ben Nov 07 '18 at 00:33
  • @Ben Which kernel function should be used? Are there any other methods as well? If you could point to them too, that would be great! Thanks :) – Rafael Nov 07 '18 at 17:15
  • @Rafael: The method should be fairly robust to different kernels, so I think that decision would be made mostly on the grounds of computational simplicity (where my expertise unfortunately gets very thin). You could start by trying a Gaussian kernel, but as InfProbSciX points out, this might lead to some difficulties in high dimensional spaces. Personally, I would try a couple of different kernels, see if the problem is computable, and see how robust your answer is to those choices. – Ben Nov 07 '18 at 22:38
  • @Ben Thanks! If I want to adapt my solution based on a distance metric, then what distance metric would be suitable? You mentioned a "fat-tailed" distribution as the kernel; any particular such distribution? – Rafael Nov 07 '18 at 22:47
  • @Rafael: Well, as you can see above, the method I suggest gives you terms $K((\mathbf{y}_k - \mathbf{x}_i)/\hat{\lambda})$, which are effectively distance terms between the vectors. Any kernel you use will lead to an implicit distance measure of this form. If you use a "fat tailed" kernel then the distance measure will penalise large deviations less than if you use a thin-tailed kernel. – Ben Nov 07 '18 at 23:17
  • @Rafael: You could probably also work backwards - specify a desired distance metric and set this as that kernel value (best to still include a bandwidth term). – Ben Nov 07 '18 at 23:19