I'm looking for a metric of dispersion for data in a high-dimensional space (say, 70 or so components). The idea is that my data has several groups: I suspect some groups are similar within some or many of the components (low dispersion), while other groups are quite varied (high dispersion). I also suspect that if I calculate this dispersion metric on all of the data, it will be greater than the value for any one group.

Dispersion statistics are typically defined (as far as I can tell) for a single component/variable: things like standard deviation, IQR and MAD. I can calculate these per component, and I have tried to come up with a single metric by taking the average or the square root of the sum, but the results I see don't match my intuitions.
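A minimal sketch of the per-component approach described above, on synthetic data standing in for the real 70-component table. The data, the variable names, and the two ways of collapsing the per-component values into one number are assumptions for illustration; the question does not state exactly which aggregation was used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical): 200 rows, 70 components,
# each component given a different scale.
rng = np.random.default_rng(0)
scales = rng.uniform(0.5, 3.0, size=70)
df = pd.DataFrame(rng.standard_normal((200, 70)) * scales)

# Per-component dispersion statistics (one value per column).
std = df.std()                                 # sample standard deviation (ddof=1)
iqr = df.quantile(0.75) - df.quantile(0.25)    # interquartile range
mad = (df - df.median()).abs().median()        # median absolute deviation, unscaled

# Two simple ways to collapse 70 values into a single number.
mean_std = std.mean()                          # average per-component spread
root_total_var = np.sqrt(df.var().sum())       # square root of the summed variances

# The second aggregate has a geometric reading: it equals the root-mean-square
# Euclidean distance of the rows from their centroid (with the n - 1 divisor).
sq_dist = ((df - df.mean()) ** 2).sum(axis=1)
rms_dist = np.sqrt(sq_dist.sum() / (len(df) - 1))

print(mean_std, root_total_var, rms_dist)
```

Both aggregates depend on the units of each component, so components on larger scales dominate unless the columns are standardized first.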

I don't expect any one component to be Gaussian, or to follow any particular distribution at all, and I certainly don't expect the full 70 dimensions to be multivariate Gaussian.

What statistics are available to describe such dispersion? If possible, examples calculated in Python/pandas are appreciated.

kjetil b halvorsen
pixels
  • Are you interested in / trying to cluster your data? Do you just want a multi-dimensional measure of variance? – gung - Reinstate Monica Aug 12 '19 at 19:44
  • Just a multi-dimensional measure of variance would be valuable. – pixels Aug 12 '19 at 20:48
  • Related: [A measure of overall variance from multivariate Gaussian](https://stats.stackexchange.com/q/50389/), & [What does Determinant of Covariance Matrix give?](https://stats.stackexchange.com/q/110955/) – gung - Reinstate Monica Aug 12 '19 at 21:08
  • Interesting. The determinant of the covariance of my data is exceptionally small (1e-200 and less). The relative scales do match my intuitions, however. – pixels Aug 13 '19 at 16:11
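
A determinant of order 1e-200 is close to the limits of double precision, so one option is to work with its logarithm instead. The sketch below is an illustration on synthetic data, not a computation on the data discussed in this thread: `df`, the 0.01 scale, and the choice of summaries are assumptions. It uses `numpy.linalg.slogdet` to get the log-determinant of the covariance matrix, and compares it with the total variance (the trace).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical): 500 rows, 70 components on a small scale,
# so the covariance determinant is extremely small.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((500, 70)) * 0.01)

cov = df.cov().to_numpy()      # 70 x 70 sample covariance matrix
p = cov.shape[0]

# slogdet returns the sign and the log of the absolute determinant,
# which avoids underflow in the determinant itself.
sign, logdet = np.linalg.slogdet(cov)

total_var = np.trace(cov)              # sum of the per-component variances
geo_mean_var = np.exp(logdet / p)      # geometric mean of the covariance eigenvalues

print(sign, logdet, total_var, geo_mean_var)
```

The per-dimension value `exp(logdet / p)` is on the scale of a single variance, which makes it easier to compare across groups than the raw determinant. If a group has fewer rows than components, the sample covariance matrix is singular and the log-determinant is not usable without some regularization.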

0 Answers