I have data (~50,000 data points) consisting of measurements of two variables.

[Image: scatterplots of the two samples]

I want to quantify how "spread out" the scatterplot of each sample is, i.e. the variance across both dimensions jointly. You can see that the scatter of Sample 2 is broader. If I simply compare the univariate variances, I may not see any difference between the two samples, because each univariate variance only describes the projection of the points onto one axis. For context, the data points denote the activity of a protein in two different tasks, so each dot is one combination of the two traits. I want to know how diverse these combinations are, i.e. the area of "activity space" covered by each sample.

This post, A measure of "variance" from the covariance matrix?, suggests different metrics such as trace or the kth root of the determinant of the covariance matrix. I was also considering using the determinant (product of eigenvalues) as it would somewhat represent the total area covered by the data.
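
As a minimal sketch of the candidate measures mentioned above, the snippet below computes the trace, the determinant, and the kth root of the determinant of the sample covariance matrix. The function name `spread_measures` and the synthetic samples are assumptions made for illustration, not my real data. The two synthetic samples have the same univariate variances but different correlation, so only the determinant-based measures tell them apart.

```python
import numpy as np

def spread_measures(points):
    """Return scalar summaries of the covariance matrix of an (n, k) array."""
    points = np.asarray(points, dtype=float)
    k = points.shape[1]
    cov = np.cov(points, rowvar=False)   # k x k sample covariance (ddof=1)
    eigvals = np.linalg.eigvalsh(cov)    # eigenvalues of the symmetric matrix
    det = float(np.prod(eigvals))        # determinant = product of eigenvalues
    return {
        "trace": float(np.sum(eigvals)),  # total variance
        "det": det,                       # generalized variance
        "det_root": det ** (1.0 / k),     # kth root, in variance units
    }

# Synthetic stand-in data (hypothetical): equal marginal variances,
# strong correlation in sample_1, weak correlation in sample_2.
rng = np.random.default_rng(0)
sample_1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
sample_2 = rng.multivariate_normal([0, 0], [[1.0, 0.2], [0.2, 1.0]], size=5000)

print(spread_measures(sample_1))
print(spread_measures(sample_2))
```

For two variables, the square root of the determinant is proportional to the area of the covariance ellipse, which is why it seems closer to "area covered" than the trace.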

If I do use $|\Sigma|^{1/k}$ (with $k = 2$ here), what would be an appropriate statistical test to compare the two samples (analogous to the F-test for univariate variances)?
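
One generic option I can think of is a permutation test on the log ratio of the kth root of the determinant. The sketch below is an assumption on my part rather than an established analogue of the F-test. The names `generalized_variance` and `permutation_test_spread` are hypothetical, each sample is centered on its own mean before pooling, and the test relies on the centered points being exchangeable under the null hypothesis.

```python
import numpy as np

def generalized_variance(points):
    """kth root of the determinant of the sample covariance matrix."""
    cov = np.cov(points, rowvar=False)
    k = cov.shape[0]
    return np.linalg.det(cov) ** (1.0 / k)

def permutation_test_spread(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in generalized variance.

    Returns the observed log ratio (b over a) and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each sample so that a difference in means is not
    # mistaken for a difference in spread after pooling.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    observed = np.log(generalized_variance(b) / generalized_variance(a))

    pooled = np.vstack([a, b])
    n_a = len(a)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        perm_a = pooled[idx[:n_a]]
        perm_b = pooled[idx[n_a:]]
        stat = np.log(generalized_variance(perm_b) / generalized_variance(perm_a))
        if abs(stat) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic stand-in data (hypothetical): sample_2 has twice the spread.
rng = np.random.default_rng(1)
sample_1 = rng.normal(size=(500, 2))
sample_2 = 2.0 * rng.normal(size=(500, 2))
log_ratio, p = permutation_test_spread(sample_1, sample_2, n_perm=500)
print(log_ratio, p)
```

With ~50,000 points per sample, even a small difference in spread is likely to come out as significant, so the size of the log ratio may matter more than the p-value.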

WYSIWYG
  • How do you intend to interpret or use your measure of "spread"? That ought to determine the answer. Anything else would just be abstract mathematics, which may be interesting but could be useless or misleading. – whuber Jul 25 '19 at 12:36
  • @whuber The two variables denote two different "activities" of a protein, which correspond to two different traits. There is some correlation between the two activities. The spread would tell me how diverse the traits are in a given population. Perhaps I can add a picture to explain it properly. I'll do that. – WYSIWYG Jul 25 '19 at 12:42
  • This context admits many possible solutions. For instance, after constructing a numerical measure of *difference* between the two traits within any individual, you could express the population diversity by means of any appropriate summary statistic of those differences, such as their standard deviation, variance, IQR, etc. That already gives you two very large families of choices (difference metric and summary statistic). Please, then, provide some information to select good options within those families. – whuber Jul 25 '19 at 12:47
  • @whuber I edited the question. Perhaps it is clearer now. In this case I just want to see how diverse the combinations of traits are. – WYSIWYG Jul 25 '19 at 13:02
  • "Diverse" has myriad meanings and a great many possibilities for quantitative expression. It would help for you to clarify what you mean by "diverse." – whuber Jul 25 '19 at 15:34
  • @whuber I consider each point in the data as a possible combination of traits (activities). Therefore, the diversity in this case would be the area of the activity space covered by each sample. – WYSIWYG Jul 25 '19 at 16:23
