Select ellipsoid data cloud "equivalent" to a spherical data cloud

Question

What would your considerations be when you have to realize the intuitively plain, but on reflection manifold, idea of creating a data cloud that is "the same as" a given data cloud, except that it is not spherical in shape but ellipsoid (or perhaps just spheroid)?

For there may be more than one ellipsoid candidate, each "equivalent" to the sphere in some one aspect but not in all. I illustrate this below (read only if you've got popcorn, otherwise skip to the pic).

For simplicity, let's consider specifically the normal distribution and the 2d data case. So, generate a spherical data cloud, centered, with s.d.=1; we'll treat the s.d. as the "radius" of the cloud, $r=1$.

Choose the shape of the ellipse you want, its $k=r_1/r_2$ ratio: $r_1$ is the longer radius and $r_2$ is the shorter radius (they actually are equal, respectively, to the two singular values of the elliptic data divided by the square root of the sample size).

OK, let $k=4$, and turn our spherical cloud into an elliptic one of that shape by proportional stretching-squeezing: $r_1=r\sqrt{k}=2$, $r_2=r/\sqrt{k}=.5$, so that $r_1/r_2=k$. This will be our ellipse1. It is "equivalent" to the spherical cloud in that it inherited its area, i.e. its 2d volume. Indeed, the volume inside the sphere is $\pi r^2=1\pi$ and the volume inside the ellipse is $\pi r_1r_2=1\pi$. (Under any $k$, the proportional stretching-squeezing will yield an ellipse equivolumetric with the parental circle. The volume of a data cloud is the product of its singular values, i.e. the sq. root of the determinant of the cov. matrix.)
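The stretching-squeezing step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are made up here, the stretch factor is taken as $\sqrt{k}$ (which gives the $2$ and $.5$ used above for $k=4$), and the covariance is computed with divisor $n$.

```python
import math
import random

def spherical_cloud(n, seed=0):
    """Draw n points from a centered 2d normal with s.d. 1 on each axis."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

def stretch(points, k):
    """Stretch x by sqrt(k) and squeeze y by sqrt(k), so that r1/r2 = k."""
    s = math.sqrt(k)
    return [(x * s, y / s) for x, y in points]

def cov_det(points):
    """Determinant of the 2x2 covariance matrix (divisor n, about the mean)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sxx * syy - sxy ** 2

sphere = spherical_cloud(5000)
ellipse1 = stretch(sphere, k=4)

# The two determinants (squared "volumes") agree up to rounding error.
print(cov_det(sphere), cov_det(ellipse1))
```

The determinant is preserved because the two axis factors multiply to $1$, whatever the sample happens to be.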

But what about some other important statistical/geometrical properties of a data cloud, besides the volume?

Multivariate sum of variances (sum of eigenvalues): $r_1^2+r_2^2$. (This quantity is important in euclidean-based statistics because - due to the Pythagorean theorem - it equals the mean squared euclidean distance from the data points to the centroid.) In ellipse1, it is $2^2+.5^2=4.25$, bigger than $2$, that of the sphere. We have to shrink ellipse1 by a factor of approximately $.686$ (under $k=4$) in order to get ellipse2 data with the same overall variance as the spherical cloud's: $(2 \cdot .686)^2+(.5 \cdot .686)^2 \approx 2$.
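The shrink factor for equal overall variance follows from solving $c^2(r_1^2+r_2^2)=2r^2$ for $c$. A minimal sketch, assuming $r_1=r\sqrt{k}$ and $r_2=r/\sqrt{k}$; the function name is hypothetical:

```python
import math

def variance_matching_scale(k, r=1.0):
    """Factor c such that (c*r1)^2 + (c*r2)^2 equals 2*r^2 of the sphere."""
    r1 = r * math.sqrt(k)  # longer radius (assumed sqrt(k) stretch)
    r2 = r / math.sqrt(k)  # shorter radius
    return math.sqrt(2 * r ** 2 / (r1 ** 2 + r2 ** 2))

c2 = variance_matching_scale(4)
print(round(c2, 3))                      # shrink factor for ellipse2
print((2 * c2) ** 2 + (0.5 * c2) ** 2)   # total variance after shrinking
```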

Multivariate sum of principal st. deviations (sum of singular values): $r_1+r_2$. In ellipse1, it is $2+.5=2.5$, bigger than $2$, that of the sphere. We have to shrink ellipse1 by a factor of $.8$ (under $k=4$) in order to get ellipse3 data with the same overall st. deviation as the spherical cloud's: $(2 \cdot .8)+(.5 \cdot .8) =2$.
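For a general shape ratio $k$ the same factor can be written in closed form. The symbol $c_3$ is introduced here only for illustration, and the radii are assumed to be $r_1=r\sqrt{k}$ and $r_2=r/\sqrt{k}$, consistent with the $k=4$ numbers above:

$$c_3\,(r_1+r_2)=2r \;\Rightarrow\; c_3=\frac{2r}{r\sqrt{k}+r/\sqrt{k}}=\frac{2\sqrt{k}}{k+1},\qquad c_3\Big|_{k=4}=\frac{4}{5}=.8$$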

Cloud's circumference (surface area): $2\pi r$ for the circle and (a good approximate formula) $\pi[3(r_1+r_2)-\sqrt{(3r_1+r_2)(r_1+3r_2)}]$ for the ellipse. It amounts to $2.730\pi$ in ellipse1, bigger than $2\pi$, that of the sphere. We have to shrink ellipse1 by a factor of approximately $.733$ (under $k=4$) in order to get ellipse4 data with the same overall surface as the spherical cloud's.
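The approximate formula quoted above (it appears to be Ramanujan's first approximation for the perimeter of an ellipse) is easy to evaluate, and since the perimeter scales linearly with size, the shrink factor is just the ratio of the two perimeters. A sketch with hypothetical function names, again assuming $r_1=r\sqrt{k}$, $r_2=r/\sqrt{k}$:

```python
import math

def ellipse_perimeter(r1, r2):
    """Approximate perimeter: pi * [3(r1+r2) - sqrt((3r1+r2)(r1+3r2))]."""
    return math.pi * (3 * (r1 + r2) - math.sqrt((3 * r1 + r2) * (r1 + 3 * r2)))

def perimeter_matching_scale(k, r=1.0):
    """Factor c such that the shrunken ellipse has perimeter 2*pi*r."""
    r1 = r * math.sqrt(k)
    r2 = r / math.sqrt(k)
    return 2 * math.pi * r / ellipse_perimeter(r1, r2)

print(ellipse_perimeter(2, 0.5) / math.pi)  # perimeter of ellipse1 in units of pi
print(perimeter_matching_scale(4))          # shrink factor for ellipse4
```

For $r_1=r_2=r$ the formula reduces exactly to $2\pi r$, so the factor is $1$ when $k=1$.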

And so, we have 4 same-shape but different-size ellipse data clouds, each of which is equivalent to the spherical data cloud in some particular aspect, not in all aspects at once.

And my silly question is, as put at the beginning: in your simulation practice, when would you prefer ellipsoid1 / ellipsoid2 / ellipsoid3 / ellipsoid4 / some ellipsoidX as "equivalent by properties" to a given spherical data cloud? To hint at an example, you might be exploring how some multivariate statistical analysis, machine learning technique, or multivariate index/statistic behaves towards (a collection of) spherical vs ellipsoid data cloud(s), i.e. its reaction or sensitivity to sheer shape. That means you need data sets which differ only with respect to shape - i.e. the coefficient $k$ ($1$ for spherical and some selected value $>1$ for elliptical) - with "all other properties kept equal", to put it intuitively. But we saw (I illustrated it) that "all properties" cannot be equal; we have to choose which. For examining what kinds of "statistical/learning techniques" or "indices/statistics" would you choose this or that property of the cloud to be equal between a sphere and an ellipsoid (normal distribution, but not necessarily 2d data)? Please share your reasoning, intuitive or rigorous as you like.

(P.S. In all four ellipses above there is one property, though, which is identical to that of the sphere: the Mahalanobis distances. Therefore procedures based on the Mahalanobis distances or their sum will be blind to the dissimilarity between the five clouds. But the Mahalanobis distance isn't a very interesting issue for my question since, by definition, it just "removes" any elliptical shape of a data cloud.)
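The invariance of the Mahalanobis distances can be checked numerically. The sketch below is illustrative only: the helper name and the axis factors are made up for the example, the distances are taken from each point to the cloud's own centroid, and the covariance uses divisor $n$.

```python
import math
import random

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2d point to the cloud's centroid."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^{-1} [dx dy]', with the 2x2 inverse written out
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

rng = random.Random(1)
sphere = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]

# Stretch to k = 4, then shrink by an arbitrary factor (here .8, as for ellipse3).
ellipse = [(x * 2 * 0.8, y * 0.5 * 0.8) for x, y in sphere]

d_sphere = mahalanobis_sq(sphere)
d_ellipse = mahalanobis_sq(ellipse)
print(max(abs(a - b) for a, b in zip(d_sphere, d_ellipse)))  # rounding-level difference
```

Any per-axis rescaling cancels between the squared deviations and the inverse covariance, which is why all five clouds share the same distances.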
