
I am an astronomer studying galaxies. I have observations of about 500 galaxies. For each galaxy, I can measure two quantities (call them $X$ and $Y$).

I want to compare my observations to some theoretical (physical, not statistical) models. I have about 10,000 numerical models of all kinds of galaxies (describing their size, structure, and observable properties), some of which are quite different than the ones I am studying.

I want to "thin down" the sample of theoretical models until the distributions of $X$ and $Y$ across all the models look like the observed distributions of $X$ and $Y$. Is there a good algorithm to do this?

rhombidodecahedron
  • Why would you want to do this? What would you learn? – jbowman Jan 08 '19 at 14:55
  • @jbowman My telescope is not sensitive to certain things, and so the sample of galaxies I have observed is not representative of the true sample of galaxies in the Universe. In astronomy we refer to this as the selection function. Galactic studies actually *usually* perform this step, but manually and often in some ad hoc way. So I was just wondering if the statisticians have a better way of doing it... – rhombidodecahedron Jan 08 '19 at 14:57
  • So you are keeping all 500 galaxies and only making reductions in the set of 10,000 models? – rolando2 Jan 08 '19 at 14:58
  • @rolando2 correct – rhombidodecahedron Jan 08 '19 at 14:58
  • @jbowman ...and to put another spin on it, perhaps think about it this way: the set of models represents all the theoretical possibilities (regardless of their physical feasibility or actual existence), and the set of observations represents what nature actually produces. So reducing the models to the observable sphere gives us a way of understanding where in this giant parameter space our universe actually produces galaxies. – rhombidodecahedron Jan 08 '19 at 15:00
  • Another Q: When you say "the observed distribution*s* of X and Y", do you mean the *single* X-Y distribution that describes all 500 galaxies? That is, you want to find which of the 10,000 models best fit that single observed bivariate distribution? – rolando2 Jan 08 '19 at 15:02
  • @rolando2 sorry, no, that is not what I mean. I want to find the subset of the 10,000 models such that when I plot the histograms of X and Y for the entire subset, they match (reasonably well) the histograms of X and Y for the 500 observed galaxies (... modulo normalizations, etc) – rhombidodecahedron Jan 08 '19 at 15:05
  • Does each of your models generate a complete histogram or only a single value of X and Y? – jbowman Jan 08 '19 at 15:13
  • @jbowman only a single value per model (and per galaxy) – rhombidodecahedron Jan 08 '19 at 15:38
  • Maybe you should edit the question to clarify exactly what a "model" is. In Statistics the term "model" usually refers to a probability distribution that generates data. – Bridgeburners Jan 08 '19 at 16:39
  • @Bridgeburners thanks for the suggestion, I have edited the text. (I am referring to a physical model, not a statistical model.) – rhombidodecahedron Jan 09 '19 at 08:59
  • Couldn't you view your set of physical models as describing (more or less) a space of *statistical* models in which observational error is incorporated? That would make your situation amenable to standard approaches, of which the most appealing might be Maximum Likelihood: simply compute the likelihood of your data for each of your models, identify the one that produces the largest likelihood, and select those whose likelihoods are reasonably close to the largest one (using the usual chi-squared theory). – whuber Jan 09 '19 at 14:46 (a sketch of this idea appears after these comments)
  • @whuber That is certainly something someone could do with these models, but I do not think it is what I am after. I am not looking for the model that is the best fit to any particular observation. I want to find the subset of models that belong to the same population that these observations were sampled from. – rhombidodecahedron Jan 09 '19 at 16:08
  • I'm having a very hard time making sense of that request, because I understand all models to differ: no more than one can "belong to the same population." Regardless, my suggestion was not to select the *best* fit, but to select a *subset* of models that all fit reasonably well *compared to the best fit.* – whuber Jan 09 '19 at 16:24
  • @whuber Two samples, despite differing, can be drawn from the same population. I think we can agree on that. You and I are both samples from the population of Earth humans. Each model can be thought of as a sample. I think the word "model" might be overly confusing here, which I did not foresee. Each galaxy model is not itself a statistical distribution or anything like that; it's more like a mock observation/sample. – rhombidodecahedron Jan 10 '19 at 11:47
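
To make the likelihood-based suggestion above concrete, here is a minimal R sketch (not part of the original thread). It assumes each physical model predicts a single (X, Y) pair and that observations scatter around a model's prediction with known Gaussian errors; the placeholder data and the error values sigma_x and sigma_y are illustrative assumptions, not quantities from the discussion.

    ## Sketch only: placeholder observations, placeholder model predictions,
    ## and an assumed Gaussian observational-error model.
    set.seed(1)
    obs    <- data.frame(X = rnorm(500), Y = rnorm(500))                    # observed galaxies (placeholder)
    models <- data.frame(X = runif(10000, -3, 3), Y = runif(10000, -3, 3))  # model predictions (placeholder)
    sigma_x <- 0.5   # assumed measurement error on X
    sigma_y <- 0.5   # assumed measurement error on Y

    ## Log-likelihood of the full observed sample under each model
    loglik <- vapply(seq_len(nrow(models)), function(i) {
      sum(dnorm(obs$X, mean = models$X[i], sd = sigma_x, log = TRUE)) +
        sum(dnorm(obs$Y, mean = models$Y[i], sd = sigma_y, log = TRUE))
    }, numeric(1))

    ## Keep every model whose likelihood is "reasonably close" to the best one,
    ## using the usual chi-squared cutoff for two free parameters.
    cutoff <- qchisq(0.95, df = 2) / 2
    kept   <- models[max(loglik) - loglik <= cutoff, ]
    nrow(kept)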

1 Answer


There are many statistical approaches to gauging the match between observed and theoretical distributions. With thousands of distributions to evaluate, though, you will need to devise an automated procedure to be run as many times as there are models you want to test. If you use R, or are willing to learn it, this should be workable, perhaps with some advice from an expert.

To get you started in this effort, you could try the descdist() function in the R package fitdistrplus, which helps characterize a distribution. For a good, detailed example of how it can be used, see this answer by COOLSerdash. It draws on the Cullen and Frey graphing technique as well as the more commonly used Kolmogorov-Smirnov test [available in base R via ks.test()].

A briefer introduction to descdist(), and perhaps another good place to start, is bill_080's answer.
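
To make this concrete, here is a minimal R sketch using made-up placeholder data (the object names obs_X and model_X, and the subset size, are illustrative): descdist() characterizes the observed distribution via a Cullen and Frey graph, and a two-sample ks.test() compares the observed X values with those of a candidate subset of models. The same comparison would be repeated for Y.

    ## Sketch only: placeholder data standing in for the observed and
    ## model-predicted values of X.
    library(fitdistrplus)

    set.seed(1)
    obs_X   <- rnorm(500)     # observed X values (placeholder)
    model_X <- rnorm(10000)   # X values predicted by the models (placeholder)

    ## Characterize the observed distribution (Cullen and Frey graph)
    descdist(obs_X, boot = 500)

    ## Two-sample Kolmogorov-Smirnov test: is the X distribution of a
    ## candidate subset of models compatible with the observations?
    subset_idx <- sample(seq_along(model_X), 500)
    ks.test(obs_X, model_X[subset_idx])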

rolando2
  • Hi @rolando2, thanks for your answer. Yes, I am happy to program in R. I am familiar with the Kolmogorov-Smirnov test and others, but (as far as I know) what they will tell me is whether the distributions match or not (sorry if I am "handwaving" this, but you know what I mean). What I am wondering is how to select a subset such that the KS test (or another test) says that they match. – rhombidodecahedron Jan 08 '19 at 15:40
  • One algorithm I can think of is: while (True), select a random subset of models of random size, compute the KS test, and terminate if they match to within a tolerance; otherwise repeat. However, this algorithm may never terminate. – rhombidodecahedron Jan 08 '19 at 15:42 (see the sketch below these comments)
  • Even if this algorithm terminates, the resulting subset is not necessarily a draw from the distribution of interest. – Xi'an Apr 16 '19 at 06:58
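
For completeness, here is a minimal R sketch (not from the thread) of the trial-and-error search described in the comment above, with placeholder data, an illustrative acceptance threshold, and an iteration cap so it cannot loop forever. As noted, even when it stops, the accepted subset is not guaranteed to behave like a draw from the distribution of interest.

    ## Sketch only: placeholder data, illustrative threshold, and a hard cap
    ## on iterations; accepted stays NULL if no subset passes within the cap.
    set.seed(1)
    obs_X   <- rnorm(500)                            # observed X values (placeholder)
    model_X <- rnorm(10000, mean = 0.2, sd = 1.2)    # deliberately offset placeholder models

    max_iter <- 10000
    alpha    <- 0.5        # accept when the KS p-value exceeds this (illustrative)
    accepted <- NULL

    for (i in seq_len(max_iter)) {
      k   <- sample(100:2000, 1)                     # random subset size
      idx <- sample(seq_along(model_X), k)
      p   <- suppressWarnings(ks.test(obs_X, model_X[idx])$p.value)
      if (p > alpha) { accepted <- idx; break }      # in practice, test Y as well
    }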