6

I have thousands of subjects, and for each of them I have a fitted gamma distribution, with the parameters estimated from that subject's data. It is easy to look at the fit for one subject (say QQ-plots, etc.) to get an idea of how good the fit is. But how can I do this on a large scale, for all subjects?

user765195
  • When you say 'test' do you mean 'assess' (as in 'how far from gamma does this seem to be?') or do you intend some hypothesis test (which I'd advise against ... and doubly so if you're doing a bunch of them). – Glen_b Mar 24 '15 at 01:20
  • @Glen_b, Yes, I mean assess. Thanks for pointing it out. – user765195 Mar 24 '15 at 01:24
  • Are you talking about a single measure over all subjects or one per subject? – Glen_b Mar 24 '15 at 01:24
  • Ideally, one per subject, but I can live with a single measure. – user765195 Mar 24 '15 at 01:27
  • Could you not use a $\chi^2$ goodness of fit test per subject? – Chris C Mar 24 '15 at 01:39
  • @ChrisC Isn't $\chi^2$ for discrete random variables? Here we have a continuous distribution. – user765195 Mar 24 '15 at 01:47
  • $\chi^2$ is used for continuous variables too by using binning. You can apply KS-statistics or similar measures, even AIC (assuming you fit distributions with MLE). – Aksakal Mar 24 '15 at 01:51
  • You can't really use KS here, because the parameters for the null distribution need to be pre-specified in KS. You can generate the null distribution using Monte Carlo, but that's something that I am trying to avoid. – user765195 Mar 24 '15 at 02:03
  • @ChrisC Chi-square testing is the wrong way to go here, not because it cannot be applied, but because it entails binning and does not use all the information in the data (notably neglect of even the order of the bins). It's an embarrassment that there are still texts recommending this approach for examining the fit of continuous distributions. – Nick Cox Mar 24 '15 at 11:44
  • @NickCox This is very useful information, thank you for letting me know. I had unfortunately read one of those books! – Chris C Mar 24 '15 at 12:02

2 Answers

5

I'll suggest using representative plots. Pull 16 or 20 subjects and show their QQ-plots in a 4x4 or 4x5 grid of charts. Sometimes you can plot several subjects in the same plot. This doesn't substitute for other ways of representing the fits, but on the other hand I don't think you can avoid this step either. It's used a lot in panel (longitudinal) data analysis. You really need to see the representative plots.
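
For instance, here is a minimal R sketch, assuming the raw data sit in a named list `subject_data` and the fitted parameters in a named list `fits`, where `fits[[id]]` is `c(shape = ..., rate = ...)`; all of these names are placeholders, not from the question.

```r
set.seed(1)
sample_ids <- sample(names(subject_data), 16)  # pull 16 subjects at random

op <- par(mfrow = c(4, 4), mar = c(3, 3, 2, 1))
for (id in sample_ids) {
  x <- sort(subject_data[[id]])
  # theoretical quantiles from that subject's fitted gamma
  q_theor <- qgamma(ppoints(length(x)),
                    shape = fits[[id]]["shape"], rate = fits[[id]]["rate"])
  plot(q_theor, x, main = id,
       xlab = "Theoretical quantiles", ylab = "Sample quantiles")
  abline(0, 1, col = "red")  # reference line for a perfect fit
}
par(op)
```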

See Fig. 12-1.3 in this book. It's not about distributions, but it's the same idea: show sample plots for subjects.

You can get fancy and draw 3D plots, of course, or contour plots where the x-axis is the subject, but these are sometimes hard to analyze visually. They may reveal important patterns, though.

UPDATE: You can also show the histogram of Kolmogorov-Smirnov statistics. It's true that the critical values are expensive to compute, but the statistic itself is easy to compute. So you can obtain the KS statistic for each subject and show the histogram of the obtained values. This will give you a great visual cue as to how well the gamma distribution fits in general. It's almost like bootstrapping.
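
A rough sketch, under the same hypothetical `subject_data` and `fits` objects as above. Note that the p-values reported by `ks.test()` are not valid here, because the parameters were estimated from the same data; only the D statistic is kept as a descriptive measure of misfit.

```r
ks_stat <- sapply(names(subject_data), function(id) {
  # KS distance between the empirical CDF and the fitted gamma CDF
  ks.test(subject_data[[id]], "pgamma",
          shape = fits[[id]]["shape"], rate = fits[[id]]["rate"])$statistic
})

hist(ks_stat, breaks = 50,
     main = "Per-subject KS statistics vs fitted gamma",
     xlab = "KS statistic D")
```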

Aksakal
  • Agreed with the idea of sampling: what could be more statistical? You could also (1) look at the distributions of parameter estimates (two each?) and check that any extremes make sense in terms of the original data (2) automate corr(observed quantiles, expected quantiles) and check very poor fits (and extremely good ones!) (see the sketch after these comments). – Nick Cox Mar 24 '15 at 02:20
  • @NickCox, I updated the answer with histogram of KS-stats. – Aksakal Mar 24 '15 at 12:48
  • I agree with Rupert Miller on K-S statistics. See p.9 in http://www.stata.com/manuals13/rdiagnosticplots.pdf for a quotation. His context was normality testing, but I think the point carries over here. On K-S testing, regardless of that view, newcomers need the emphasis that the standard test needs modification when parameters are estimated from the data. – Nick Cox Mar 24 '15 at 12:59
  • A variant of this idea is to make confidence bands around the empirical distribution function, based on the KS statistic (ecdf.ksCI in package sfsmisc), and then plot the fitted gamma distributions to see if they generally fall within the bands. – kjetil b halvorsen Mar 24 '15 at 15:03
  • The most practical, and really the best answer, for me is the first comment by @NickCox, but since he didn't put his comment as an answer, and he approves of this answer, I am accepting it as the correct answer. – user765195 Mar 26 '15 at 23:07
  • @user765195 Thanks for the remark, and that's fine by me. – Nick Cox Mar 26 '15 at 23:09
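
Here is a minimal R sketch of the quantile-correlation idea from Nick Cox's first comment, using the same hypothetical `subject_data` and `fits` objects as in the answer's sketches above (none of these names come from the original post).

```r
qq_corr <- sapply(names(subject_data), function(id) {
  x <- sort(subject_data[[id]])
  q_theor <- qgamma(ppoints(length(x)),
                    shape = fits[[id]]["shape"], rate = fits[[id]]["rate"])
  cor(x, q_theor)  # probability-plot correlation; values near 1 indicate a good fit
})

## flag the poorest fits (and the suspiciously perfect ones) for inspection
head(sort(qq_corr), 10)
tail(sort(qq_corr), 10)
```
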
4

I hope that I understood your situation and the question correctly. Given the number of distributions in your data set, visual exploratory approaches (such as the QQ plots you mentioned) are not feasible in this case. Therefore, you have to resort to analytical approaches, such as goodness-of-fit (GoF) tests, as some have already mentioned in the comments above.

Since you have mentioned that the distribution parameters are estimated from the data, I assume that you have used or plan to use one of the standard distribution fitting approaches. One of the most popular (along with least squares, to a lesser degree) is maximum likelihood estimation (MLE), which is generally easy to perform, for example using the function fitdistr() from the R package MASS. However, depending on your particular data, fitting via fitdistr() might not be so trivial. Some people prefer the R package fitdistrplus, as they consider it more advanced or useful.
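
For example, here is a minimal sketch of per-subject MLE fitting with fitdistr(), assuming the raw data are stored as a named list `subject_data` of positive numeric vectors (the name is a placeholder, not from your post).

```r
library(MASS)

fits <- lapply(subject_data, function(x) {
  # fitdistr() can fail to converge on awkward samples, so wrap it in try()
  fit <- try(fitdistr(x, densfun = "gamma"), silent = TRUE)
  if (inherits(fit, "try-error")) NULL else fit$estimate  # c(shape, rate)
})
```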

After this straightforward step, you need to validate the estimation results, using one or more of the following GoF tests for continuous data (considering their pros and cons): chi-square (via binning), Kolmogorov-Smirnov (via corrected tables for critical values or Monte Carlo simulation, which I'm listing here just for completeness, as you are trying to avoid it), Anderson-Darling, Lilliefors, Cramér–von Mises and Watson. In terms of performance, the problem reduces to performing a relatively large number of non-parametric GoF tests, which IMHO is achievable either by running it on more powerful hardware (e.g., renting an Amazon EC2 instance) or by parallelizing the code.
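
As a rough sketch of the parallelization idea with base R's parallel package, reusing the hypothetical `subject_data` and `fits` objects from the snippet above (and keeping in mind that the tabulated critical values do not apply when the parameters are estimated from the data, so these statistics are best read as descriptive measures of misfit):

```r
library(parallel)

## drop subjects whose fit failed, then compute a KS statistic per subject;
## mclapply() forks, so on Windows use mc.cores = 1 or switch to parLapply()
ids <- names(fits)[!sapply(fits, is.null)]
ks_stat <- unlist(mclapply(ids, function(id) {
  ks.test(subject_data[[id]], "pgamma",
          shape = fits[[id]]["shape"], rate = fits[[id]]["rate"])$statistic
}, mc.cores = max(1, detectCores() - 1)))
names(ks_stat) <- ids
```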

Returning to the essence of your question, my suggestion for possible approaches is to aggregate the results either via bootstrapping (similar to the approach presented in this excellent answer) or via some kind of averaging, similar to ensemble methods (for example, take a look at this research paper).

Aleksandr Blekh
  • I disagree partly with the emphasis in the first paragraph. What's impractical is looking at **every** QQ plot; that doesn't rule out looking at **some** and automating a search through QQ plot results. @Aksakal's answer illustrates this well. Thousands of significance tests also raise problems of multiplicity and of checking wild or puzzling results. You will be aware of that, but it needs emphasis. – Nick Cox Mar 24 '15 at 11:41
  • @NickCox: I have nothing against your suggestion of looking at some QQ plots for an _initial_ **exploratory** assessment (trends, outliers, etc.), however I don't think that such visual EDA methods can be automated by definition. That is, unless one uses AI/ML to process plots, which is likely much more difficult and resource-intensive than analytical approaches. I agree with you that one needs to be careful with _scaling_ analytical methods, but I think that it is still easier and more accurate to come up with good **strategies** for _analytical solutions_. – Aleksandr Blekh Mar 24 '15 at 12:05
  • As before, this boils down to emphasis, not disagreement. I think that any automated analysis would, or should, need to be **followed** also by looking at the poorest fits to see what is happening. – Nick Cox Mar 24 '15 at 12:10
  • @NickCox: I agree. I like Aksakal's idea of sampling QQ plots and your thoughts on some manual analysis (and maybe intervention), but they just represent an exploratory approach. My answer, as you noticed, emphasizes an analytical one. Certainly, the optimal (and correct) way of doing things is to combine both approaches so that they "inform" each other and, ultimately, inform the decision-maker. – Aleksandr Blekh Mar 24 '15 at 12:16