
As part of my exploratory data analysis (EDA) prior to further analysis, I'm trying to determine the probability distributions of my pilot dataset's variables. A particular feature of this dataset is a significant share of missing values. I partially alleviated this problem by performing multiple imputation (MI) with the Amelia R package, which reduced the share of missing values from 98% to 31%. If it's important, further analysis includes EFA, CFA and PLS-SEM modeling.
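
For concreteness, here is a minimal sketch of the imputation step with `Amelia`, run on a toy stand-in for the pilot data; the variable names `commits`, `stars` and `category` are hypothetical placeholders, not the actual dataset. The key point is that `amelia()` returns `m` completed datasets, and downstream analyses should be repeated on each and the results pooled.

```r
# Minimal sketch of the MI step with Amelia, on a toy stand-in for the pilot
# data; `commits`, `stars` and `category` are hypothetical placeholder names.
library(Amelia)
set.seed(42)

n <- 1000
pilot <- data.frame(
  commits  = rlnorm(n, 3, 1),
  stars    = rlnorm(n, 2, 1.5),
  category = sample(c("lib", "app", "tool"), n, replace = TRUE)
)
pilot$commits[sample(n, 300)] <- NA      # knock out values at random
pilot$stars[sample(n, 400)]   <- NA

imp <- amelia(pilot, m = 5, noms = "category", logs = c("commits", "stars"))
str(imp$imputations[[1]])                # one of the m completed datasets
```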

I have several questions in this regard. The first, and probably main, question is: What is the correct (or optimal) approach to distribution fitting in terms of using parametric versus non-parametric methods? Another question is: Does it make sense to combine both approaches for validation? The final question is: How does the presence of missing data influence the approach to distribution fitting?

The following are some of my thoughts, based on reading relevant discussions on CrossValidated. I apologize in advance if they don't display a high level of statistical rigor, as I'm not a statistician, but a software developer turned social science researcher and aspiring data scientist.

In his answer to this question, @Glen_b suggests that, given a large sample, a non-parametric approach is easier and better, or at least not worse. However, it's not clear to me whether this rule of thumb has any "contraindications", so to speak. It is also not clear what the consensus is, if any, regarding the usefulness of an automatic or semi-automatic distribution-fitting process.

In this great discussion, @Glen_b demonstrates investigating a real data distribution by applying some transformations. In this regard, if the distribution is not multimodal, but just heavily skewed, it's not clear whether it makes sense to determine the data distribution at all, versus simply transforming the data toward normality with a Box-Cox transformation.
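
To make the transform-versus-fit question concrete, here is a small sketch, assuming a single positive, right-skewed variable, of estimating a Box-Cox lambda with `MASS::boxcox()` and applying it; the simulated `x` merely stands in for a real variable.

```r
# Sketch: estimate a Box-Cox lambda for a positive, right-skewed variable and
# apply the transformation; the simulated `x` stands in for a real variable.
library(MASS)
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)

bc     <- boxcox(x ~ 1, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]          # lambda maximizing the profile likelihood

x_bc <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda

par(mfrow = c(1, 2))
hist(x, main = "Original"); hist(x_bc, main = "Box-Cox transformed")
```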

In this discussion, @jpillow recommends, along with Q-Q plots, the Kolmogorov-Smirnov statistical test. However, in his paper "Fitting distributions with R", Vito Ricci states (p. 19): "Kolmogorov-Smirnov test is more powerful than chi-square test when sample size is not too great. For large size sample both the tests have the same power. The most serious limitation of Kolmogorov-Smirnov test is that the distribution must be fully specified, that is, location, scale, and shape parameters can't be estimated from the data sample. Due to this limitation, many analysts prefer to use the Anderson-Darling goodness-of-fit test. However, the Anderson-Darling test is only available for a few specific distributions." Then there are the Shapiro-Wilk and Lilliefors tests, and the above-mentioned chi-square test, which can also be applied to non-continuous distributions. Again, I'm rather confused about the decision-making process for selecting which tests I should use.
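
For illustration, here is a hedged sketch of how these checks might be combined in `R` on the same hypothetical skewed variable; the workflow (visual check first, then formal tests), rather than the particular data, is the point.

```r
# Sketch of combining a visual check with several formal tests of normality;
# `x` is the same hypothetical skewed variable, regenerated so this stands alone.
library(nortest)                         # lillie.test(), ad.test()
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)

qqnorm(x); qqline(x)                     # visual check first

shapiro.test(x)                          # Shapiro-Wilk (requires n <= 5000)
lillie.test(x)                           # Lilliefors: KS with estimated mean/sd
ad.test(x)                               # Anderson-Darling test of normality

# Plain KS with parameters estimated from the same data: the limitation Ricci
# describes applies, so the nominal p-value is not valid -- the Lilliefors
# correction above is the usual remedy.
ks.test(x, "pnorm", mean(x), sd(x))
```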

In terms of distribution fitting (DF), I have discovered several R packages, in addition to the ones mentioned in the paper by Ricci and elsewhere, such as 'fitdistrplus' (http://cran.r-project.org/web/packages/fitdistrplus) for parametric and non-parametric DF and 'kerdiest' (http://cran.r-project.org/web/packages/kerdiest) for non-parametric DF. This is an FYI for people who haven't heard about them and are curious. Sorry about the long question, and thank you in advance for your attention!
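
As a sketch of what parametric DF with 'fitdistrplus' can look like in practice, again on the hypothetical skewed variable and with the candidate families chosen purely for illustration:

```r
# Sketch of parametric DF with 'fitdistrplus': compare two candidate families
# (chosen purely for illustration) on the same hypothetical variable.
library(fitdistrplus)
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)

descdist(x, boot = 500)                  # Cullen-Frey plot to suggest candidates

fit_ln <- fitdist(x, "lnorm")            # maximum-likelihood fits
fit_ga <- fitdist(x, "gamma")

gofstat(list(fit_ln, fit_ga),            # KS/AD/CvM statistics plus AIC/BIC
        fitnames = c("lognormal", "gamma"))

plot(fit_ln)                             # density, CDF, Q-Q and P-P panels
```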

Aleksandr Blekh
  • You need to differentiate between missing vs. censored. I assume you actually mean missing - in which case, you only had 2% of your dataset before imputation?! If your data are just missing, how many samples did you actually have (not percent, but raw count)? Do you know WHY the other 98% were missing? If they were missing completely at random, then just ignore the missing data and fit what you have. If there is selection bias in the way the data are missing, you will need to account for this. With so much missing data, your research may not go very far. –  Aug 22 '14 at 18:44
  • @Eupraxis1981: Appreciate your help. When I say "missing", I mean missing values (`NA` in `R` terms). This particular dataset is a pilot one and N=100000. I have access to a dataset with N~1.8M. In terms of the reasons for the missing data, I can say the following. The topic of my research is modeling success of open source software (OSS) development, so I use open data. My guess is that the reasons for missing data are mostly lack of desire, time or needed info. However, it can be argued that missing data correlate with projects with(in) certain characteristics (categories) and, thus, are not MCAR. – Aleksandr Blekh Aug 22 '14 at 19:09
  • Do you happen to have data on at least some of each relevant category? I'm thinking you may want to do stratified sampling to avoid bias and make imputation more accurate. –  Aug 22 '14 at 21:41
  • @Eupraxis1981: Sorry, I'm not sure I understand. What categories are you talking about? Could you clarify this point? – Aleksandr Blekh Aug 22 '14 at 21:50
  • Per your comment above: "...it can be argued that missing data correlate with projects with(in) certain characteristics **(categories)** and, thus, are not MCAR." –  Aug 22 '14 at 22:02
  • @Eupraxis1981: Oh, sorry. Should have figured it out earlier. I'm not sure how to approach this, as OSS projects might be categorized by different criteria, so the taxonomies would differ. Moreover, I was trying not to deviate from simple random sampling, to ensure maximum validity. I understand that sometimes other sampling methods, such as stratified sampling, are used, but I haven't seen clear guidelines and information on how to assess the validity of these methods. – Aleksandr Blekh Aug 22 '14 at 22:04
  • The problem is that if you suspect that the data are not MCAR, then your simple random sample will NOT be valid. You need to understand the subpopulation(s) that are most likely to be missing and ensure you have a good enough sample of them. Then, you can correct your estimates to account for each stratum's proportion of the overall population. As it stands, all you have are 2,000 biased datapoints, and you add an additional 36,000 biased datapoints using imputation on a biased dataset. This will likely get you some serious questions from your lead investigator. –  Aug 22 '14 at 22:09
  • @Eupraxis1981: Understood. So, this is an issue of choosing the correct **sampling** method for data sets with (a large share of) missing values. If I understand correctly, this issue is completely separate from the other issue mentioned in my question - **distribution fitting** (DF): both a general approach and an approach to DF for data sets with missing values. Correct? – Aleksandr Blekh Aug 22 '14 at 22:17
  • Correct. You need to know that you are fitting distributions to representative samples before you do anything else. –  Aug 22 '14 at 23:15
  • @Eupraxis1981: All right! I have to go offline soon, also don't want to bother you too much today. Thank you for your feedback! I might have more questions in the coming days, if you don't mind. – Aleksandr Blekh Aug 22 '14 at 23:45
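
A minimal sketch of the stratified-sampling idea from the comment thread, assuming a hypothetical `projects` data frame with a `category` column standing in for whatever project taxonomy is chosen: sample within each stratum so every category is represented, and keep the strata's population shares for later weighting.

```r
# Sketch of stratified sampling by project category; `projects` and `category`
# are hypothetical placeholders for the real project list and taxonomy.
set.seed(7)
projects <- data.frame(
  id       = 1:50000,
  category = sample(c("lib", "app", "tool"), 50000, replace = TRUE,
                    prob = c(0.7, 0.2, 0.1))
)

strata  <- split(projects, projects$category)
sampled <- do.call(rbind, lapply(strata, function(d)
  d[sample(nrow(d), min(nrow(d), 500)), , drop = FALSE]))  # up to 500 per stratum

weights <- prop.table(table(projects$category))  # strata shares, usable as weights
```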

1 Answer


What is the correct (or optimal) approach to distribution fitting in terms of using parametric versus non-parametric methods?

There won't be one correct approach, and what might be suitable depends on what you want to "optimize" and what you're trying to achieve with your analysis.

When there is little data, you don't have much ability to estimate distributions.

There is one interesting possibility that sort of sits between the two. It's effectively parametric (at least when you fix the dimension of the parameter vector), but in a sense the approach spans the space between a simple parametric model and a model with arbitrarily many parameters.

That is to take some base distributional model and build an extended family of distributions based on orthogonal polynomials with respect to the base distribution as weight function. This approach has been investigated by Rayner and Best - and a number of other authors - in a number of contexts and for a variety of base distributions. This includes "smooth" goodness of fit tests, but also similar approaches for analysis of count data (which allow decomposing into "linear", "quadratic" etc components that deviate from some null model), and a number of other such ideas.

So, for example, one would take a family of distributions based around the normal distribution and Hermite polynomials, or uniforms and Legendre polynomials, and so on.

This is especially useful where a particular model is expected to be close to suitable, but the actual distribution will tend to deviate "smoothly" from the base model.

In the normal and uniform cases the methods are very simple, often more easily interpretable than other flexible methods, and often quite powerful.
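
To make this concrete, here is a minimal sketch of a smooth test of normality with two components, built from normalized Hermite polynomials of orders 3 and 4 (orders 1 and 2 are absorbed by estimating the mean and variance); with MLE standardization this reduces to the familiar skewness/kurtosis-type statistic, and the chi-square reference is only asymptotic.

```r
# Minimal sketch of a two-component smooth test of normality; with
# MLE-standardized data the order-1 and order-2 components vanish, and the
# remaining statistic is the familiar skewness/kurtosis-type test.
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)        # hypothetical skewed variable
z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # MLE-standardized values
n <- length(z)

h3 <- (z^3 - 3 * z) / sqrt(6)            # normalized Hermite polynomial, order 3
h4 <- (z^4 - 6 * z^2 + 3) / sqrt(24)     # normalized Hermite polynomial, order 4

U3 <- sqrt(n) * mean(h3)                 # each approximately N(0, 1) under normality
U4 <- sqrt(n) * mean(h4)

S <- U3^2 + U4^2                         # approximately chi-square(2) under the null
c(U3 = U3, U4 = U4, S = S, p.value = pchisq(S, df = 2, lower.tail = FALSE))
```

The components themselves are diagnostic: a large U3 indicates a skewness-like departure, a large U4 a kurtosis-like one, which is the kind of decomposition into interpretable pieces described above.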

Does it make sense to combine both approaches for validation?

It would often make sense to use a nonparametric approach to check a parametric one.

The other way around may make sense in some particular circumstances.
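
One way such a check might look in `R`, reusing the hypothetical lognormal example from the question's sketches: overlay a kernel density estimate on the density implied by the fitted parametric model and look for systematic discrepancies.

```r
# One possible nonparametric check of a parametric fit: overlay a kernel
# density estimate on the fitted density (reusing the hypothetical lognormal
# example from the question's sketches).
library(fitdistrplus)
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)

fit_ln <- fitdist(x, "lnorm")

hist(x, breaks = 40, freq = FALSE, main = "Parametric fit vs. kernel estimate")
curve(dlnorm(t, fit_ln$estimate["meanlog"], fit_ln$estimate["sdlog"]),
      xname = "t", add = TRUE, lwd = 2)           # fitted parametric density
lines(density(x), lty = 2, lwd = 2)               # kernel density estimate
legend("topright", c("fitted lognormal", "KDE"), lty = 1:2, lwd = 2)
```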

Glen_b
  • Appreciate your answer! Does the "in-between" method that you've mentioned have an **implementation** as an `R` function or package? Actually, I'm starting to **question** whether I really need *distribution fitting* **at all**, as my ultimate goal is to perform a full SEM analysis, and I have chosen *PLS-SEM*, which "doesn't impose any distributional assumptions on the data" (Sanchez, 2013, p. 34). **References:** Sanchez, G. (2013). *PLS Path Modeling with R.* Retrieved from http://gastonsanchez.com/PLS_Path_Modeling_with_R.pdf. – Aleksandr Blekh Aug 23 '14 at 14:44
  • In regard to the **amount of data** that I have, here's the situation. As I've mentioned in the discussion above with @Eupraxis1981, I have a lot of data (on the order of 10^5-10^6 records), but a large part of it contains missing values. So, it seems to me that I'm facing three separate(?) **challenges**: 1) to select the **appropriate** *sampling method*, considering missingness patterns; 2) to handle the *missing data* issue (that seems to be easy with `MI` and the `Amelia` package); 3) to devise a *distribution fitting* strategy and implement it (if I need it at all, per my comment above). – Aleksandr Blekh Aug 23 '14 at 15:01
  • Not only are there some packages, there's a whole book - at least on the testing side (but fitting is a natural part of the testing process) - [*Smooth tests of Goodness of Fit Using R*](http://au.wiley.com/WileyCDA/WileyTitle/productCd-0470824425.html). Some packages: (1) [here](http://www.biomath.ugent.be/~othas/smooth2/R-Package.html) (I don't know if it works in recent versions of R) also the package `ddst` on CRAN is related (see the references in its manual). There are likely other such packages, though the Gaussian and uniform ones are very easily implemented. – Glen_b Aug 23 '14 at 22:44
  • Some further references: [Cosmo Shalizi's notes](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch17.pdf). For an example of a related paper, see Nair, V. (1987) "Chi-Squared-Type Tests for Ordered Alternatives in Contingency Tables", JASA, **82**:397, 283-291 (there are many such papers, but I wanted to give one with a different author than ones already mentioned). – Glen_b Aug 23 '14 at 22:45
  • Thanks a lot for the info! Would appreciate your opinion on the issues mentioned in my previous comment. – Aleksandr Blekh Aug 23 '14 at 23:26
  • So, Glen_b, what polynomials would you use for extreme value distributions, lognormal, or gamma? Thx! – Oliver Amundsen Nov 05 '14 at 23:40
  • I think we're getting firmly into "ask a new question" territory here (and that might go better on math.SE), but for the gamma, there are the Laguerre polynomials; for the lognormal, while it has orthogonal polynomials (which aren't all that hard to track down), for that case I'd just take logs and use Hermite. For the extreme value distributions, it depends on which one you mean; I'd have a suggestion for the Gumbel which might work. I don't know what they might be for Weibull or Fréchet. – Glen_b Nov 06 '14 at 01:01
  • What would be the suggestion for the Gumbel? Also, would you please recommend a textbook/tutorial on the topic of orthogonal polynomials and PDFs as the topic is new to me. Thanks!! – Oliver Amundsen Nov 06 '14 at 11:40
  • I posted the question in: http://math.stackexchange.com/questions/1008908/use-orthogonal-polynomials-to-findfit-probability-distribution-to-missing-data – Oliver Amundsen Nov 06 '14 at 12:21