7

The probability density function (pdf) is the first derivative of the cumulative distribution function (cdf) of a continuous random variable. I take it that this only applies to well-defined parametric distributions like the Gaussian, Student's t, Johnson SU, etc., though.

If we are given real data that we know does not conform (perfectly) to some prior distribution, is it safe to assume that the real data's cdf cannot be differentiated and therefore has no pdf, forcing us to resort to histogram, kernel density, or log-spline approximations of the continuous data's pdf?

I am just trying to rationalize the whole model-fitting craze (Gaussian, t, Cauchy) that one always encounters in statistics, and why it always overrides approximation approaches (histogram, kernel density).

In other words, rather than use an estimator on the empirical data (histogram, kernel density), we are trained to look for a best-matching model (Gaussian, t, Cauchy) instead, even though we know the real data's pdf diverges from that model.

What makes the "modeling" approach better than the "approximation" approach? Is it more correct, and if so, how?
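To make the question concrete, here is a small sketch of my own (not from any answer, using `scipy`): the empirical cdf of a finite sample is a step function, so differentiating it yields nothing useful, and both a kernel density estimate and a fitted parametric model serve as smooth stand-ins for the unknown pdf.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.standard_t(df=5, size=200)  # "real" data that is not exactly Gaussian

# The empirical cdf is a step function: flat almost everywhere, with a
# jump of 1/n at each observation, so it has no usable derivative.
x = np.sort(sample)
ecdf = np.arange(1, len(x) + 1) / len(x)

# Nonparametric pdf approximation: Gaussian kernel density estimate.
kde = stats.gaussian_kde(sample)

# Parametric alternative: fit a Gaussian by maximum likelihood.
mu, sigma = stats.norm.fit(sample)

grid = np.linspace(-4, 4, 9)
print("KDE pdf:     ", np.round(kde(grid), 3))
print("Gaussian pdf:", np.round(stats.norm.pdf(grid, mu, sigma), 3))
```

Both estimators give a smooth density even though the ecdf itself is nowhere usefully differentiable.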

develarist
  • 3,009
  • 8
  • 31
  • Is your question: “why do we use parametric distributions instead of empirical distributions“? – Tim Sep 06 '20 at 14:47
  • yes i should say that – develarist Sep 06 '20 at 14:48
  • Convergence is faster for parametric distribution estimators than for non-parametric estimators, the higher the dimension the larger the advantage. – Xi'an Sep 06 '20 at 15:25
  • @Xi'an I would like to see this advantage, and by how much. Is there a source that actually demonstrates how increasing sample size will show parametric distributions to be more accurate for real data than its empirical distribution, as the empirical distribution also improves with sample size? – develarist Sep 06 '20 at 15:56
  • We do use the ecdf -- e.g. in bootstrapping. – Glen_b Sep 06 '20 at 16:32
  • If I fit a parametric distribution to empirical data using maximum likelihood estimation of the parameters, why won't the moments ($\mu, \sigma$) of the parametric distribution match the moments of the empirical data? – develarist Dec 01 '20 at 11:44

1 Answer

13

An enormous amount of data is needed to accurately estimate a distribution nonparametrically, especially a continuous one. Even then, some assumptions about the smoothness of the distribution are needed to fill the gaps (interpolate) between the observed values, and further assumptions are needed to extrapolate outside the observed data range. With a small or moderate sample, you would usually expect poor accuracy from nonparametric estimation. It would take a large discrepancy between the true distribution and the parametric model used to approximate it for the nonparametric approach to be more accurate. This is especially true in higher dimensions, as data become sparser as the dimension grows.
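A small simulation can illustrate this convergence claim. This is my own sketch, not part of the answer: assuming the true distribution is standard normal, it compares the average integrated squared error (ISE) of a maximum-likelihood Gaussian fit against a kernel density estimate at several sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
grid = np.linspace(-4, 4, 401)
step = grid[1] - grid[0]
true_pdf = stats.norm.pdf(grid)

def mean_ise(n, reps=200):
    """Average integrated squared error of each pdf estimator over `reps` samples."""
    err_param = err_kde = 0.0
    for _ in range(reps):
        sample = rng.standard_normal(n)
        mu, sigma = stats.norm.fit(sample)   # parametric: Gaussian MLE
        kde = stats.gaussian_kde(sample)     # nonparametric: kernel density
        err_param += np.sum((stats.norm.pdf(grid, mu, sigma) - true_pdf) ** 2) * step
        err_kde += np.sum((kde(grid) - true_pdf) ** 2) * step
    return err_param / reps, err_kde / reps

for n in (20, 100, 500):
    p, k = mean_ise(n)
    print(f"n={n:4d}  parametric ISE={p:.5f}  KDE ISE={k:.5f}")
```

When the parametric family is correct, its error shrinks at rate $O(1/n)$ while the KDE's shrinks at the slower nonparametric rate $O(n^{-4/5})$, so the parametric fit wins at every sample size here; the trade-off only reverses when the assumed family is badly wrong.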

Richard Hardy
  • 54,375
  • 10
  • 95
  • 219
  • is there a source that actually demonstrates how increasing sample size will show parametric distributions to be more accurate for real data than its empirical distribution as the empirical distribution also increases with sample size? – develarist Sep 06 '20 at 15:55
  • 2
    @develarist, I did not say increasing the sample size will make parametric distributions more accurate; it is rather the opposite. However, the sample sizes needed for accurate nonparametric estimation are very seldom available. Also, the problem as you have formulated it is simply too broad to analyze; such general results may not be obtainable. But you could show it for many separate, concrete cases. – Richard Hardy Sep 06 '20 at 15:57
  • Then there's no evidence to support it. I'd also be interested in *how much more* accurate the parametric distribution is than the empirical distribution for a low number of samples, and then for a high number of samples. Single out the Gaussian vs. empirical case if we have to de-broaden. – develarist Sep 06 '20 at 16:00
  • 2
    @develarist, quite naturally there is no concrete evidence for many general claims, but each concrete case provides evidence that can be generalized. – Richard Hardy Sep 06 '20 at 16:01
  • 1
    "_An enormous amount of data is needed to accurately estimate a distribution nonparametrically, especially a continuous one_" And more especially a continuous and multivariate one! – leonbloy Sep 07 '20 at 02:24
  • @leonbloy, exactly, as I say here: *This is especially true in higher dimensions, as data become sparser when the dimension grows.* – Richard Hardy Sep 07 '20 at 06:07
  • try the negation: With low number of samples, distributions cannot be accurately estimated non-parametrically, but with the same amount of data, (somehow) parametric distributions do a better job – develarist Sep 07 '20 at 06:33