
Is it possible to calculate or find out what the original distribution of a dataset was?

For example: I have (part of) a dataset with 800 weights, and I know that the original dataset contained 1000 weights and that the 20% heaviest weights were excluded from the dataset I have.

I wonder whether it is mathematically possible to find the original distribution of the full dataset. If so, which mathematical or statistical method can be used? Or are there packages or functions in R that can do this?

  • In general no, not unless you are willing to make some assumptions about the missing data, which is dangerous. – user2974951 Nov 24 '21 at 07:39
  • Thank you for your answer. I will make the assumption (based on theory from other authors) that my data is normally distributed. –  Nov 24 '21 at 07:44
  • 2
    This sounds a lot like truncation. If know the cut-off value (e.g. the weight of the heaviest non-excluded unit), the moments of the non-truncated distribution have closed-form solutions. Have a look at https://en.wikipedia.org/wiki/Truncated_normal_distribution – Otto Kässi Nov 24 '21 at 08:38
  • Please search our site for the keywords [maximum likelihood truncation normal](https://stats.stackexchange.com/search?q=maximum+likelihood+trunc*+normal). The duplicate was the first hit, but I'm sure many other hits have useful information. Arguably, these are *censored* data--it depends on how the dataset cutoff was determined--in which case https://stats.stackexchange.com/questions/354671/fitting-distributions-on-censored-data provides solutions. – whuber Nov 24 '21 at 14:48

2 Answers


If you can safely assume that your underlying data are normally distributed, then, as Otto Kässi writes, you have a truncated normal distribution. If you know where it was truncated, so much the better; if not, then with 800 data points below the point of truncation, simply using the maximum observation will likely be a sufficiently good estimate of it, and any uncertainty here will likely be dominated by the uncertainty in your normality assumption.

There are a few R packages that deal with the truncated normal (e.g., truncnorm and TruncatedNormal), but these only offer densities, random generation and so forth. You could in principle try fitdistrplus::fitdist() with distr="truncnorm", but the following code crashes my R (see also here):

library(truncnorm)
library(fitdistrplus)

# example dataset
data <- c(35, 12, 10.5, 9, 8.8, 8.5, 7.8, 7.2, 6.8, 6.5, 6.2, 6, 5.8, 5.5, 5.2, 5.1)

# fit a normal distribution truncated below at a = 5; this call crashes my R session
fitdist(data, "truncnorm", fix.arg = list(a = 5),
        start = list(mean = mean(data), sd = sd(data)))

An alternative would be Crain (1979), which sounds promising based on the abstract but which I unfortunately do not have access to.

Estimating mean and st dev of a truncated gaussian curve without spike gives further possibilities.
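
As a fallback, here is a minimal sketch of a direct maximum-likelihood fit using optim() together with dtruncnorm() from the truncnorm package. The data below are simulated as a stand-in for the actual 800 weights (the mean of 70 and sd of 10 are arbitrary assumptions), and the maximum observation is used as the truncation point, as discussed above:

library(truncnorm)

# simulated stand-in for the real problem: 1000 weights, heaviest 20% removed
set.seed(1)
full <- rnorm(1000, mean = 70, sd = 10)
obs  <- full[full <= quantile(full, 0.8)]

# negative log-likelihood of a normal distribution truncated above at b
nll <- function(par, x, b) {
  -sum(log(dtruncnorm(x, a = -Inf, b = b, mean = par[1], sd = par[2])))
}

# take the truncation point to be the largest observation, as discussed above
fit <- optim(par = c(mean(obs), sd(obs)), fn = nll,
             x = obs, b = max(obs),
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
fit$par  # estimated mean and sd of the untruncated distribution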

Stephan Kolassa

You can try Monte Carlo simulation:

  • Generate a large number of datasets with the properties that you know about (initially normal, heaviest 20% removed, anything else you know or can reasonably assume?)
  • Use some goodness-of-fit measure to compare these datasets to the one given (you can start by simply plotting and eyeballing); a minimal sketch of this approach follows the list.
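
For instance, here is a minimal sketch of this idea, assuming a simple grid search over candidate means and standard deviations and the two-sample Kolmogorov–Smirnov statistic as the goodness-of-fit measure; the observed sample is simulated here as a stand-in for the real 800 weights:

# simulated stand-in for the observed data: the 800 lightest of 1000 weights
set.seed(42)
obs <- sort(rnorm(1000, mean = 70, sd = 10))[1:800]

# grid of candidate parameters for the untruncated normal distribution
grid <- expand.grid(mean = seq(60, 80, by = 1), sd = seq(5, 20, by = 1))

score <- apply(grid, 1, function(p) {
  # simulate a full dataset and drop the heaviest 20%, mimicking the truncation
  sim <- sort(rnorm(1000, mean = p["mean"], sd = p["sd"]))[1:800]
  # two-sample Kolmogorov-Smirnov statistic as the goodness-of-fit measure
  suppressWarnings(ks.test(obs, sim)$statistic)
})

grid[which.min(score), ]  # candidate parameters closest to the observed data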

Focus on the things you know about the original dataset, prior to truncation. What was the underlying phenomenon being measured? (For example, if it is a sum of a large number of independent random events, the sum will be approximately normal regardless of their individual distributions, by the central limit theorem.)
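
As a quick illustration of that last point, sums of independent uniform random numbers already look very close to normal:

# sums of 50 independent uniform random numbers
set.seed(1)
sums <- replicate(10000, sum(runif(50)))
hist(sums, breaks = 50)      # roughly bell-shaped
qqnorm(sums); qqline(sums)   # points fall close to the reference line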