
There's a hidden variable that I'd like to approximate. I know several summary statistics: min, max, mean, median, standard deviation, and n; and I know that it's approximately normal.

I can obviously generate a plain normal distribution from the mean and standard deviation, but I know the variable is slightly skewed and has limited tails. My approximation won't be perfect, but it should be closer than a plain normal (see the sketch after the example). I'll probably implement this in R, but suggestions not tied to any particular platform are appreciated.

Example:

Xbar <- 101.73
Xmedian <- 97.7
Xs <- 20.45   # standard deviation
Xmin <- 50
Xmax <- 160
n <- 148
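
For reference, a minimal sketch of the plain-normal baseline I mentioned, assuming draws outside [Xmin, Xmax] are simply rejected and redrawn; the truncation rule is just an assumption for illustration.

# plain-normal baseline: oversample, then keep the first n in-range draws
# (rejection at the bounds is an assumption for illustration)
set.seed(1)
x <- rnorm(10 * n, mean = Xbar, sd = Xs)
x <- x[x >= Xmin & x <= Xmax][1:n]
c(min(x), max(x), median(x), mean(x), sd(x))  # compare to the targets above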
chmullig
  • The min and max will be more useful if you have the *count* of your data as well. Is that available? Is `Xs` the sample standard deviation? – whuber Mar 29 '13 at 22:54
  • Yes, I have n. And yes, Xs is standard deviation. Updated! – chmullig Mar 29 '13 at 22:59
  • This is not quite the same thing, but might be interesting anyway: [Chatterjee & Firat (2012). Generating Data with Identical Statistics but Dissimilar Graphics](http://amstat.tandfonline.com/doi/abs/10.1198/000313007X220057?journalCode=utas20#preview). – gung - Reinstate Monica Mar 29 '13 at 23:50
  • There are an infinite number of distributions which could reasonably produce these sample characteristics. Are your values unbounded, effectively unbounded, or perhaps bounded at one end? Have you considered trying, say, a gamma distribution (possibly with a shift parameter), which would let you fit that mild skewness? – Glen_b Mar 30 '13 at 00:28
  • I know it's roughly normal, and it's bounded between [50, 160]. – chmullig Mar 30 '13 at 00:51
  • I'm too lazy to write something up, but this is probably a [practical solution using ABC algorithms.](http://stats.stackexchange.com/questions/37729/how-to-do-estimation-when-only-summary-statistics-are-available/37734#37734) – Cam.Davidson.Pilon Apr 03 '13 at 01:11

3 Answers


You must specify a model; you cannot estimate a model or reconstruct a distribution function from summary statistics alone. If you had the data, you could at best do non-parametric estimation, e.g. bootstrap or density estimation. Without the actual data you cannot do any non-parametric procedure; you must specify a parametric model. Given that you have sample moments, I suggest you pick a model and use the method of moments to estimate it. If you don't know anything beyond the fact that it's roughly normal, just use a normal distribution, since you have no justification for using anything else.
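
For example, under a normal model the method-of-moments fit is immediate, since the sample mean and standard deviation are the moment estimates of mu and sigma. A minimal sketch using the figures from the question:

# method-of-moments fit of a normal: plug in the sample mean and sd
mu_hat    <- 101.73
sigma_hat <- 20.45

qnorm(c(0.25, 0.5, 0.75), mu_hat, sigma_hat)  # quartiles implied by the fit
pnorm(c(50, 160), mu_hat, sigma_hat)          # mass below the observed min / max
sim <- rnorm(148, mu_hat, sigma_hat)          # simulate a sample from the fitted model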

guest47

If you just want a distribution that looks approximately normal and satisfies your descriptive statistics, here is one possible approach: start with a normally distributed sample of 148 numbers and apply a series of transformations until the descriptive statistics are (approximately) satisfied. Of course, there are many distributions that could satisfy the problem...

# function for descriptive stats
stats = function(x)  c(min(x),max(x),median(x),mean(x),sd(x))

# simple power transformation (hold min and max constant)
pow = function(x,lam) {
   t = (x-min(x))^lam
   (t/max(t))*(max(x)-min(x))+min(x)
}

# power transform of upper and lower halves of data (hold min, max, median constant)
# assumes x is sorted and has length 148
pow2 = function(par, x) {
    m = median(x)
    t1 = pow(m-x[1:74], par[1])
    t2 = pow(x[75:148]-m, par[2])
    c(m-t1, t2+m)
}


# transformation to fit minimum and maximum
t1 = function(x) {
   ((x - min(x)) / diff(range(x))) * (160 - 50) + 50
}

# optimise power transformation to match the median
t2 = function(x) {
   l = optimise(function(l) { (median(pow(x,l)) - 97.7)^2 }, c(-5,5))$minimum
   pow(x,l)
}

# optimise power transformation of upper and lower halves to fit mean and sd
t3 = function(x) {
    l2 = optim(c(1,1), function(par) { 
       r = pow2(par,x); (mean(r)-101.73)^2 + (sd(r)-20.45)^2 })$par
    pow2(l2, x)
}

d = t1(sort(rnorm(148)))
stats(d)
d = t2(d)
stats(d)
d = t3(d)
stats(d) # result should match your descriptive stats
hist(d)  # looks normal-ish

# repeat and plot many distributions that satisfy requirements
plot(d,cumsum(d), type="l")
for(n in 1:500) { 
   d = t3(t2(t1(sort(rnorm(148)))))
   lines(d,cumsum(d), col=rgb(1,0,0,0.05))
}
waferthin

You could use a mixture of normals. Choose the smallest number of components which gets you close enough to the distribution you have in mind. "Close enough" is a matter for your judgement. Here's an example.

# Parameters of the mixture
p1 = 0.6
m1 = 95
s1 = 6
m2 = 103
s2 = 26

# Number of obs.
n = 148

# Draw the component indicators
set.seed(31337)
mix_indicator = rep(1,n)
mix_indicator[which(runif(n) > p1)] = 2

# Draw the normals
draws = rnorm(n)*s1 + m1
draws[which(mix_indicator==2)] = rnorm(sum(mix_indicator==2))*s2 + m2

print(mean(draws))    # 100.9
print(median(draws))  # 97.1
print(sd(draws))      # 18.4
print(min(draws))     # 49
print(max(draws))     # 175
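
If you want the mixture to reproduce the target figures more closely, the component parameters can be tuned by matching the theoretical mean, sd, and median of the mixture to the targets. A rough sketch (my own addition, not required by the approach above), in which p1 and the sds are transformed so the optimiser stays in a valid range:

# theoretical mean, sd and median of a two-component normal mixture
mix_stats = function(p1, m1, s1, m2, s2) {
  mu  = p1 * m1 + (1 - p1) * m2
  v   = p1 * (s1^2 + m1^2) + (1 - p1) * (s2^2 + m2^2) - mu^2
  med = uniroot(function(q) p1 * pnorm(q, m1, s1) + (1 - p1) * pnorm(q, m2, s2) - 0.5,
                interval = c(0, 300))$root
  c(mean = mu, sd = sqrt(v), median = med)
}

target = c(mean = 101.73, sd = 20.45, median = 97.7)

# squared distance to the targets; plogis/exp keep p1 in (0,1) and the sds positive
obj = function(par) {
  s = mix_stats(plogis(par[1]), par[2], exp(par[3]), par[4], exp(par[5]))
  sum((s - target)^2)
}

fit = optim(c(qlogis(0.6), 95, log(6), 103, log(26)), obj)
mix_stats(plogis(fit$par[1]), fit$par[2], exp(fit$par[3]), fit$par[4], exp(fit$par[5]))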
Jamie Hall