Suppose we have a small set of numbers (5 to 10 observations), and we're trying to fit a distribution to this set. We also know that all the numbers are positive. I tried to fit a lognormal distribution, but I'm not sure how good my estimates are since the sample is very small; I'm also not sure whether it is enough to look at a goodness-of-fit test, given the small sample size.

Any suggestions on how to tackle this issue (i.e., how to be more confident about my estimates)?

user9292
  • It's not clear if you're asking about how to assess whether the distribution is lognormal or about the uncertainty of the estimates of the lognormal parameters. In the former case, graphical investigation, using for instance boxplots and histograms, can give you a _very_ rough idea about the distribution, but apart from that, there really isn't much that you can do with only 5-10 observations. – MånsT Jul 10 '12 at 15:06
  • My main concern is not the distribution, but how good my estimates are, assuming I know the distribution... – user9292 Jul 10 '12 at 15:24
  • That, at least, is possible to make statements about. :) It would help if you told us which estimators you used for the parameters. – MånsT Jul 10 '12 at 15:40

1 Answer

I would not recommend using a goodness-of-fit test for such a small sample. For example, if you simulate $5$ to $10$ observations from a log-normal distribution, the Shapiro-Wilk normality test fails to reject normality, in the sense that the associated p-value is greater than $0.05$ more than $30\%$ of the time; that is, the test lacks the power to detect the departure from normality at this sample size. See the following R code.

# Simulate 10,000 log-normal samples of size 10 and record how often
# the Shapiro-Wilk test fails to reject normality at the 5% level
count = rep(0, 10000)

for(i in 1:10000){
  x = exp(rnorm(10))                         # log-normal sample, n = 10
  if(shapiro.test(x)$p.value > 0.05) count[i] = 1
}

mean(count)  # proportion of non-rejections (well above 30%)
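
To see that this is a lack of power at small $n$ rather than a defect of the test, you can rerun the same simulation with a larger sample, say $n = 100$; a minimal sketch along the same lines:

count100 = rep(0, 10000)

for(i in 1:10000){
  x = exp(rnorm(100))                        # log-normal sample, n = 100
  if(shapiro.test(x)$p.value > 0.05) count100[i] = 1
}

mean(count100)  # now close to 0: the test readily detects non-normality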

You might consider maximum likelihood estimation (MLE) and quantify the accuracy of the estimates by constructing likelihood-based confidence intervals for the parameters. One option is to use the profile likelihood of the parameters $(\mu,\sigma)$.

In this case, the MLEs of $(\mu,\sigma)$ for a sample $(x_1,\ldots,x_n)$ are

$$\hat\mu= \dfrac{1}{n}\sum_{j=1}^n\log(x_j);\,\,\,\hat\sigma^2=\dfrac{1}{n}\sum_{j=1}^n(\log(x_j)-\hat\mu)^2.$$

Now you can use the well-known result that a likelihood interval of level $0.147$ has an approximate confidence level of $95\%$.
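
This calibration follows from Wilks' theorem: $-2\log R(\theta)$ is approximately $\chi^2_1$, so the relative-likelihood cutoff corresponding to $95\%$ confidence is

$$c = \exp\left(-\tfrac{1}{2}\chi^2_{1,\,0.95}\right) = \exp(-3.84/2) \approx 0.147.$$

The following R code shows how to calculate these intervals for $\mu$ and $\sigma$ numerically and how to plot the profile likelihoods for your sample.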

# Your data
dat = c(0.6695, 0.5968, 0.7641, 0.7252, 0.7779)
n = length(dat)

# Relative profile likelihood of mu (sigma maximised out)
p.mu = function(mu){
  muh = mean(log(dat))   # MLE of mu
  return( (sum((log(dat)-muh)^2)/sum((log(dat)-mu)^2))^(0.5*n) )
}

# Plot the profile likelihood of mu
vec = seq(-0.75, 0, 0.01)
rmvec = sapply(vec, p.mu)

plot(vec, rmvec, type="l")

# The interval endpoints solve p.mu(mu) = 0.147
p.muint = function(mu) return(p.mu(mu) - 0.147)

# Approximate 95% confidence interval for mu
c(uniroot(p.muint, c(-0.6,-0.4))$root, uniroot(p.muint, c(-0.3,-0.1))$root)


# Relative profile likelihood of sigma (mu maximised out)
p.sigma = function(sigma){
  muh = mean(log(dat))                       # MLE of mu
  sigmah = sqrt(mean((log(dat)-muh)^2))      # MLE of sigma
  return( (sigmah/sigma)^n * exp(0.5*n) * exp(-0.5*n*sigmah^2/sigma^2) )
}

# Plot the profile likelihood of sigma
vec1 = seq(0.01, 0.3, 0.001)
rsvec = sapply(vec1, p.sigma)

plot(vec1, rsvec, type="l")

# The interval endpoints solve p.sigma(sigma) = 0.147
p.sigmaint = function(sigma) return(p.sigma(sigma) - 0.147)

# Approximate 95% confidence interval for sigma
c(uniroot(p.sigmaint, c(0.05,0.1))$root, uniroot(p.sigmaint, c(0.15,0.3))$root)
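
Once you have the MLEs, you can also plug them into `plnorm` to estimate log-normal probabilities (this comes up in the comments below); a minimal sketch:

# Plug-in estimate of P(X < 0.8) under the fitted log-normal
muh = mean(log(dat))
sigmah = sqrt(mean((log(dat)-muh)^2))
plnorm(0.8, muh, sigmah)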

I hope this helps.

  • Thanks for the help. I truly appreciate it. I'm going through it, trying to understand it... I tried it on my sample, but it gave me two errors: Error in uniroot(p.muint, c(-1, 0)) : f() values at end points not of opposite sign Error in uniroot(p.sigmaint, c(0.4, 0.75)) : f() values at end points not of opposite sign – user9292 Jul 10 '12 at 16:40
  • here is my sample: dat – user9292 Jul 10 '12 at 16:42
  • @act00 Fixed for your sample! Please, have a look. –  Jul 10 '12 at 16:47
  • Thank you so much. Let me go over it to digest everything, and I will ask some questions. Thanks again!!! – user9292 Jul 10 '12 at 17:12
  • Isn't it good that Shapiro-Wilk rejects > 30% of the time, given that the sample isn't in fact Normal? And of course that number depends on the parameters of the lognormal, holding the sample size fixed, as the lognormal can be made arbitrarily close to a Normal. – jbowman Jul 10 '12 at 18:44
  • @jbowman As I mentioned, the p-value is greater than 0.05 (much greater), so it is actually **not rejecting** (at the 5% level) more than 30% of the time. –  Jul 10 '12 at 18:45
  • Hah! Chasing you across threads here... I see your point, it's a statement about power, or lack thereof. That "/significance level" threw me for a bit. – jbowman Jul 10 '12 at 18:50
  • I have two questions, and please correct me if I'm wrong. First, if I now want to calculate $P(x<0.80)$, how can this be done? Second, what if I decided to use a t distribution (log-t)? What should I change in the code? Thanks again. – user9292 Jul 10 '12 at 18:52
  • @jbowman Haha, nice one. Yeah, I probably did not explain it properly. My Swahili is better than my English. –  Jul 10 '12 at 18:52
  • @act00 You can calculate this using the estimated parameters: $P(X<0.80)=$ `plnorm(0.8,muh,sigmah)`. If you want to use a log-t, then you would need to change a lot of things. For instance, you would need to obtain the MLEs numerically because, as far as I know, they do not exist in closed form. This can be done, but I think the accuracy would be worse because you have to include an additional parameter, the degrees of freedom. –  Jul 10 '12 at 18:56