Suppose we have partial data, and we know that it represents only the left 5% of the log-normal distribution that the full data follow. How can we estimate the mean-log and sd-log of the full log-normal distribution from this partial data? An answer in R is preferable.
-
There are some standard ways to do this, including maximum likelihood and regression on order statistics. But if by "left" you mean the *lower* five percent, then you should be concerned about the potential for enormous standard errors. Do you have a very large dataset? How confident are you that the upper 95% truly is lognormally distributed? – whuber Jun 17 '15 at 14:14
-
My specific example is about task execution durations (which do follow a log-normal distribution with _p_ close to 0.8) on the micro-task crowdsourcing platform [CrowdFlower](http://crowdflower.com). As all tasks start at the same time, I naturally receive the results for the fast tasks first. My goal is to predict the 95th-percentile duration, given only data describing the lower _k_ % (where _k_ is 5, 10, or 20) of the distribution. I cannot call my dataset very large: it has only about 500 items. – Pavel Kucherbaev Jun 17 '15 at 21:20
-
@whuber, could you provide example code that applies maximum likelihood and regression on order statistics to this specific use case? – Pavel Kucherbaev Jun 18 '15 at 07:46
-
See my answer at http://stats.stackexchange.com/questions/130156. It includes `R` code illustrating both ML and ROS. The `R` package `NADA` includes ROS capabilities. The free USEPA software [ProUCL](http://www2.epa.gov/land-research/proucl-software) incorporates ROS. (It's so cumbersome that it's of little use for extensive data, but it could be valuable for double-checking of calculations.) – whuber Jun 18 '15 at 11:16
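
For readers who want something to experiment with, here is a minimal sketch of the two approaches named in the comments (ML and ROS), written in base `R`. It is not the code from the linked answer. It assumes the total number of tasks `n` is known and that every unobserved duration exceeds the largest observed one (Type II right-censoring). The function name `fit_censored_lnorm` and the simulated parameter values are made up for illustration.

```r
# Sketch only: estimate log-normal parameters from the r smallest of n values.
# Assumptions: n is known, and the n - r unobserved values all exceed the
# largest observed value (Type II right-censoring).
fit_censored_lnorm <- function(x_obs, n) {
  y <- log(sort(x_obs))   # work on the log scale, where the data are normal
  r <- length(y)
  cutoff <- max(y)        # censoring point on the log scale

  # Regression on order statistics (ROS): regress the sorted logs on normal
  # quantiles of the plotting positions; intercept ~ meanlog, slope ~ sdlog.
  q <- qnorm(ppoints(n)[seq_len(r)])
  ros <- unname(coef(lm(y ~ q)))

  # Censored negative log-likelihood; sdlog is optimised on the log scale
  # so that it stays positive.
  nll <- function(par) {
    mu <- par[1]
    sigma <- exp(par[2])
    -(sum(dnorm(y, mu, sigma, log = TRUE)) +
        (n - r) * pnorm(cutoff, mu, sigma, lower.tail = FALSE, log.p = TRUE))
  }
  start <- c(ros[1], log(max(ros[2], 1e-6)))   # ROS estimates as start values
  fit <- optim(start, nll, method = "BFGS", hessian = TRUE)

  # Approximate standard errors from the inverse Hessian
  # (these are on the meanlog and log(sdlog) scales).
  vc <- tryCatch(solve(fit$hessian),
                 error = function(e) matrix(NA_real_, 2, 2))

  mu_hat <- fit$par[1]
  sigma_hat <- exp(fit$par[2])
  list(meanlog = mu_hat,
       sdlog = sigma_hat,
       se_meanlog = sqrt(vc[1, 1]),
       se_log_sdlog = sqrt(vc[2, 2]),
       q95 = exp(mu_hat + qnorm(0.95) * sigma_hat),  # predicted 95th percentile
       ros = c(meanlog = ros[1], sdlog = ros[2]),
       convergence = fit$convergence)
}

# Simulated illustration (parameter values are arbitrary, not from the thread):
set.seed(1)
n <- 500
x_all <- rlnorm(n, meanlog = 3, sdlog = 0.8)
x_obs <- sort(x_all)[1:100]          # only the fastest 20% are observed
est <- fit_censored_lnorm(x_obs, n)
str(est)
```

Rerunning the illustration with `1:25` (the lowest 5%) in place of `1:100` is a quick way to see how much the standard errors grow, which is the concern raised in the first comment. The sketch also offers no protection if the upper part of the distribution is not actually log-normal.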