I'm looking for an approximation to the probability density function (PDF) of a lognormal distribution, for use in nonlinear regression against a dataset. Alternatively, an approximation to its cumulative distribution function (CDF) would also work.
I have several goals:
- Determine how closely a sample dataset drawn from a process of unknown distribution matches a lognormal distribution.
- Given a set of samples drawn from a process determined or known to be lognormally distributed, but with unknown parameters, determine the probability of observing a future sample with a given value that may lie outside the range of values observed thus far.
- Given a set of samples as above, and a new sample that may lie outside the range of values in the samples already seen, determine the likelihood of having observed that sample.
- While computing the above, efficiency and simplicity of the implementation are quite important. A fast approximation with well-understood properties, such as error bounds, is better than an accurate algorithm that is difficult to implement or computationally intensive.
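To make the first goal concrete, here is a minimal sketch of one possible closeness measure: the Kolmogorov-Smirnov distance between the empirical CDF of the log-samples and a normal distribution fitted to them. The choice of this measure, the function name `ks_distance_to_lognormal`, and the synthetic data are all assumptions for illustration, not part of the original problem. Because the parameters are estimated from the same data, the standard KS critical values do not apply directly, so the number is best read as a relative score.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def ks_distance_to_lognormal(samples):
    """Largest gap between the empirical CDF of ln(x) and a fitted normal CDF."""
    logs = sorted(math.log(x) for x in samples)
    n = len(logs)
    fitted = NormalDist(fmean(logs), stdev(logs))
    d = 0.0
    for i, v in enumerate(logs):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides.
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d

# Synthetic data for illustration only (parameters are made up).
rng = random.Random(0)
lognormal_data = [rng.lognormvariate(1.2, 0.6) for _ in range(1000)]
uniform_data = [rng.uniform(1.0, 20.0) for _ in range(1000)]

d_lognormal = ks_distance_to_lognormal(lognormal_data)
d_uniform = ks_distance_to_lognormal(uniform_data)
```

A smaller distance means the data look more lognormal; in this sketch the lognormal sample should score noticeably lower than the uniform one.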
For these purposes, I think the approximation should have a closed form, so that nonlinear regression can be done with lower computational overhead. Ideally, if it approximates the lognormal PDF itself, it would also have a simple integral, so that the CDF can be approximated as well.
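As one illustration of the kind of closed form described here, the sketch below uses the well-known logistic approximation to the normal CDF, applied to ln(x). With the scaling constant 1.702 its absolute error stays below 0.01 everywhere, and its derivative is also closed form, which gives a matching PDF. This is only a candidate, not a recommendation: the function names are hypothetical, and the error bound is absolute, so the relative error in the far tails (where probabilities are tiny) is large.

```python
import math

K = 1.702  # scaling constant that matches the logistic curve to the normal CDF

def lognormal_cdf_exact(x, mu, sigma):
    """Reference lognormal CDF via the complementary error function."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2))

def lognormal_cdf_logistic(x, mu, sigma):
    """Closed-form logistic approximation to the lognormal CDF."""
    z = (math.log(x) - mu) / sigma
    return 1.0 / (1.0 + math.exp(-K * z))

def lognormal_pdf_logistic(x, mu, sigma):
    """Exact derivative of the logistic approximation with respect to x."""
    f = lognormal_cdf_logistic(x, mu, sigma)
    return K * f * (1.0 - f) / (sigma * x)

# Largest absolute CDF error over a grid of standardized values z in [-6, 6].
mu, sigma = 1.2, 0.6  # illustrative parameters only
max_abs_error = max(
    abs(lognormal_cdf_exact(x, mu, sigma) - lognormal_cdf_logistic(x, mu, sigma))
    for x in (math.exp(mu + sigma * (i / 100.0)) for i in range(-600, 601))
)
```

Because both functions are elementary, they are cheap to evaluate inside a regression loop; the trade-off is the loss of accuracy in the tails.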
It's possible that I'm trying to go about this the wrong way. Here's a more specific question to help figure that out: suppose I gather 1000 samples from my process. The values of the samples mostly range from 1 to 10, with occasional samples up to 20 or so. I know (based on experience) that long-term it is quite possible to see samples in the range of 10 times higher than that, but I haven't observed any from this process yet. How can I determine the probability of the 1001st sample having a value greater than 100 or another arbitrary number? If the 1001st sample's value is 180, how can I determine how likely that was, based on the first 1000 samples?
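For reference, here is a minimal sketch of the direct calculation, assuming the process really is lognormal. If X is lognormal then ln(X) is normal, so the two parameters can be estimated from the mean and standard deviation of the log-samples, and the tail probability P(X > x) follows from the complementary error function. The helper names and the synthetic stand-in for the 1000 samples are hypothetical; real data would replace `samples`.

```python
import math
import random
import statistics

def fit_lognormal(samples):
    """Estimate (mu, sigma) of ln(X) from strictly positive samples."""
    logs = [math.log(x) for x in samples]
    return statistics.fmean(logs), statistics.stdev(logs)

def lognormal_sf(x, mu, sigma):
    """P(X > x) for a lognormal with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Synthetic stand-in for the 1000 observations (parameters are made up).
rng = random.Random(0)
samples = [rng.lognormvariate(1.2, 0.6) for _ in range(1000)]

mu_hat, sigma_hat = fit_lognormal(samples)
p_over_100 = lognormal_sf(100.0, mu_hat, sigma_hat)  # chance the next sample exceeds 100
p_over_180 = lognormal_sf(180.0, mu_hat, sigma_hat)  # chance of a value at least as large as 180
```

Since 100 and 180 lie well outside the observed range, these numbers are extrapolations: they are sensitive to small errors in the estimated sigma and to any departure of the true tail from lognormal.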