1

Preface: I am not a mathematician. I have a basic understanding of statistics but am not familiar with mathematical notation beyond high school algebra.

I am a software developer maintaining medical records forms for a clinic. We have a form that calculates pediatric patients' height, weight, BMI, and other growth-related percentiles. In the process of troubleshooting a probably unrelated issue with the percentile calculations, I encountered this function that I do not understand.

This function takes only the z-score as an argument and returns a number percentile. I was able to confirm that the method to derive the z-score is correct (it is based on the LMS method outlined in this nature.com article), but I don't understand what is going on in the percentile calculation.

Code in MEL, a scripting language created in the mid 90's by GE:

local p10
local p9
local p4
local p3
local p2
local p1
local p

p10 = 1 + (0.33267 * AbsValue(zScore))
p9 = 1 / p10
p4 = (0.4361836 * p9) - (0.1201676 * (p9 ^ 2)) + (0.937298 * (p9 ^ 3))
p1 =  2 * 3.14159265
p3 = (AbsValue(zScore)^2)
p2 = -1 * p3 / 2

local pp
pp = 1 / (p1 ^ 0.5) * (2.71828182845905 ^ p2) * p4
p = 1 - pp

if zScore > 0 then
    p = p * 100
else
    p = 100 - p * 100
endif

p = div((p + 0.5), 1)

return p

Is the calculation correct? (I know it's at least close, I can cross-check against a table.) If it is correct, is it specific to the LMS growth chart tables (from the CDC/WHO) in some way, or can it be generalized? Can anyone offer any insight into what is going on here? Even if it is all correct and should not be changed, I would like to name the variables more descriptively and add some comments about what calculations are being done and why.

I have been searching on this question all day and can find only either a) explanations on how to find a percentile by looking a z-score up on a table and b) explanations involving formulas full of unfamiliar notation that I don't even know where to start to understand.

  • I don't think the issue I was troubleshooting is related to this code, but it's hard to be sure without understanding the parts.
  • I found this question but looking through the links in the answer and comments there did not lead me to anything I found useful.
Shayan Shafiq
  • 633
  • 6
  • 17
coppereyecat
  • 113
  • 4
  • 1
    What you have there is the normal cumulative distribution function, it is not specific to the application domain. It's defined by an integral for which the exact solution has no "closed form"; this is a specific approximation to it, which you can find in the classic reference of Abramowitz & Stegun (1964). This is 26.2.16, under "polynomial and rational approximations for $P(x)$". The most useful naming scheme here is probably to copy exactly the notation of A&S ($t$, $p$, $a_1$,...) and include the reference in a comment (that way, you can immediately tell if it's correct or not). – Chris Haug Dec 09 '21 at 03:48
  • 1
    I should add that any statistical library will offer a better approximation than this one that you should probably use instead (even A&S had better ones, see 26.2.17 for example). You should not in general "roll your own" version of these important statistical functions for the same reason that you wouldn't create your own cryptography library. – Chris Haug Dec 09 '21 at 03:52
  • @ChrisHaug Thanks! if you want to submit that as an answer I would be happy to accept it, especially with a link to the reference if available. I would love to use a proper statistical library but MEL barely exists as a language, it wasn't really designed for what we're using it for here, and there are no libraries, etc for it. Eventually the idea is to migrate the form to Javascript at which point I will have more options. – coppereyecat Dec 09 '21 at 16:35
  • I was able to find this online ( [here is one source](http://www.convertit.com/go/educationplanet/reference/ams55.asp?Res=150&Page=932) ) but I am unclear as to what the symbol that resembles a cross between a lower case and upper case E is. I will see if it becomes more clear by renaming and rearranging the function to more closely resemble the format presented, but it seems to be a character with an inherent meaning not given in the surrounding text, what is it? – coppereyecat Dec 09 '21 at 17:59
  • I gather it must have something to do with Euler's number which appears in the function, but I don't see how the function gets the lines involving p1 (2*pi), p3 (z-score^2), and p2 (negative p3 / 2) from 26.2.16. I guess I have progress, but not a complete solution yet. – coppereyecat Dec 09 '21 at 18:17
  • You mean $\epsilon(x)$ (epsilon)? If so, that's the error term, as a function of the input. The exact value is $P(x)$, and then everything on the right hand side but excluding $\epsilon(x)$ is the approximation you're using (i.e. $\epsilon(x)$ is the difference between the true value and the approximation). The next line says that the error for this approximation is at most $10^{-5}$ in absolute value. – Chris Haug Dec 09 '21 at 18:31
  • Also see 26.2.1 on the previous page where $Z(x)$ is defined. – Chris Haug Dec 09 '21 at 18:33
  • Ah, I misunderstood what Z(x) was, that solved it. If you'd like to compile your comments into an answer I would be happy to accept it, that is all the information I was hoping for! – coppereyecat Dec 09 '21 at 20:57

1 Answers1

1

Expanded from comments:

The function you're after is the (standard) normal cumulative distribution function. It's not specific to the application domain you mention, it's just the probability that a standard normal random variable is less than or equal to some fixed value. You're answering the question: if these "z-scores" were drawn from a standard normal distribution, what would be the probability of observing a z-score which is smaller or equal to this one?

This function is defined mathematically by an integral:

$$P(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-y^2/2}dy$$

This has no closed form. That is, there is no exact formula for $P(x)$ involving only usual operations like sums and products or functions like exponentials, logarithms, trigonometric functions, etc. To implement this function in software, you need an approximate solution which you can actually compute and which is close enough to the true solution for your needs.

This function is ubiquitous in many fields and the problem of approximating it in software has been extensively studied. The classic reference on the topic of approximating mathematical functions is Abramowitz & Stegun (1964), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.

The specific approximation you show is given there as equation 26.2.16. Note that the approximation is defined only for $x \geq 0$, but you actually want to be able to compute $P(x)$ for any real $x$. You can extend it for negative $x$ by taking advantage of the symmetry of the distribution, which implies (26.2.5-26.2.6):

$$P(-x) = 1-P(x)$$

So, if the input is negative, flip the sign, compute the approximation on that, then return one minus the result (that's what AbsValue and the if/else at the end are doing in your implementation).

In their notation, they first define (the normal probability density function):

$$Z(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$

and the transformation:

$$t=\frac{1}{1+px}$$

The approximation is then given by:

$$1-Z(x)(a_1t+a_2t^2+a_3t^3)$$

$p$, $a_1$, $a_2$, $a_3$ are specific numerical constants given in the text. They have been carefully chosen so as to minimize some error criterion between the approximation and the true value. An absolute error bound is also given:

$$|\epsilon(x)| = |P(x) - \left[1-Z(x)(a_1t+a_2t^2+a_3t^3)\right]| < 10^{-5}$$

That is, this approximation always yields an answer which is within $10^{-5}$ of the true value. Note that this is an absolute bound, and it may not be very useful to have such a bound if you are looking at very small probabilities. More modern statistical software will use a more accurate approximation in general, or combine several approximations where each is better adapted to different ranges of the input parameter to get accurate results across all possible inputs.

Chris Haug
  • 4,893
  • 1
  • 17
  • 24