
To study k-nearest neighbors with my students using a more concrete example than the iris dataset, I would like to generate age, weight, and sex data based on the cdc.gov statistics for persons between 2 and 20 years old.

I would like to generate a single dataset based on three statistics:

  1. Weight-for-age charts
  2. Stature-for-age charts
  3. BMI-for-age charts

I don't really see how to satisfy all of these constraints at once.

First of all, I would like to know how to fit a statistical distribution to quantiles, and which statistical distribution I should use (for one line of each file).

Richard Hardy
Ben
  • 1
    The empirical distribution naturally fits all empirical quantiles. But simulating from this empirical distribution will not produce the same empirical quantiles. – Xi'an May 07 '20 at 06:20
  • 1
    Well, first of all this is paint by numbers not science. BMI has no physical relationship to anything (real physics, not make believe). Age and weight correlate but other factors, gender, diet, inheritance, life-style, sociological like cultural, exercise, etc. would be needed to narrow the dispersion in the data to useful levels. Which distribution of what depends, has to be measured, not guessed at. – Carl May 07 '20 at 07:23

1 Answer

For the 24-month olds, the CDC has essentially given us the lines in this graph:

[Figure: red dots plotted by height and weight, overlaid with the CDC percentile lines for height, weight, and BMI]

The vertical lines show the actual 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 97th percentiles of height in cm; the horizontal lines show those percentiles of weight in kg; and the diagonal lines show the percentiles of BMI.

The challenge is to find a distribution for the red dots so that the lines are roughly in the right places to be the corresponding percentiles for the red dots as well.

The simplest reasonable model for this is that at each age, the height and weight have a bivariate normal distribution.

At this age, the height and weight might have a distribution with

  • mean height = 86.675 cm
  • mean weight = 12.608 kg
  • sd of height = 3.513 cm
  • sd of weight = 1.333 kg
  • correlation = 0.651
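
As a minimal sketch of how one might draw simulated children from this model, the five numbers above can be turned into a mean vector and covariance matrix and passed to NumPy. The seed, sample size, and variable names are arbitrary choices for illustration, not part of the fit.

```python
import numpy as np

# Parameters quoted above for 24-month-olds.
mean = np.array([86.675, 12.608])   # height (cm), weight (kg)
sd = np.array([3.513, 1.333])
rho = 0.651

# Covariance matrix built from the standard deviations and the correlation.
cov = np.array([
    [sd[0] ** 2,          rho * sd[0] * sd[1]],
    [rho * sd[0] * sd[1], sd[1] ** 2],
])

rng = np.random.default_rng(0)      # seed chosen arbitrarily
sample = rng.multivariate_normal(mean, cov, size=100_000)
height, weight = sample[:, 0], sample[:, 1]

# BMI in kg/m^2, derived from the simulated height and weight.
bmi = weight / (height / 100) ** 2
```

The empirical percentiles of `height`, `weight`, and `bmi` (for example via `np.percentile`) can then be compared against the CDC lines.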

The CDC tells us that the 3rd percentile of height is 79.91 cm, i.e. only 3% of the dots should lie to the left of the leftmost vertical line. But for the above bivariate distribution, 2.7% of the distribution lies below 79.91 cm, because $$\Phi((79.91-86.675)/3.513) = 0.027$$ So the error of the model for this datapoint is 2.7% - 3% = -0.3%.
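
The height check can be reproduced with the standard library alone. This is a small sketch; `normal_cdf` is a helper name introduced here for illustration.

```python
from math import erfc, sqrt

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, via the complementary error function."""
    return 0.5 * erfc(-(x - mean) / (sd * sqrt(2)))

# Share of the model's heights below the CDC 3rd percentile (79.91 cm).
p_model = normal_cdf(79.91, 86.675, 3.513)

# Signed error: model probability minus the target level of 3%.
error = p_model - 0.03
print(f"model: {p_model:.3f}, error: {error:+.3f}")
```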

The CDC also tells us that the 10th percentile of BMI is 15.09, i.e. 10% of the red dots should lie below the third diagonal line. But for the above bivariate distribution, 10.8% of the distribution lies below that line, by the integral $$\int_{h=-\infty}^{\infty}\int_{w=-\infty}^{15.09(h/100)^2} P(height=h,weight=w)\, dw\, dh = 0.108$$ where $P$ denotes the joint density of height and weight. So the error of the model for this datapoint is 10.8% - 10% = 0.8%.
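
One way to evaluate this double integral numerically, sketched here under the bivariate normal assumption, is to do the inner integral over weight in closed form (weight given height is again normal) and integrate the result over height on a grid. The grid width and resolution are arbitrary choices.

```python
import numpy as np
from math import erfc, sqrt, pi

mu_h, mu_w = 86.675, 12.608
sd_h, sd_w, rho = 3.513, 1.333, 0.651
bmi_cut = 15.09                      # CDC 10th percentile of BMI

def std_normal_cdf(z):
    return 0.5 * erfc(-z / sqrt(2))

# Given height h, weight is normal with mean mu_w + slope * (h - mu_h)
# and standard deviation sd_w_given_h.
slope = rho * sd_w / sd_h
sd_w_given_h = sd_w * sqrt(1 - rho ** 2)

# Grid over height spanning +/- 8 standard deviations.
h = np.linspace(mu_h - 8 * sd_h, mu_h + 8 * sd_h, 4001)
dh = h[1] - h[0]
density_h = np.exp(-0.5 * ((h - mu_h) / sd_h) ** 2) / (sd_h * sqrt(2 * pi))

# Weight at which BMI equals the cutoff, for each height on the grid.
w_cut = bmi_cut * (h / 100) ** 2
z = (w_cut - (mu_w + slope * (h - mu_h))) / sd_w_given_h
p_below_given_h = np.array([std_normal_cdf(v) for v in z])

# Riemann sum over height gives the probability that BMI < cutoff.
p_model = float(np.sum(density_h * p_below_given_h) * dh)
print(f"P(BMI < {bmi_cut}) = {p_model:.3f}")
```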

We can similarly compute the error of the model for all nine percentiles of height, weight, and BMI. This gives a mean squared error of (2.1%)^2 averaged over the 27 datapoints, i.e. a root-mean-square error of 2.1%: in this distribution, we typically see about 2.1% too many or too few points on each side of each line.
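
The averaging step can be sketched as follows. Only the two errors worked out above are included, so the printed value illustrates the calculation and is not the 2.1% figure, which needs all 27 errors.

```python
from math import sqrt

# Signed errors (model probability minus CDC level), in percentage points.
# Just the two datapoints computed above; the full fit uses 27.
errors = [-0.3, 0.8]

mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = sqrt(mse)                                  # root-mean-square error
print(f"RMSE over {len(errors)} datapoints: {rmse:.2f} percentage points")
```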

The model is a decent first fit because that error is the smallest I could find using Mathematica in a quick-and-dirty minimization over all bivariate normal models. To take this calculation from quick-and-dirty to admissible would require weighting the errors according to $(p/100)(1-p/100)$, which is proportional to the variance of the above probability calculations. (See whuber's comments here.)

One could also make the model fit the data better by playing with lognormal or other distributions for weight and height, or playing with different copulas. In any case, this gets you some numbers to start from.

Matt F.