
I have to create charts (similar to growth charts) for children aged 5 to 15 years (whole years only: 5, 6, 7, etc.; there are no fractional values like 2.6 years) for a health variable that is non-negative, continuous, and mostly in the range 50-150 (with only a few values outside this range). I have to create 90th, 95th and 99th percentile curves and also tables of these percentiles. The sample size is about 8000.

I checked and found the following possible approaches:

  1. Find the empirical quantiles at each age and then use the loess method to get a smooth curve through these quantiles. The degree of smoothness can be adjusted with the 'span' parameter (a rough sketch of this is shown after the list).

  2. Use the LMS (Lambda-Mu-Sigma) method (e.g. using the gamlss or VGAM packages in R).

  3. Use quantile regression.

  4. Use the mean and SD of each age group to estimate percentiles for that age and create percentile curves.
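
For reference, here is a rough sketch of what I mean by options 1 and 2. The data frame `d` with columns `age` and `y` is made up purely for illustration, and the gamlss calls follow the package documentation as I understand it, so please correct me if they are wrong:

```r
## Made-up illustrative data (replace with the real data frame)
set.seed(1)
d <- data.frame(age = sample(5:15, 8000, replace = TRUE))
d$y <- 60 + 4 * d$age + rnorm(nrow(d), sd = 10)

## Option 1: empirical quantiles per age, smoothed with loess
probs <- c(0.90, 0.95, 0.99)
q_by_age <- aggregate(y ~ age, data = d,
                      FUN = function(v) quantile(v, probs = probs))
q_by_age <- data.frame(age = q_by_age$age, q_by_age$y, check.names = FALSE)

plot(y ~ age, data = d, col = "grey", pch = ".")
for (j in 2:ncol(q_by_age)) {
  fit <- loess(q_by_age[[j]] ~ q_by_age$age, span = 0.75)  # 'span' controls smoothness
  lines(q_by_age$age, predict(fit), lwd = 2)
}

## Option 2: LMS via gamlss (BCCG family), percentile curves and a percentile table
library(gamlss)
lms_fit <- gamlss(y ~ pb(age), sigma.formula = ~ pb(age), nu.formula = ~ pb(age),
                  family = BCCG, data = d)                      # slower than option 1
centiles(lms_fit, xvar = d$age, cent = c(90, 95, 99))           # percentile curves
centiles.pred(lms_fit, xname = "age", xvalues = 5:15,
              cent = c(90, 95, 99))                             # percentile table
```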

What is the best way to do this? By 'best' I mean either the ideal method that is standard for creating such growth curves and would be acceptable to all, or an easier, simpler-to-implement method that may have some limitations but is acceptable and quicker (for example, using loess on percentile values is much faster than the LMS method in the gamlss package).

Also, what would the basic R code for that method be?

Thanks for your help.

rnso
  • You're asking for "best", which is usually anywhere between difficult and impossible to discuss definitively. (The "best" measure of level is difficult enough.) You have clearly tied your question to health changes in children, but your criteria on "best" are not explicit, in particular what kinds or degrees of smoothness are acceptable or unacceptable. – Nick Cox Dec 17 '14 at 12:37
  • I welcome the attempt, but an ideal, universally accepted method evidently doesn't exist; else why are there competing solutions, or why isn't this evident in the literature you are reading? Interest in this problem is surely decades if not centuries old. Easier means: easier to understand, easier to explain to medics or non-statistically minded professionals in general, easier to implement, ...? I am no doubt seeming picky, but why should you care about speed here? None of these methods is computationally demanding. – Nick Cox Dec 17 '14 at 13:14
  • @NickCox: I have edited the question according to your comments. I would appreciate a real answer. – rnso Dec 18 '14 at 05:03
  • Sorry, but I don't work in this field and I think your question is too elusive to answer. Comments exist because people may be unable or unwilling to answer but nevertheless have something to say. I don't write answers to order. – Nick Cox Dec 18 '14 at 10:21

2 Answers


There is a large literature on growth curves. In my mind there are three "top" approaches. In all three, time (here, age) is modeled as a restricted cubic spline with a sufficient number of knots (e.g., 6). This is a parametric smoother with excellent performance and easy interpretation.

  1. Classical growth curve models (generalized least squares) for longitudinal data with a sensible correlation pattern such as continuous-time AR1. If you can show that residuals are Gaussian you can get MLEs of the quantiles using the estimated means and the common standard deviation.
  2. Quantile regression. This is not statistically efficient unless $n$ is large. Even though precision is not optimal, the method makes minimal assumptions (because estimates for one quantile are not connected to estimates for a different quantile) and is unbiased (see the R sketch after this list).
  3. Ordinal regression. This treats continuous $Y$ as ordinal in order to be robust, using semi-parametric models such as the proportional odds model. From ordinal models you can estimate the mean and any quantiles, the latter only if $Y$ is continuous.
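
A rough R sketch of these approaches (not part of the original answer; it assumes a data frame `d` with columns `age` and `y` as in the question, and the quantreg/rms calls should be checked against the package documentation):

```r
library(splines)   # ns(): natural (restricted) cubic splines
library(quantreg)  # rq(): quantile regression
library(rms)       # orm(), rcs(), Quantile(), Predict()

# 1. (Only if children are measured repeatedly over time) GLS with continuous-time AR1:
# library(nlme)
# gls_fit <- gls(y ~ ns(age, 4), data = d,
#                correlation = corCAR1(form = ~ age | id))

# 2. Quantile regression: one fit per percentile, age as a spline
ages <- data.frame(age = 5:15)
taus  <- c(0.90, 0.95, 0.99)
qr_fits   <- lapply(taus, function(tau) rq(y ~ ns(age, 4), tau = tau, data = d))
qr_curves <- sapply(qr_fits, predict, newdata = ages)   # 11 x 3 table of percentiles

# 3. Ordinal (proportional odds) regression on continuous y; quantiles from the fitted CDF
dd <- datadist(d); options(datadist = "dd")
orm_fit <- orm(y ~ rcs(age, 4), data = d)
qu <- Quantile(orm_fit)   # quantile-function generator for orm fits
p90_curve <- Predict(orm_fit, age = 5:15, fun = function(lp) qu(0.90, lp))
```

The matrix `qr_curves` (and the analogous `orm` predictions) already provides the smoothed percentile-by-age table that the question asks for.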
Frank Harrell
  • When you have used proportional odds, how did you accommodate the PO assumption (assuming it failed) with so many levels of the outcome? Thanks. – julieth Jan 03 '15 at 18:59
  • Even if it fails, the model may perform better than some of the other models because of fewer assumptions overall. Or switch to one of the other ordinal models in the cumulative probability family, such as proportional hazards (log-log cumulative probability link). – Frank Harrell Jan 03 '15 at 22:10

Gaussian process regression. Start with the squared exponential kernel and try to tune the parameters by eye. Later, if you want to do things properly, experiment with different kernels and use the marginal likelihood to optimize the parameters.

If you want more detail than the tutorial linked above provides, this book is great.
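
A minimal, hand-rolled R sketch of this idea (not from the original answer; the data frame `d` with columns `age` and `y`, the length-scale, and the noise variance are all illustrative assumptions, with hyperparameters simply set "by eye" as suggested above):

```r
# Squared exponential (RBF) kernel between two vectors of ages
sq_exp <- function(a, b, ell, sf2) sf2 * exp(-0.5 * outer(a, b, "-")^2 / ell^2)

# Exact GP regression: posterior mean and predictive sd (including observation noise)
gp_predict <- function(x, y, x_new, ell = 2, sf2 = var(y), noise = var(y) / 2) {
  K     <- sq_exp(x, x, ell, sf2) + diag(noise, length(x))
  Ks    <- sq_exp(x_new, x, ell, sf2)
  Kss   <- sq_exp(x_new, x_new, ell, sf2)
  alpha <- solve(K, y)
  mu    <- Ks %*% alpha
  covm  <- Kss - Ks %*% solve(K, t(Ks))
  list(mean = drop(mu), sd = sqrt(pmax(diag(covm) + noise, 0)))
}

# With n ~ 8000 the n x n solve is slow, so this sketch uses a random subsample
sub  <- d[sample(nrow(d), 1000), ]
pred <- gp_predict(sub$age, sub$y, x_new = 5:15)
p90  <- pred$mean + qnorm(0.90) * pred$sd   # likewise qnorm(0.95), qnorm(0.99)
```

The percentile curves then come from mean + qnorm(p) * sd at each age, which is the same normality-based construction discussed in the comments below.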

Andy Jones
  • Thanks for your answer. How do you rate Gaussian process regression compared to the other methods mentioned? The second Gaussian plot on http://scikit-learn.org/0.11/auto_examples/gaussian_process/plot_gp_regression.html appears very similar to the second-to-last plot on this LOESS (local regression) page: http://princeofslides.blogspot.in/2011/05/sab-r-metrics-basics-of-loess.html . LOESS is much easier to perform. – rnso Dec 17 '14 at 11:12
  • Personally, I strongly prefer GPR for any dataset that's small enough to let you fit it. As well as being much "nicer" from a theoretical perspective, it's more flexible, robust, and gives well-calibrated probabilistic output. Having said all that, if your data is dense and well-behaved, then your audience probably won't be able to tell the difference between LOESS and a GPR unless they're statisticians. – Andy Jones Dec 17 '14 at 11:44
  • I tried to use the 'basic introductory example' on http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_regression.html but found even that to be too complex for me. Can you show how a 90th percentile curve can be generated from the example code on this page: http://stackoverflow.com/questions/27507448/getting-percentile-values-from-gamlss-centile-curves – rnso Dec 17 '14 at 11:52
  • I can't see that this answer addresses the specific features of wanting percentile curves. The question is emphatically not asking "what's a good way to smooth $y$ as a function of $x$?" – Nick Cox Dec 17 '14 at 12:33
  • @Nick: My intended advice was to construct a model of your data and then use the model to construct the (smooth) percentile curves. Now that you've mentioned it, yes, I completely missed the second component (i.e. the actual question). – Andy Jones Dec 17 '14 at 12:45
  • @rnso: The "noisy case" in the scikit-learn sample code does pretty much exactly what you need. The only things you need to change are the data it's fed (the definitions of `X` and `y`), the parameters of the model, and the constants $\pm 1.96$ in the plotting code, which correspond to how many standard deviations away from the mean you want to draw your error limits. The 1.96 used there corresponds to a central 95% interval under normality. – Andy Jones Dec 17 '14 at 12:47
  • Using $1.96$ for making such limits is a very strong assumption (based on Normality) that in fact may be violated by growth curves. – whuber Jan 02 '15 at 17:00