I have data on hospital treatment times and would like to fit a polynomial to them. The data comes in 5-minute increments and is very noisy. It looks like this:
I can aggregate to a higher level (30 minutes and 3 hours):
This makes it smoother, but it also leads to a loss of observations.
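For concreteness, the aggregation step is essentially the following (a minimal sketch with simulated counts; the column names, bin widths, and data-generating process are placeholders for my actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for my data: treatment time relative to the policy
# cutoff in 5-minute bins, with noisy frequency counts.
df = pd.DataFrame({"minutes": np.arange(-720, 725, 5)})
df["count"] = rng.poisson(50, size=len(df))

# Aggregate the 5-minute bins into 30-minute bins by summing counts;
# replacing 30 with 180 gives the 3-hour version.
df["bin_30"] = (df["minutes"] // 30) * 30
agg_30 = df.groupby("bin_30", as_index=False)["count"].sum()
```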
The hump around the zero point is a policy-related distortion and is my main interest. I fit a polynomial excluding this region to identify what the distribution would look like without the distortion. My final fit looks like this:
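In code, the exclusion-and-fit step works roughly like this (a sketch with simulated data; the ±60-minute window and degree 6 are placeholders for my actual choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated 5-minute frequency data with a hump around zero.
x = np.arange(-720, 725, 5).astype(float)
y = 100 * np.exp(-np.abs(x) / 300) + rng.normal(0, 3, x.size)
y[np.abs(x) <= 60] += 30  # stand-in for the policy-related hump

# Fit the polynomial only on bins outside the distorted region, then
# predict over the full support to recover the counterfactual shape.
exclude = np.abs(x) <= 60  # assumed width of the excluded window
p = np.polynomial.Polynomial.fit(x[~exclude], y[~exclude], deg=6)
y_hat = p(x)  # Polynomial.fit rescales the domain for numerical stability
```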
I have the following questions/concerns:
First, I understand that I can smooth the data by aggregating to higher units (i.e. going from 5 minutes to 3 hours), but I also lose precision and observations that way. Is there a way to smooth out the initial 5-minute distribution (without smoothing out the hump around the zero point!) and then do the polynomial fitting? Something like the masked rolling mean sketched below is what I have in mind.
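Here is that sketch (simulated counts again; the rolling window width and hump region are placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Simulated 5-minute counts, as above.
df = pd.DataFrame({"minutes": np.arange(-720, 725, 5)})
df["count"] = rng.poisson(50, size=len(df)).astype(float)

# Mask the hump so the smoother never sees it, take a centered rolling
# mean over the remaining bins, and keep the raw counts inside the hump.
hump = df["minutes"].abs() <= 60   # placeholder for the distorted window
masked = df["count"].where(~hump)  # NaN inside the hump
smoothed = masked.rolling(window=7, center=True, min_periods=1).mean()
df["count_smooth"] = np.where(hump, df["count"], smoothed)
```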
Second, I use 5-fold cross-validation to identify the best-fitting polynomial by minimizing the MSE. According to my CV, the best-fitting (orthogonal) polynomial, i.e. the one with the lowest MSE, has degree 12 and looks like this:
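My selection procedure is roughly the following (a sketch with simulated data; it uses raw, rescaled polynomial features rather than the orthogonal basis I actually use, but the cross-validation logic is the same):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Simulated data, restricted to the bins outside the excluded window.
x = np.arange(-720, 725, 5).astype(float)
y = 100 * np.exp(-np.abs(x) / 300) + rng.normal(0, 3, x.size)
keep = np.abs(x) > 60
X = (x[keep] / np.abs(x).max()).reshape(-1, 1)  # rescale to [-1, 1]
y_fit = y[keep]

# 5-fold CV over candidate degrees, keeping the one with the lowest MSE.
cv_mse = {}
for deg in range(1, 16):
    model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    scores = cross_val_score(model, X, y_fit, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[deg] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
```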
A degree of 12 seems excessively high to me, and I worry that I am overfitting the data. Other papers seem to use degree 6. Is such a high degree a problem? Could it be caused by the low number of observations, or by the fact that I am fitting to frequencies of zero at the ends of the support?
Your help is greatly appreciated.