What are the pros and cons to fit data with simple polynomial regression vs. complicated ODE model?

Question

Suppose we are in a disease outbreak scenario and want to estimate the number of infected people from infection counts observed over time.

Why can't we simply fit the data with a polynomial (or an MLP neural network)?

What are the advantages of using a more complicated model, such as the ODE-based SIR model?

(The attached code and plot show an example of fitting a high-order polynomial (red line) to data generated from an SIR model (black dots); the fit is almost perfect.)

library(deSolve)

# generate data from SIR Model
N <- 1000
init <- c(S = 999, I = 1, R = 0)

SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta * (S/N) * I
    dI <- beta * (S/N) * I - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}
out <- ode(init, seq(1000), func = SIR, parms = c(beta=0.1, gamma=0.01))

# fit with a high-order polynomial (rows 50:300 correspond to times 50 to 300)
d <- as.data.frame(out[50:300, ])
names(d) <- c('time', 'susceptible', 'infected', 'recovered')
poly_fit <- lm(infected ~ poly(time, 15), d)
plot(d$time, d$infected)
lines(d$time, predict(poly_fit, d), col = 'red', lwd = 3)
grid()
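
For reference, the system integrated by the `SIR` function above can be written with the same symbols as the code. Reading $\beta$ as a transmission rate and $\gamma$ as a recovery rate is the usual interpretation, added here as an assumption; $N$ is the fixed population size.

$$
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S}{N} I \\
\frac{dI}{dt} &= \beta \frac{S}{N} I - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
$$

The three right-hand sides sum to zero, so $S + I + R = N$ stays constant over time.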

Polynomial fits provide no insight, no assurance of following biological laws, and no ability to forecast accurately. ODEs hold out the promise of achieving all three of these goals. — whuber, May 07 '20 at 12:02

Haitao Du · Answer 1 · 2020-05-07T04:28:17.257

If we extend the time range just a little, we can see how badly the polynomial fit behaves:

plot(seq(30,320), predict(poly_fit, data.frame(time = seq(30,320))), type='l', 
col='red')
points(d$time, d$infected)
grid()

From a machine learning perspective, we would say the polynomial fit is overfitting.

In the SIR model, the differential equations describe the underlying physical laws and the interactions between variables.
The curve-fitting approach, by contrast, just tries to minimize the loss using many parameters that have no physical meaning. As a result, we get a minimized loss (a near-perfect fit) on the training data, but the fitted function does not describe any of the physics.
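
The contrast can be checked with a rough sketch like the one below. It fits the two SIR parameters by least squares on the same window (times 50 to 300) and compares both models on times outside that window. The helper names (`infected_curve`, `sse`, `rmse`), the use of `optim` on log-parameters, the starting values, and the hold-out times are all assumptions made for illustration, not part of the original answer.

```r
library(deSolve)

N <- 1000
init <- c(S = 999, I = 1, R = 0)
times <- 1:320  # index i of any result vector corresponds to time i

sir <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta * (S / N) * I
    dI <- beta * (S / N) * I - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}

# Infected counts at every time in `times` for a given (beta, gamma)
infected_curve <- function(beta, gamma) {
  as.numeric(ode(init, times, sir, parms = c(beta = beta, gamma = gamma))[, "I"])
}

truth   <- infected_curve(0.1, 0.01)  # same parameters as the question
train   <- 50:300                     # window used for fitting
holdout <- c(30:49, 301:320)          # times just outside the window

# Sum of squared errors on the training window; parameters are optimized
# on the log scale so that beta and gamma stay positive (an assumption).
sse <- function(log_par) {
  pred <- infected_curve(exp(log_par[1]), exp(log_par[2]))
  sum((pred[train] - truth[train])^2)
}
fit <- optim(log(c(0.15, 0.02)), sse)  # hypothetical starting values
est <- setNames(exp(fit$par), c("beta", "gamma"))

# Degree-15 polynomial fitted to the same training window
d <- data.frame(time = train, infected = truth[train])
poly_fit <- lm(infected ~ poly(time, 15), d)

rmse <- function(pred) sqrt(mean((pred - truth[holdout])^2))
sir_rmse  <- rmse(infected_curve(est[["beta"]], est[["gamma"]])[holdout])
poly_rmse <- rmse(predict(poly_fit, data.frame(time = holdout)))

print(est)
print(c(sir = sir_rmse, polynomial = poly_rmse))
```

The SIR fit estimates 2 parameters and the polynomial 16 coefficients. Because the data here are generated by the SIR model itself, this is the most favorable case for the mechanistic fit; with real outbreak data the comparison would be less clear-cut.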

As for pros and cons, the comparison of SIR fitting vs. polynomial fitting is very similar to the discussion of "parametric model vs. non-parametric model".

For example, consider fitting data with a normal distribution versus using kernel density estimation.

If the data really do come from a normal distribution, or mostly satisfy the model assumptions, then fitting a normal distribution is better than non-parametric estimation.
On the other hand, if the data are far from the model assumptions, say they contain a lot of outliers, then non-parametric methods will give better results.
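
A small simulation can make this trade-off concrete. The sketch below compares a fitted normal density with a kernel density estimate by integrated squared error against the true density, once on clean normal samples and once on samples where a fraction of points are outliers. The sample size, the 15% contamination from N(6, 1), the number of replicates, and the helper name `compare_once` are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

grid <- seq(-6, 12, length.out = 512)
step <- grid[2] - grid[1]

# Integrated squared error of both estimators on one simulated sample.
# A fraction `outlier_frac` of points comes from N(6, 1) instead of N(0, 1).
compare_once <- function(n, outlier_frac) {
  x <- ifelse(runif(n) < outlier_frac, rnorm(n, mean = 6), rnorm(n))
  truth <- (1 - outlier_frac) * dnorm(grid) +
    outlier_frac * dnorm(grid, mean = 6)
  # Parametric estimate: normal density with sample mean and sd
  normal_fit <- dnorm(grid, mean = mean(x), sd = sd(x))
  # Non-parametric estimate: kernel density evaluated on the same grid
  kde <- density(x, from = min(grid), to = max(grid), n = length(grid))$y
  c(normal = sum((normal_fit - truth)^2) * step,
    kde    = sum((kde - truth)^2) * step)
}

# Average the errors over 30 replicates of 500 points each
clean        <- rowMeans(replicate(30, compare_once(500, 0)))
contaminated <- rowMeans(replicate(30, compare_once(500, 0.15)))
print(rbind(clean, contaminated))
```

Under these assumptions one would expect the normal fit to have the smaller error on the clean samples and the kernel estimate to do better on the contaminated ones.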

A similar question has been asked:

What's wrong to fit periodic data with polynomials?

One of the answers there still applies here:

Intuitively you want to fit function that (in some sense) looks like your underlying process. This way you'll have the fewest number of parameters to estimate. Say you have a round hole, and need to fit a cork into it. If your cork is square it's harder to fit it well than if the cork were round.

Just small comment on your last comment/question: you can give a look at this paper by JO Ramsay et al. (I think you will find it really interesting...little spoiler: ODEs, piecewise polynomials and regularization together ^_^ ) https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9868.2007.00610.x — Gi_F., Apr 21 '20 at 12:54

doubllle · Answer 2 · 2020-05-01T11:58:46.193

I actually wonder why one would not choose mechanistic modeling if it models the data well. I would always favor an ODE model when it is feasible, that is, for a known system with good observations.

The primary goal of machine learning is to find a model that approximates the underlying patterns in observed data well when we don't have much knowledge about the target system, or when the system has too many entangled parts. This also highlights ML's broader applicability and poorer interpretability compared with mechanistic modeling.

A few words of my understanding about modeling:

Essentially, modeling means abstracting the essentials from “real world” objects or phenomena to build representations of them. Models enable us to investigate ideas and generate scientific hypotheses. To build sensible mechanistic models we need good knowledge of the real system. For instance, if we want to know how fast the enzymes in our stomach catalyze the digestion of the proteins in our food, we need to understand in general how enzymatic reactions work, but we don't need to know how genes encode such enzymes. The well-known Michaelis-Menten equation captures the essential features of the enzymatic reactions in food digestion, and is therefore a good model. On the other hand, a great many factors are involved in forming a protein structure, so ML shows its advantage over mechanistic models in predicting protein structures, especially when we have lots of data at hand.
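
As a small illustration of a mechanistic model with few, interpretable parameters, the sketch below fits the Michaelis-Menten rate law, rate = Vm * conc / (K + conc), to R's built-in `Puromycin` data set using `nls`. The choice of data set, the starting values, and the names `mm_fit`, `Vm`, and `K` are assumptions made for illustration and do not come from the answer.

```r
# Enzyme reaction rates measured at several substrate concentrations
# (treated cells only), from the built-in Puromycin data set
treated <- subset(Puromycin, state == "treated")

# Michaelis-Menten rate law: two parameters, both with a physical reading
#   Vm: maximum reaction rate
#   K:  substrate concentration at which the rate is half of Vm
mm_fit <- nls(rate ~ Vm * conc / (K + conc), data = treated,
              start = c(Vm = 200, K = 0.05))
est <- coef(mm_fit)
print(est)

# Sanity check of the interpretation of K: the predicted rate at conc = K
# should equal Vm / 2
half_rate <- predict(mm_fit, data.frame(conc = est[["K"]]))
print(c(half_rate = unname(half_rate), Vm_over_2 = est[["Vm"]] / 2))
```

Each fitted value can be read directly as a property of the enzyme, which is the kind of interpretability a polynomial coefficient does not offer.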

A mechanistic model has advantages, but it is not always easy to obtain a mechanistic model or to perform the fit, and a mechanistic model might be just as biased if the underlying mechanism is incorrect (e.g. [here](https://stats.stackexchange.com/a/449136/164061)) or oversimplified (like [here](https://stats.stackexchange.com/a/461455/164061)). The mechanistic models used for modeling disease outbreaks contain so many uncertain parameters that they at times effectively become empirical models. — Sextus Empiricus, Apr 30 '20 at 21:19
@SextusEmpiricus I definitely agree with you. Oversimplification of a real system would render a mechanistic model useless. We will need good knowledge of the system to make sensible assumptions such that the model can still capture the essentials of interest. I updated my answer to make it less ambiguous. — doubllle, May 01 '20 at 11:45
I would not say useless, but it would render the model effectively an empirical model (which can still be useful). — Sextus Empiricus, May 01 '20 at 12:27
