
Context:

In a question on Mathematics Stack Exchange (Can I build a program), someone has a set of $(x, y)$ points and wants to fit a curve to them: linear, exponential, or logarithmic. The usual method is to pick one of these first (which specifies the model) and then do the statistical calculations.

But what is really wanted is to find the 'best' curve out of linear, exponential or logarithmic.

Ostensibly, one could try all three and choose the best-fitting curve of the three according to the correlation coefficient.
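For concreteness, here is a rough sketch (Python, with invented data and made-up parameter values) of what "try all three" could look like: fit each family by least squares and compare $R^2$ on the same data. The exponential fit is done by regressing $\log y$ on $x$, which is only one of several ways to do it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * np.log(x) + 1.0 + rng.normal(scale=0.1, size=x.size)  # invented data

def r_squared(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

fits = {}
a, b = np.polyfit(x, y, 1)               # linear: y = a*x + b
fits["linear"] = r_squared(y, a * x + b)
a, b = np.polyfit(x, np.log(y), 1)       # exponential: y = exp(a*x + b), needs y > 0
fits["exponential"] = r_squared(y, np.exp(a * x + b))
a, b = np.polyfit(np.log(x), y, 1)       # logarithmic: y = a*log(x) + b
fits["logarithmic"] = r_squared(y, a * np.log(x) + b)

print(max(fits, key=fits.get), fits)
```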

But somehow I'm feeling this is not quite kosher. The generally accepted method is to pick your model first, one of those three (or some other link function), and then calculate the coefficients from the data; picking the best of all three after the fact looks like cherry-picking. But to me, whether you're determining a function or coefficients from the data, it is still the same thing: your procedure is discovering the best... thing (let's say that which function to use is itself another coefficient to be discovered).

Questions:

  • Is it appropriate to choose the best fitting model out of linear, exponential, and logarithmic models, based on a comparison of fit statistics?
  • If so, what is the most appropriate way to do this?
  • If regression helps find parameters (coefficients) in a function, why can't there be a discrete parameter that chooses which of the three curve families the best fit comes from?
Mitch
    I have added the [tag:model-selection] tag for your convenience: linking through it will produce a large number of directly relevant threads. Other tags worth looking at include [tag:aic]. You should eventually discover that the mathematical statement of this problem is missing two essential elements: a description of how and why the points might deviate from a theoretical curve and an indication of the cost of not getting exactly the right curve. Absent those elements, there are many different approaches that can produce different answers, showing that "best" is ill-defined. – whuber Oct 05 '13 at 16:07
  • You could set aside a percentage of your data to validate the models on, and pick the model that fits that validation set best. So in essence you would split your data into three distinct sets: 1. data to train each single model, 2. validation data used to select the best model, and 3. your actual final test data that is not touched. – kleineg Aug 09 '18 at 13:51
  • @kleineg That sounds like the right direction. The choice of model (e.g. between lin/exp/log) is like a single model hyperparameter, and hyperparameters are in some ways just another stage of regular parameters, so stepping through it with separate train/validate/test stages could be generalized. – Mitch Aug 09 '18 at 14:34
  • Relevant: [A subtle way to over-fit](https://www.johndcook.com/blog/2015/03/17/a-subtle-way-to-over-fit/) - choosing between multiple model functions (e.g. exp vs linear vs log) is just another parameter. You could think of it as a hyperparameter (which would need a validation step) or a regular parameter in a complicated combined function (where it would be assessed in a test step). – Mitch Aug 29 '19 at 12:44

4 Answers

  • You might want to check out the free software called Eureqa. It has the specific aim of automating the process of finding both the functional form and the parameters of a given functional relationship.
  • If you are comparing models, with different numbers of parameters, you will generally want to use a measure of fit that penalises models with more parameters. There is a rich literature on which fit measure is most appropriate for model comparison, and issues get more complicated when the models are not nested. I'd be interested to hear what others think is the most suitable model comparison index given your scenario (as a side point, there was recently a discussion on my blog about model comparison indices in the context of comparing models for curve fitting).
  • In my experience, non-linear regression models are used for reasons beyond pure statistical fit to the given data:
    1. Non-linear models make more plausible predictions outside the range of the data
    2. Non-linear models require fewer parameters for equivalent fit
    3. Non-linear regression models are often applied in domains where there is substantial prior research and theory guiding model selection.
Jeromy Anglim

This question is relevant in very diverse domains.

The best model is the one that can predict data points that were not used during parameter estimation. Ideally, one would estimate the model parameters on one subset of the data and evaluate fit performance on another. If you are interested in the details, search for "cross-validation".
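For example, here is a minimal holdout sketch (Python; the data, the split, and the fit_predict helper are all invented for illustration), where each family is estimated on the training points and scored by squared error on the held-out points:

```python
import numpy as np

def fit_predict(kind, x_tr, y_tr, x_te):
    """Least-squares fit of one family on the training set, predictions at x_te."""
    if kind == "linear":
        a, b = np.polyfit(x_tr, y_tr, 1)
        return a * x_te + b
    if kind == "exponential":                    # fit log(y) ~ x, needs y > 0
        a, b = np.polyfit(x_tr, np.log(y_tr), 1)
        return np.exp(a * x_te + b)
    if kind == "logarithmic":                    # fit y ~ log(x), needs x > 0
        a, b = np.polyfit(np.log(x_tr), y_tr, 1)
        return a * np.log(x_te) + b
    raise ValueError(kind)

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 60)
y = np.exp(0.3 * x) + rng.normal(scale=0.2, size=x.size)   # invented data

idx = rng.permutation(x.size)                    # random train/test split
train, test = idx[:40], idx[40:]
for kind in ("linear", "exponential", "logarithmic"):
    pred = fit_predict(kind, x[train], y[train], x[test])
    print(kind, np.mean((y[test] - pred) ** 2))  # held-out mean squared error
```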

So the answer to the first question is "no": you cannot simply take the best-fitting model. Imagine you are fitting a polynomial of degree $N$ to $N$ data points. It will be a perfect fit, because the model will pass exactly through every data point. But this model will not generalize to new data.

When you do not have enough data to carry out cross-validation in a sound manner, you can instead use metrics such as AIC or BIC. These metrics simultaneously penalize the size of the residuals and the number of parameters in your model, at the cost of strong assumptions about the process that generated your data. Because they penalize over-fitting, they can be used as a proxy for out-of-sample performance when selecting a model.
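For a least-squares fit with i.i.d. Gaussian errors, a common form of AIC (up to an additive constant) is $n \log(\mathrm{RSS}/n) + 2k$, where $k$ is the number of fitted parameters. A minimal sketch (the helper name is mine):

```python
import numpy as np

def aic_least_squares(y, yhat, k):
    """AIC for a least-squares fit with k parameters, up to an additive constant."""
    n = y.size
    rss = np.sum((y - yhat) ** 2)
    return n * np.log(rss / n) + 2 * k
```

Note that the three families in the question each have two parameters, so on a fixed dataset the AIC ordering reduces to the RSS ordering; the penalty starts to matter once candidate models have different numbers of parameters. Also, if one family is fit to a transformed response (e.g. $\log y$), its likelihood is not directly comparable without accounting for the transformation.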

bonobo

Since plenty of people routinely explore the fit of various curves to their data, I don't know where your reservations come from. Granted, a quadratic will always fit at least as well as a linear model, and a cubic at least as well as a quadratic, so there are ways to test the statistical significance of adding such a nonlinear term and thus avoid needless complexity. But the basic practice of testing many different forms of a relationship is just good practice. In fact, one might start with a very flexible loess regression to see what kind of curve is most plausible to fit.
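As an illustration of that first look, a small sketch using statsmodels' lowess smoother (the data here are invented); the shape of the smooth against the raw points suggests which simple family is plausible:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 80)
y = 3.0 * np.log(x) + rng.normal(scale=0.4, size=x.size)   # invented data

smooth = lowess(y, x, frac=0.5)          # returns sorted (x, smoothed y) pairs
plt.scatter(x, y, s=10, alpha=0.5)       # raw points
plt.plot(smooth[:, 0], smooth[:, 1], color="red")  # flexible smooth
plt.show()
```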

rolando2
  • Whether the quadratic fits better will depend on how you have operationalised good fit. In particular, if you use a measure of fit that penalises models with more parameters (e.g., AIC), then fit can be worse for quadratic than for linear. – Jeromy Anglim Apr 08 '11 at 03:28
    @rolando, perhaps I am misunderstanding, but, frankly this sort of (unqualified) advice is precisely the kind of thing that, as statisticians, we spend so much time "fighting" against. Particularly, if the OP is interested in anything beyond simple curve fitting, e.g., prediction or inference, it is *very* important to understand the implications of the "just try whatever you can think of" approach to statistics. – cardinal Apr 08 '11 at 12:46
  • @cardinal, I agree with you. The choice of the model should reflect some prior knowledge of the system and terms should generally not be added simply to improve the fit. I do encourage my students to use plots, correlations, etc... as tools to explore their data initially - but like you said with the goal of understanding the results and thus the implications of what they have done. – DQdlM Apr 08 '11 at 15:29
  • @cardinal - I am thinking carefully about your comments. If you care to elaborate I'd like to hear more. – rolando2 Apr 08 '11 at 19:57
  • @cardinal, @rolando2: I'd like some elaboration from both of you. What are the pitfalls or benefits of doing three separate regressions (one linear, and one each with exponentially and logarithmically transformed data) as opposed to, say, one model with six parameters, $a x + b + c \exp(d x) + \log(f x + g)$, or something else entirely? – Mitch Apr 09 '11 at 14:49
  • Cardinal is right. Trying different fits or using loess to guide fits results in improper inference down the road (e.g., confidence bands that are too narrow, statistical tests that don't preserve type I error). When there are no mechanistic reasons to choose a parametric form, fitting a restricted cubic spline (natural spline) to the data, using the number of knots that the sample size supports well, will result in an excellent fit and proper inference (a proper penalty for the number of parameters in the model = the number of visible parameters; there is no need to account for parameters "screened out" by earlier looks). – Frank Harrell May 19 '11 at 15:20
  • I'm having trouble reconciling these comments with the tradition of Anscombe, Tukey, Mosteller, Tufte, and Cleveland, which emphasizes the need to visualize and explore data and to size up the shape of each relationship before building a model, establishing coefficients, or generating other statistics. – rolando2 May 19 '11 at 18:40
    There is a lot of controversy regarding their approaches. An over-simplified way to summarize these issues is that if you want to learn about patterns and make new discoveries that need later validation, exploratory analysis is appropriate. If you want to draw inference (reason from particular sample to general population using P-values, confidence intervals, etc.) then not so much. – Frank Harrell May 21 '11 at 12:34
  • I think I can relate to that. – rolando2 May 25 '11 at 01:08
    This is the most productive comment thread I've seen on CV, especially the exchange b/t rolando2 (3^) & @FrankHarrell. I also find both approaches very appealing. My own resolution is to plan what to test beforehand & *only* fit/test *that* model for the sake of drawing firm conclusions, but also thoroughly explore the data (w/o believing the results necessarily hold) for the sake of discovering what *might* be true & planning for the *next* study. (Should I run another study & check something, would it be interesting/important?) The key is your *beliefs* about the results of these analyses. – gung - Reinstate Monica Feb 18 '12 at 20:38

You really need to find a balance between the science/theory that leads to the data and what the data tell you. As others have said, if you let yourself fit any possible transformation (polynomials of any degree, etc.), then you will end up overfitting and getting something useless.

One way to convince yourself of this is through simulation. Choose one of the models (linear, exponential, log) and generate data that follows it, with some choice of parameters. If the conditional variance of the $y$ values is small relative to the spread of the $x$ variable, then a simple plot will make it obvious which model was chosen and what the "truth" is. But if you choose parameters such that it is not obvious from the plots (probably the case where an analytic solution is of interest), then analyze the data each of the 3 ways and see which gives the "best" fit. I expect you will find that the "best" fit is often not the "true" fit.
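Here is a rough sketch of that simulation (Python; all settings are invented, and the in-sample "best" is judged by residual sum of squares). With noise that is large relative to the curvature differences, the winner need not be the true family:

```python
import numpy as np

def rss(y, yhat):
    return np.sum((y - yhat) ** 2)

def best_family(x, y):
    """In-sample winner among the three families, judged by residual sum of squares."""
    scores = {}
    a, b = np.polyfit(x, y, 1)
    scores["linear"] = rss(y, a * x + b)
    a, b = np.polyfit(x, np.log(y), 1)       # needs y > 0
    scores["exponential"] = rss(y, np.exp(a * x + b))
    a, b = np.polyfit(np.log(x), y, 1)
    scores["logarithmic"] = rss(y, a * np.log(x) + b)
    return min(scores, key=scores.get)

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 30)
n_sims, wrong = 500, 0
for _ in range(n_sims):
    y = 0.5 * x + 10.0 + rng.normal(scale=2.0, size=x.size)   # truth: linear
    if best_family(x, y) != "linear":
        wrong += 1
print(f"in-sample 'best' missed the true family in {wrong / n_sims:.0%} of runs")
```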

On the other hand, sometimes we want the data to tell us as much as possible, and we may not have the science/theory to fully determine the nature of the relationship. The original paper by Box and Cox (JRSS B, vol. 26, no. 2, 1964) discusses ways to compare several transformations of the $y$ variable. Their family of transformations has linear and log as special cases (but not exponential), yet nothing in the theory of the paper limits you to that family; the same methodology can be extended to a comparison among the 3 models you are interested in.
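For illustration, SciPy's scipy.stats.boxcox implements a marginal version of this idea: with lmbda left unset, it chooses $\lambda$ by maximum likelihood for the sample itself, whereas the original paper conditions on the regression model. A brief sketch with invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 50)
y = np.exp(0.4 * x + rng.normal(scale=0.1, size=x.size))  # invented positive response

y_transformed, lam = stats.boxcox(y)   # lambda chosen by maximum likelihood
print(f"estimated lambda: {lam:.2f}")  # lambda near 0 would point toward a log transform
```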

Greg Snow