
I have been searching the internet for a generalized method for doing regression analysis on nonlinear data. My model can be represented as

$$Y = \beta_0f(X_0) + \beta_1g(X_1) + ... + \beta_nz(X_n) + \varepsilon$$

where I don't have any idea what $f(), g(), z()$ are. But I can restrict myself to a domain, saying that

$$f(), g(), z(), \varepsilon \in [\sin(), \log(), x^2, x^3, 1/x, e^x, x] $$

Please forgive me for any terminology mistakes; I mean that each of $f(), g(), z()$ can be one of the functions given in that set.

I've read that once we know the equation, in certain cases we can linearize it so that the problem becomes a linear regression. Is there no way to do a regression analysis for this form without knowing the equation itself?

I'm a better programmer than a statistician and so I'm not averse to taking an iterative approach substituting the functions in each stage as long as someone can please guide me through the iterative process.

Further, isn't this model more frequently encountered in real life? I haven't seen any examples of this at all on the web.

gung - Reinstate Monica
  • Did you mean to include the error term ($\varepsilon$) as something that could take one of the functions? – gung - Reinstate Monica Nov 09 '14 at 16:32
  • What is the nature of your response variable, Y? – gung - Reinstate Monica Nov 09 '14 at 17:12
  • As a matter of terminology--which can help both in searching for information and interpreting it correctly once you find it--your situation is neither "generalized" nor "nonlinear." $Y$ is explicitly a *linear* function of the parameters $\beta_i$; this is what makes it a *linear* model. A "generalized" model would make specific assumptions about the distributional family of $\varepsilon$; these are usually called [GLMs](http://stats.stackexchange.com/search?q=GLM). Your problem seeks *re-expressions* of the *independent variables* in order to create a linear relationship. – whuber Nov 09 '14 at 17:44
  • On this site there are *loads* of questions about this, with examples and much discussion of when to use alternative approaches. Try searching our site: http://stats.stackexchange.com/search?q=transform+independent+variable. – whuber Nov 09 '14 at 17:46
  • you might want to investigate additive models. – Glen_b Nov 09 '14 at 23:35
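
To make the terminology point in the comments concrete: once the transformations are fixed, the model is linear in the $\beta_i$, so ordinary least squares applies directly to the transformed columns. The following is only a minimal sketch on simulated data; the choice of $\sin$ and $\log$, the coefficient values, and all variable names are assumptions made for illustration, not anything taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors, kept positive so that log() is defined.
x0 = rng.uniform(1.0, 5.0, n)
x1 = rng.uniform(1.0, 5.0, n)

# Hypothetical "true" model: Y = 2*sin(X0) + 3*log(X1) + noise.
y = 2.0 * np.sin(x0) + 3.0 * np.log(x1) + rng.normal(0.0, 0.1, n)

# Design matrix of the *transformed* predictors (no intercept,
# matching the form of the model written in the question).
X = np.column_stack([np.sin(x0), np.log(x1)])

# Ordinary least squares: the problem is linear in the coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta_hat)
```

The estimates should land close to the simulated values of 2 and 3. The hard part of the question is therefore choosing the transformations, not fitting the coefficients.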

1 Answer


Scrolling through all those possibilities for all those variables and all combinations will lead to a combinatorial explosion. In addition, your final model will be pretty much guaranteed to be overfitted. Instead, you should fit a model with sufficient flexibility to mimic whatever function happens to obtain. That is, you should use spline functions for each variable. Then you build a multivariable model based on these (see MARS). I also discussed splines here: What are the advantages / disadvantages of using splines, smoothed splines and Gaussian process emulators?
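
As a rough illustration of the "splines for each variable" idea, here is a minimal sketch that builds a cubic spline basis per predictor and fits an additive model by least squares. Note that this is not MARS: it uses a fixed-knot truncated-power basis with knots placed at quartiles, which is an assumption chosen for brevity. The simulated data, the helper name `cubic_spline_basis`, and the knot placement are all hypothetical.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power cubic basis: x, x^2, x^3, and (x - k)_+^3 per knot."""
    cols = [x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
n = 400
x0 = rng.uniform(0.5, 5.0, n)
x1 = rng.uniform(0.5, 5.0, n)

# Hypothetical truth, unknown to the fitting step below.
y = 2.0 * np.sin(x0) + 3.0 * np.log(x1) + rng.normal(0.0, 0.1, n)

# Three interior knots per variable, placed at the quartiles.
knots0 = np.quantile(x0, [0.25, 0.5, 0.75])
knots1 = np.quantile(x1, [0.25, 0.5, 0.75])

# Additive model: intercept + spline basis for x0 + spline basis for x1.
B = np.column_stack([
    np.ones(n),
    cubic_spline_basis(x0, knots0),
    cubic_spline_basis(x1, knots1),
])

coef, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ coef
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print("in-sample R^2:", r2)
```

The fit never needs to be told that the underlying shapes are a sine and a logarithm; the spline terms are flexible enough to follow both. In practice one would more likely use a B-spline basis or a penalized fit, since the truncated-power basis can be poorly conditioned.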

gung - Reinstate Monica
  • Your recommendation may be a little too strong. This "combinatorial explosion" is elegantly and efficiently handled with, say, a genetic algorithm, as implemented in [Eureqa](http://www.nutonian.com/products/eureqa/), for instance. It can be entirely avoided with appropriate EDA, often to great effect. Splines do make sense in settings where regular mathematical laws cannot be expected to apply--in most social science settings for instance. The transformations listed by the OP can be meaningful and even suggested theoretically in other settings, especially scientific and engineering ones. – whuber Nov 09 '14 at 17:50
  • @gung Thanks for your reply! I have used splines previously in vector graphic design. Will look into MARS. If the combinatorial explosion isn't really a problem for me, can I go ahead and apply simple regression even with sine, cosine, and logarithmic terms? – Gaurav Ramanan Nov 10 '14 at 18:01
  • @GauravRamanan, the point of the splines is that they will replicate the correct function whether it is sin, cos, log, etc, & you won't have the overfitting issue from trying lots of different things & picking the best. The overfitting is potentially a problem even if the combinatorics are tractable. – gung - Reinstate Monica Nov 10 '14 at 18:04
  • @gung Thanks a lot for your answer. I did some research on MARS and also played around with the `earth` package in R. Can you explain, or point me to, some simple theoretical foundation for why trying all the combinations will lead to overfitting? – Gaurav Ramanan Dec 06 '14 at 09:19
  • @whuber I'm a stats noob so interesting to note that methods depend on domain and not just the data. I was thinking of a general purpose solution but let us say specifically I start with an econometrics / financial data problem. Do you still think the combinations strategy is a good idea? – Gaurav Ramanan Dec 06 '14 at 09:21
  • @GauravRamanan, for a start, you could try my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). – gung - Reinstate Monica Dec 06 '14 at 14:34