
Imagine I would like to correlate blood pressure with a person's weight. Now imagine I use symbolic regression and discover, for example, that log(weight) × 1.2345 × weight^6.7890 produces values that correlate with the real blood pressure values with a Pearson's r of 0.90.

I read "Linear regression effect sizes when using transformed variables", but of course in my case I would be applying a very complex transformation to the real data.

Could I discover something meaningful or is it a biased approach?

(I use R)
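To make the setup concrete, here is a minimal sketch of how the Pearson's r for such an expression could be computed. It is written in Python with only the standard library, purely for illustration; the data are synthetic and made up (a linear weight effect plus noise is an assumption, not real blood pressure data), and `pearson_r` and `candidate` are hypothetical helper names. It also shows one thing that is certain: Pearson's r does not change when the expression is multiplied by a positive constant, so the factor 1.2345 cannot be judged from r alone.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def candidate(weight, scale=1.2345):
    """The expression from the question: log(weight) * scale * weight^6.7890."""
    return math.log(weight) * scale * weight ** 6.7890


# Synthetic data (an assumption for illustration only, not real measurements).
random.seed(0)
weights = [random.uniform(50, 110) for _ in range(200)]
pressure = [90 + 0.5 * w + random.gauss(0, 8) for w in weights]

# Correlation of the transformed weight with the (synthetic) blood pressure.
r_scaled = pearson_r([candidate(w) for w in weights], pressure)

# Same expression without the 1.2345 factor: r is unchanged.
r_unscaled = pearson_r([candidate(w, scale=1.0) for w in weights], pressure)

print(round(r_scaled, 4), round(r_unscaled, 4))
```

Because r ignores positive scaling and shifts of the predictor, a high r says that the shape of the transformation tracks the outcome, not that the fitted constants are right.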

  • what's symbolic regression? – Aksakal Apr 05 '18 at 16:49
  • Here is a very nice explanation: https://en.wikipedia.org/wiki/Symbolic_regression – statisticianwannabe Apr 05 '18 at 17:28
  • The problem with that explanation is that it is too vague to enable us to answer your question. The answer depends on what space of possible arithmetic expressions you have explored as well as how you went about doing it. Most of the software will at least perform cross-validation and some might even test the results against a held-out dataset. In those cases your question is practically answered already. – whuber Apr 05 '18 at 18:22
  • This reminds me of a software package called Eureka, popular in the late 1980s and early 1990s, that would do this kind of thing. – Aksakal Apr 05 '18 at 18:26
  • The proof is in the pudding. When you test your clever equation on new data not involved in the creation of the equation, you will know how well it performs. Hopefully you will do this with a large sample and/or multiple times, to allow a robust test. – rolando2 Apr 05 '18 at 18:44
  • Thanks to you all for your answers. So, if I get it right, do you think that my approach is correct as long as I validate the results using another sample? – statisticianwannabe Apr 05 '18 at 20:56
  • Are you interested in predicting the output given new inputs, or in learning about the nature of the relationship between input/output? If the former, there are many regression techniques that will be much more efficient than symbolic regression. If the latter, there are infinitely many symbolic expressions that fit the data equally well, and you'll have to face some challenges: 1) How to meaningfully choose one of these over the others, 2) How to efficiently find it in the first place (and know you've found it), given the complicated, discrete nature of the search space. – user20160 Apr 06 '18 at 04:30
  • I would use symbolic regression both to predict the output given new inputs and to learn about the nature of the relationship between input/output, just as user20160 wrote. I will study the problem more; thanks to you all again for your interesting advice – statisticianwannabe Apr 07 '18 at 10:37
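
The held-out check suggested in the comments can be sketched as follows. This is a toy stand-in, not a real symbolic regression engine: the "search" is just a small, hypothetical grid of expressions of the form weight^p and log(weight) × weight^p, the data are synthetic (an assumed linear effect plus noise), and all names are made up for illustration. The point is only the workflow: choose the expression on one half of the data, then report Pearson's r on the half that played no part in the choice.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic data (an assumption for illustration only, not real measurements).
random.seed(1)
weights = [random.uniform(50, 110) for _ in range(400)]
pressure = [90 + 0.5 * w + random.gauss(0, 8) for w in weights]

# Random split into a discovery half and a held-out half.
idx = list(range(len(weights)))
random.shuffle(idx)
half = len(idx) // 2
train, test = idx[:half], idx[half:]

# Hypothetical, tiny search space standing in for symbolic regression.
candidates = {}
for p in (0.5, 1.0, 2.0, 4.0, 6.789):
    candidates[f"weight^{p}"] = lambda w, p=p: w ** p
    candidates[f"log(weight) * weight^{p}"] = lambda w, p=p: math.log(w) * w ** p


def score(f, rows):
    """Pearson's r between f(weight) and pressure on the given row indices."""
    return pearson_r([f(weights[i]) for i in rows], [pressure[i] for i in rows])


# Select the expression using the discovery half only.
best_name = max(candidates, key=lambda name: score(candidates[name], train))

# Compare the selection-time correlation with the held-out correlation.
r_train = score(candidates[best_name], train)
r_test = score(candidates[best_name], test)
print(best_name, round(r_train, 3), round(r_test, 3))
```

The r on the discovery half is optimistic because the expression was picked to maximize it; the held-out r is the fairer estimate, and the gap between the two tends to grow with the size of the search space, which is the concern raised above about how the expressions were explored.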
