
The simulated data set has 500 observations and 9 independent variables, all continuous; the response variable is also continuous. I am currently at an $R^2$ of 0.965 with 22 variables. So far I have used tools such as bagged decision trees, correlation matrices, splines, and principal component regression (PCR). The only variables I have included are in polynomial or dummy-variable form.

Richard Hardy
  • Does this answer your question? [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless) – Stephan Kolassa Apr 07 '20 at 18:09
  • Hello, I should note that an R squared of the given amount is achievable with this data. Essentially, it is doable; I am mostly seeking advice on any other methods for variable selection. – Nicholasislearningthings Apr 07 '20 at 21:25
  • I am not quite sure what the purpose of achieving a high $R^2$ for a simulated dataset is. If this is training for a specific application domain, then you should have some domain knowledge about likely useful transformations you haven't used yet, which should also have gone into the simulation. Have you looked at interactions, log transforms, or splines? – Stephan Kolassa Apr 08 '20 at 05:37
  • Interactions were quite helpful, and the issue is now resolved! This is my first statistical learning course, which is mostly about learning theory or applied practice problems, so a freelance question was a bit of a wildcard for me. My professor was looking for a creative resolution to the simulated data set based on the skills we have learned. Stack Exchange always adds some useful input on resolving these issues, thanks Stephan! – Nicholasislearningthings Apr 09 '20 at 16:18
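
For anyone landing here with a similar exercise, the interaction suggestion from the comments can be checked with a rough sketch like the one below. It is not the asker's data or final model: the data-generating process, the coefficients, and the helper names `add_interactions` and `holdout_r2` are all made up for illustration. It only shows how appending pairwise products to the design matrix and comparing held-out $R^2$ can reveal whether interactions matter.

```python
from itertools import combinations

import numpy as np


def add_interactions(X):
    """Append all pairwise products x_i * x_j (i < j) to the columns of X."""
    pairs = list(combinations(range(X.shape[1]), 2))
    products = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, products]), pairs


def holdout_r2(X_train, y_train, X_test, y_test):
    """Fit OLS with an intercept on the training split; return R^2 on the test split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ beta
    return 1.0 - resid @ resid / np.sum((y_test - y_test.mean()) ** 2)


# Made-up stand-in for the simulated data: 500 observations, 9 continuous predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
# Hypothetical response with two interaction effects (an assumption, not the real process).
y = (X[:, 0] + 2.0 * X[:, 1] * X[:, 2] - 1.5 * X[:, 3] * X[:, 4]
     + rng.normal(scale=0.5, size=500))

# Simple holdout split: first 400 rows for fitting, last 100 for evaluation.
train, test = np.arange(400), np.arange(400, 500)
X_int, pairs = add_interactions(X)

r2_main = holdout_r2(X[train], y[train], X[test], y[test])
r2_int = holdout_r2(X_int[train], y[train], X_int[test], y[test])
print(f"held-out R^2, main effects only: {r2_main:.3f}")
print(f"held-out R^2, with pairwise interactions: {r2_int:.3f}")
```

Scoring on held-out rows rather than the training rows matters here, because in-sample $R^2$ never decreases as columns are added to a least-squares fit, so it cannot tell useful interactions from noise.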

0 Answers