
Is there a systematic method in logistic regression for transforming the independent variables, in order to conclude that the best possible logistic regression model has been fitted?

Illustration of my question:

I have two independent variables (strength and minute, both on a scale of 0 to 100), and the dependent variable is a binary outcome: win or lose.

I have done a logistic regression (in R with glm) to estimate the probabilities.
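For reference, a minimal sketch of that fit; the data frame games and the column names (strength, minute, and a 0/1 outcome win) are placeholder names, not my actual data:

    # Minimal sketch: logistic regression of a binary outcome on the two
    # predictors; 'games' and its column names are placeholders.
    fit <- glm(win ~ strength + minute,
               data   = games,
               family = binomial(link = "logit"))

    # predicted win probabilities, on the same 0-1 scale as the pivot table
    games$p_hat <- predict(fit, type = "response")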

Since I already have the probabilities (based on an interpolated pivot table from a very large dataset; see the picture below),

[Figure: heatmap of win probabilities by strength and minute, interpolated from the pivot table]

I can compare these with the probabilities from the logistic regression; see the picture below:

[Figure: heatmap of win probabilities by strength and minute, predicted by the logistic regression]

I conclude that the logistic regression does not capture the non-linearity. For that reason I have added several transformations of the independent variables, such as an interaction term and 2nd-, 3rd- and 4th-power terms.
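A sketch of such an expanded model, reusing the placeholder names from above; poly(..., raw = TRUE) adds the 2nd- through 4th-power terms, and strength:minute the interaction:

    # Expanded model with polynomial and interaction terms
    # (placeholder data frame 'games' as above)
    fit_poly <- glm(win ~ poly(strength, 4, raw = TRUE) +
                          poly(minute,  4, raw = TRUE) +
                          strength:minute,
                    data   = games,
                    family = binomial)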

These additions result in a slight improvement, measured by AUC after splitting the dataset and performing cross-validation, although visually the 4th-power fit (red) is a deterioration compared to the 3rd-power fit (green); see the picture below, which shows one vector of the matrix (strength 50) subtracted from the corresponding vector of the pivot table.

[Figure: difference between the model probabilities and the pivot-table probabilities at strength 50, for the 3rd-power (green) and 4th-power (red) models]
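A sketch of how such an AUC comparison on held-out data might look, using a single train/test split for brevity (I used cross-validation) and assuming the pROC package for the AUC computation; all names are placeholders as above:

    # Sketch: compare AUC of the plain and the polynomial model on a
    # held-out test set (single split for brevity; cross-validation
    # would repeat this over folds). Requires the pROC package.
    library(pROC)

    set.seed(1)
    idx   <- sample(nrow(games), 0.7 * nrow(games))
    train <- games[idx, ]
    test  <- games[-idx, ]

    m1 <- glm(win ~ strength + minute, data = train, family = binomial)
    m2 <- glm(win ~ poly(strength, 3, raw = TRUE) +
                    poly(minute,  3, raw = TRUE) +
                    strength:minute,
              data = train, family = binomial)

    auc(test$win, predict(m1, test, type = "response"))
    auc(test$win, predict(m2, test, type = "response"))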

Do I have to conclude that logistic regression is not flexible enough for my data, or is further improvement possible with other transformations? In that case, is there a systematic, structured method for choosing those transformations, so that I can conclude that the best model logistic regression can fit has been found?

PS: if there are any incorrect steps or conclusions above, please let me know.

Choosing between transformations in logistic regression

Marcel

1 Answer


There are lots of ways of transforming variables. There are also some caveats: complex transformations may lack substantive meaning, and it is easy to overfit the data. The second of these mandates (or at least strongly suggests) using separate training and test data sets.

You can use a spline of the independent variable. There are many choices of type of spline; in my experience they tend to give similar results, but some experts feel strongly about one or another type of spline.
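For example, a sketch with natural cubic splines from the splines package (shipped with base R); the games data frame and column names are placeholders carried over from the question, and df controls the flexibility:

    # Sketch: natural cubic splines for both predictors, plus their
    # interaction; larger df means more flexibility (and more risk
    # of overfitting)
    library(splines)

    fit_spline <- glm(win ~ ns(strength, df = 4) * ns(minute, df = 4),
                      data   = games,
                      family = binomial)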

There is also a method called optimal scaling. I've only used this in SAS, but I am sure it exists in some R package.
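One candidate in R, though whether it matches SAS's optimal scaling exactly is an assumption, is ACE (alternating conditional expectations) in the acepack package; a rough sketch:

    # Rough sketch: ACE estimates transformations of the predictors that
    # maximize the correlation with (a transformation of) the response.
    # Treating this as a stand-in for SAS-style optimal scaling is an
    # assumption; 'games' is the placeholder data frame from the question.
    library(acepack)

    a <- ace(x = as.matrix(games[, c("strength", "minute")]),
             y = games$win)

    # a$tx holds the transformed predictors; plot to inspect the shape
    plot(games$strength, a$tx[, 1])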

Then you could consider a classification tree, which allows some very complex transformations of the IVs.
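A sketch with rpart, again with the placeholder names from the question:

    # Sketch: classification tree; its recursive splits act like very
    # flexible, data-driven transformations of the IVs
    library(rpart)

    fit_tree <- rpart(factor(win) ~ strength + minute,
                      data   = games,
                      method = "class")

    # class probabilities comparable to the logistic-regression output
    p_tree <- predict(fit_tree, games, type = "prob")[, "1"]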

I'm not sure there's any way to guarantee a "best" model since it's not clear what the universe of models should include.

Peter Flom
  • Thank you Peter. I suppose I can conclude that the risk of overfitting is limited in this case, since the first heatmap is very smooth and gradual? The reason I try to fit a model by LR (which I already have via the pivot table) is that afterwards I intend to add other variables with the step method (which is not possible with the pivot table). Are all 3 methods you suggest viable as a basis in that case? With "best", I mean the best LR model. Currently I have tried several transformations, but how do I know there is no transformation left that would improve the results? – Marcel Jan 12 '19 at 15:33
  • The step method of variable selection is terrible (if you mean what I think you mean: "stepwise", composed of "forward" and "backward" selection); that's been discussed a lot here. However, you can definitely have one variable as a spline and have other variables too. – Peter Flom Jan 13 '19 at 11:23