Suppose we want to fit a logistic regression on a binary outcome y, and we have a limited set of continuous independent variables x1, x2, ..., xn. None of these independent variables in their raw form has a clearly linear relationship with the outcome probability, yet there might be some polynomial f(x) that could represent this relationship.
Note: I'm trying to find an alternative to binning, which is my current approach and is deeply flawed by its very nature.
So the idea is that there exists some transformation into polynomial form that achieves the best linear fit (via OLS) of each independent variable, taken one at a time, against the probability of the outcome y. This is done by exploring a finite solution grid for the equation (it might as well be a more complex polynomial)

p(y) = b0 + b1*x^a1 + b2*x^a2 + b3*a3*log(x) + e

where a1 and a2 are in the range [-4, 4] and a3 is either 0 or 1 (toggling the log term on or off). For every independent variable we get a grid of solutions ranked by, let's say, R-squared, and the winner defines the transformation applied to that particular variable.
Since we've now artificially increased the independent variable count and forced certain functional shapes onto the variables, we're most likely exposing ourselves to a multicollinearity problem. So running a VIF routine and weeding out the most correlated variables before actually plugging them into the logistic regression might be a good idea.
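As a rough illustration, here is a minimal sketch of such a screening step. The function name vif_screen, the assumption that the outcome y and the transformed predictors sit in a plain data.frame dat, and the cut-off of 5 are all my own choices for the sketch, not an established recipe. It relies on car::vif() applied to an auxiliary linear model:

require(car)

# Minimal sketch: iteratively drop the predictor with the largest VIF
# until every remaining VIF falls below `threshold`. The outcome in the
# auxiliary lm() only provides a model object for vif(); the VIFs
# themselves depend solely on the predictors.
vif_screen <- function(dat, outcome = "y", threshold = 5) {
  repeat {
    preds <- setdiff(names(dat), outcome)
    if (length(preds) < 2) break                  # vif() needs at least two predictors
    fit  <- lm(reformulate(preds, response = outcome), data = dat)
    vifs <- car::vif(fit)
    if (max(vifs) < threshold) break
    worst <- names(which.max(vifs))               # single most collinear predictor
    dat   <- dat[, setdiff(names(dat), worst), drop = FALSE]
  }
  dat
}

Whatever survives this screening would then go into the logistic regression.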
Questions: Most importantly, is it generally a good idea to try to force variables into providing more explanatory strength via polynomial transformations? Like many ideas in statistics, someone has most likely already explored this possibility and perhaps published a paper on this particular topic, so if anyone knows of any, references would be very much appreciated.
Here is an example:
require(data.table); require(ggplot2)

### Data generation ###
set.seed(1)

# Three ranges of x with different success rates (~0.75, ~0.30, ~0.70),
# so the relationship between x and P(y = 1) is clearly non-monotonic
dt <- data.table(y = sample(c(rep(1, 75), rep(0, 25)), 100, replace = TRUE),
                 x = sample(1:10, 100, replace = TRUE))
dt <- rbind(dt,
            data.table(y = sample(c(rep(1, 30), rep(0, 70)), 150, replace = TRUE),
                       x = sample(11:20, 150, replace = TRUE)))
dt <- rbind(dt,
            data.table(y = sample(c(rep(1, 70), rep(0, 30)), 100, replace = TRUE),
                       x = sample(21:30, 100, replace = TRUE)))

# Bin x into roughly nrow(dt)/15 equal-width intervals and compute the
# empirical outcome probability and the median x per bin
dt[, x_clean_cut := cut(x, floor(nrow(dt) / 15))]
dt.prob <- dt[, .(prob = sum(y) / .N, count = .N, med = median(as.numeric(x))), .(x_clean_cut)]

ggplot(dt.prob, aes(x = med, y = prob)) +
  geom_bar(stat = "identity") +
  theme_light()
It is tempting to do some sort of binning/clustering here. Yet I want to avoid that at all costs and try to find a polynomial solution instead:
# Grid of candidate exponents: a1, a2 in [-4, 4] in steps of 0.5, a3 in {0, 1}
alpha.grid <- data.table(expand.grid(seq(-4, 4, .5), seq(-4, 4, .5), c(0, 1)))

# For each (a1, a2, a3), regress the binned probability on med^a1, med^a2
# and a3 * log(med) via OLS and record the R-squared
alpha.grid[, rsq := apply(alpha.grid, 1, function(a) {
  temp <- dt.prob[, .(y = prob, x1 = med^a[1], x2 = med^a[2], x3 = log(med) * a[3])]
  summary(lm(y ~ x1 + x2 + x3, data = temp))$r.squared
})]
And here's what we get:
head(alpha.grid[order(rsq, decreasing = TRUE)])

   Var1 Var2 Var3       rsq
1:  1.0  0.5    1 0.2588864
2:  0.5  1.0    1 0.2588864
3:  1.5  0.5    1 0.2559630
4:  0.5  1.5    1 0.2559630
5:  1.5  1.0    1 0.2520759
6:  1.0  1.5    1 0.2520759
Of course, the order of the power terms doesn't change the outcome (swapping a1 and a2 gives the same fit), and the result is not terribly good. But it is something that a logistic regression will be happy to pick up. So instead of raw x1, this approach suggests using x1^1 + x1^0.5 + log(x1) as a better alternative.
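To make that last step concrete, here is a minimal sketch of feeding the winning transformation into the actual logistic regression on the raw, unbinned dt from above (the column names x_t1, x_t2, x_t3 are just illustrative):

# Create the transformed columns suggested by the grid search
dt[, `:=`(x_t1 = x^1, x_t2 = x^0.5, x_t3 = log(x))]

# Fit the logistic regression on the transformed predictors
fit <- glm(y ~ x_t1 + x_t2 + x_t3, data = dt, family = binomial())
summary(fit)

# Equivalently, I() terms keep the transformation inside the formula
fit2 <- glm(y ~ x + I(x^0.5) + log(x), data = dt, family = binomial())

The VIF screening mentioned earlier would be applied to these transformed columns (across all independent variables) before this fit.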