Suppose we want to fit a logistic regression on a binary outcome y, and we have a limited set of continuous independent variables x1, x2, ..., xn. None of these independent variables in their raw form has a clearly linear relationship with the outcome probability, yet there might be some polynomial f(x) that could represent this relationship.
Note: I'm trying to find an alternative to binning, which is my current approach and is deeply flawed by its very nature.
So the idea is that there exists some transformation into polynomial form that achieves the best linear fit (via OLS) of each independent variable, taken one at a time, against the probability of the outcome y. This is done by exploring a finite solution grid for the equation (it might as well be a more complex polynomial)

p(y) = b0 + b1*x^a1 + b2*x^a2 + b3*a3*log(x) + e

where a1 and a2 are in the range [-4, 4] and a3 is either 0 or 1 (toggling the log term on or off). For every independent variable we get a grid of solutions ranked by, let's say, R-squared, and the winner defines the transformation applied to that particular variable.
Since we've now artificially increased the independent variable count and forced certain functional shapes onto the variables, we're most likely exposing ourselves to a multicollinearity problem. So running a VIF routine and weeding out the most correlated variables before actually plugging them into the logistic regression might be a good idea.
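As a rough illustration, here is a minimal sketch of such a screening step. The function name vif_screen, the assumption that the outcome y and the transformed predictors sit in a plain data.frame dat, and the cut-off of 5 are all my own choices for the sketch, not an established recipe. It relies on car::vif() applied to an auxiliary linear model:

require(car)

# Minimal sketch: iteratively drop the predictor with the largest VIF
# until every remaining VIF falls below `threshold`. The outcome in the
# auxiliary lm() only provides a model object for vif(); the VIFs
# themselves depend solely on the predictors.
vif_screen <- function(dat, outcome = "y", threshold = 5) {
  repeat {
    preds <- setdiff(names(dat), outcome)
    if (length(preds) < 2) break                  # vif() needs at least two predictors
    fit  <- lm(reformulate(preds, response = outcome), data = dat)
    vifs <- car::vif(fit)
    if (max(vifs) < threshold) break
    worst <- names(which.max(vifs))               # single most collinear predictor
    dat   <- dat[, setdiff(names(dat), worst), drop = FALSE]
  }
  dat
}

Whatever survives this screening would then go into the logistic regression.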
Questions: Most importantly, is it generally a good idea to try to force variables into providing more explanatory strength via polynomial transformations? Like many ideas in statistics, someone has most likely already explored this possibility and perhaps published a paper on this particular topic, so if anyone knows of any, references would be very much appreciated.
Here is an example:
require(data.table); require(ggplot2)

### Data generation ###
set.seed(1)

# Three ranges of x with different success rates (~0.75, ~0.30, ~0.70),
# so the relationship between x and P(y = 1) is clearly non-monotonic
dt <- data.table(y = sample(c(rep(1, 75), rep(0, 25)), 100, replace = TRUE),
                 x = sample(1:10, 100, replace = TRUE))
dt <- rbind(dt,
            data.table(y = sample(c(rep(1, 30), rep(0, 70)), 150, replace = TRUE),
                       x = sample(11:20, 150, replace = TRUE)))
dt <- rbind(dt,
            data.table(y = sample(c(rep(1, 70), rep(0, 30)), 100, replace = TRUE),
                       x = sample(21:30, 100, replace = TRUE)))

# Bin x into roughly nrow(dt)/15 equal-width intervals and compute the
# empirical outcome probability and the median x per bin
dt[, x_clean_cut := cut(x, floor(nrow(dt) / 15))]
dt.prob <- dt[, .(prob = sum(y) / .N, count = .N, med = median(as.numeric(x))), .(x_clean_cut)]

ggplot(dt.prob, aes(x = med, y = prob)) +
  geom_bar(stat = "identity") +
  theme_light()
It is tempting to do some sort of binning/clustering here. Yet I want to avoid that at all costs and try to find a polynomial solution instead:
# Grid of candidate exponents: a1, a2 in [-4, 4] in steps of 0.5, a3 in {0, 1}
alpha.grid <- data.table(expand.grid(seq(-4, 4, .5), seq(-4, 4, .5), c(0, 1)))

# For each (a1, a2, a3), regress the binned probability on med^a1, med^a2
# and a3 * log(med) via OLS and record the R-squared
alpha.grid[, rsq := apply(alpha.grid, 1, function(a) {
  temp <- dt.prob[, .(y = prob, x1 = med^a[1], x2 = med^a[2], x3 = log(med) * a[3])]
  summary(lm(y ~ x1 + x2 + x3, data = temp))$r.squared
})]
And here's what we get:
head(alpha.grid[order(rsq, decreasing = TRUE)])

   Var1 Var2 Var3       rsq
1:  1.0  0.5    1 0.2588864
2:  0.5  1.0    1 0.2588864
3:  1.5  0.5    1 0.2559630
4:  0.5  1.5    1 0.2559630
5:  1.5  1.0    1 0.2520759
6:  1.0  1.5    1 0.2520759
Of course, the order of the power terms doesn't change the outcome (swapping a1 and a2 gives the same fit), and the result is not terribly good. But it is something that a logistic regression will be happy to pick up. So instead of raw x1, this approach suggests using x1^1 + x1^0.5 + log(x1) as a better alternative.
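To make that last step concrete, here is a minimal sketch of feeding the winning transformation into the actual logistic regression on the raw, unbinned dt from above (the column names x_t1, x_t2, x_t3 are just illustrative):

# Create the transformed columns suggested by the grid search
dt[, `:=`(x_t1 = x^1, x_t2 = x^0.5, x_t3 = log(x))]

# Fit the logistic regression on the transformed predictors
fit <- glm(y ~ x_t1 + x_t2 + x_t3, data = dt, family = binomial())
summary(fit)

# Equivalently, I() terms keep the transformation inside the formula
fit2 <- glm(y ~ x + I(x^0.5) + log(x), data = dt, family = binomial())

The VIF screening mentioned earlier would be applied to these transformed columns (across all independent variables) before this fit.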