Overall, I'd like to be able to say that, for the logistic prediction for a given row, ColA was more influential than ColB in driving up the resulting probability (i.e., y_hat, as it's usually defined for logistic regression). But is this possible? Some data scientists I've talked to say yes, but I've also seen push-back.
From what I've read, it seems that GLMs make it easiest to get at per-row variable importance (see this limited discussion on logit in particular, including push-back). But can they actually do it?
If B1 and B2 are the coefficients and X1 and X2 are our features, it would seem that whenever B1*X1 is greater than B2*X2, the B1*X1 term drives the resulting probability towards 1 more than the B2*X2 term does, since the terms add on the log-odds scale and the logistic link is monotone increasing. Here's an example (which brings in a factor column, for a fuller treatment).
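To make that reasoning concrete, here is a minimal sketch with hypothetical coefficients and a single hypothetical row (none of these numbers come from the model below):
# Logistic model: log-odds = B0 + B1*X1 + B2*X2, and y_hat = plogis(log-odds).
# Each Bk*Xk is an additive contribution on the log-odds scale, and plogis()
# is monotone increasing, so the larger contribution pushes y_hat up more.
b0 <- -0.5; b1 <- 2.0; b2 <- -1.0   # hypothetical coefficients
x1 <- 0.8;  x2 <- 1                 # one hypothetical row
contrib <- c(X1 = b1 * x1, X2 = b2 * x2)
contrib                             # per-feature log-odds contributions
plogis(b0 + sum(contrib))           # the row's predicted probability (y_hat)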
We create features X1 and X2, where X1 is random and X2 (I think we can agree) has a large positive impact on y:
set.seed(33)
X1 <- runif(10, 0.0, 1.0)
X2 <- c(1,0,1,0,1,0,1,0,1,0)
y <- c(1,1,1,0,1,0,1,0,1,0)
df <- data.frame(X1,X2,y)
dforig <- df # Keep a numeric copy, because the multiplication below doesn't work with factor columns
df$X2 <- as.factor(df$X2)
Now we create the logit model:
fit.logit <- glm(
  formula = y ~ .,
  data = df,
  family = binomial(link = "logit"))
                     X1       X21
Coefficients:   -1.2353   22.0041
Wald statistic:  -0.267     0.003
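(If it helps reproduce the table above, the estimates and Wald z statistics can be pulled from the model summary; a sketch, assuming fit.logit as fitted above:)
# Estimates and Wald z statistics from the fitted model
coef_tab <- summary(fit.logit)$coefficients
coef_tab[, c("Estimate", "z value")]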
Now we multiply B1 and B2 by X1 and X2, respectively, and print the results:
coefftemp <- fit.logit$coefficients
coefficients <- coefftemp[2:length(coefftemp)] # drop the intercept
multiply_res <- sweep(dforig[,1:2], 2, coefficients, `*`)
multiply_res
X1 X2
1 -0.55087679 22.00411
2 -0.48751729 0.00000
3 -0.59755734 22.00411
4 -1.13510089 0.00000
5 -1.04245907 22.00411
6 -0.63908954 0.00000
7 -0.53998690 22.00411
8 -0.42395777 0.00000
9 -0.01916833 22.00411
10 -0.14575621 0.00000
We see that in the rows where X2 = 1, B2*X2 (i.e., the second column) is much higher than B1*X1 (i.e., the first column). So it would seem that, for those rows, X2 is the dominant feature driving the resulting prediction towards 1.
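One sanity check on this reading (a sketch, assuming fit.logit and multiply_res as defined above): the per-row products plus the intercept should reproduce each row's fitted log-odds exactly, since the model is additive on that scale, and the larger column per row then marks the "dominant" feature.
# Reconstruct each row's linear predictor (log-odds) from the contributions
lin_pred <- fit.logit$coefficients[1] + rowSums(multiply_res)
all.equal(unname(lin_pred), unname(predict(fit.logit)))  # should be TRUE (up to floating-point error)
plogis(lin_pred)                                         # the y_hat values

# Per-row dominant feature: the larger log-odds contribution
colnames(multiply_res)[max.col(multiply_res, ties.method = "first")]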
If one reverses the dependency of y on X2 by swapping the zeros and ones in X2, then after doing the multiplication, B2*X2 is much lower than B1*X1 in the rows where X2 = 1, which makes sense (since X2 now pushes y_hat towards 0 when X2 = 1). Thus, for these rows, X1 is actually more "responsible" for driving y_hat towards 1. (Note that if both contributions are negative, the least negative one is the feature more responsible for y_hat being as high as it is.) Because of this, it would seem that this method of per-row feature ranking still works (see the ranking sketch after the reversed-case output below). What am I missing?
In case it helps, the code for the latter (reversed dependency) case is below:
# Reverse y dependency on X2
set.seed(33)
X1 <- runif(10, 0.0, 1.0)
X2 <- c(0,1,0,1,0,1,0,1,0,1)
y <- c(1,1,1,0,1,0,1,0,1,0)
df <- data.frame(X1,X2,y)
dforig <- df # Keep a numeric copy, because the multiplication below doesn't work with factor columns
df$X2 <- as.factor(df$X2)
fit.logit <- glm(
  formula = y ~ .,
  data = df,
  family = binomial(link = "logit"))
                     X1       X21
Coefficients:    -1.235   -22.004
Wald statistic:  -0.267    -0.003
coefftemp <- fit.logit$coefficients
coefficients <- coefftemp[2:length(coefftemp)] # drop intercept
multiply_res <- sweep(dforig[,1:2], 2, coefficients, `*`)
multiply_res
X1 X2
1 -0.55087679 0.00000
2 -0.48751729 -22.00411
3 -0.59755734 0.00000
4 -1.13510089 -22.00411
5 -1.04245907 0.00000
6 -0.63908954 -22.00411
7 -0.53998690 0.00000
8 -0.42395777 -22.00411
9 -0.01916833 0.00000
10 -0.14575621 -22.00411
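Applying the per-row ranking idea to this reversed case (a sketch; rank_toward_one is a hypothetical helper, not from any package) orders the features by their log-odds contribution, largest first, so the first name in each row is the one most responsible for pushing y_hat toward 1:
# For each row, order features by contribution, largest (least negative) first
rank_toward_one <- function(contribs) {
  t(apply(contribs, 1, function(row) names(sort(row, decreasing = TRUE))))
}
rank_toward_one(multiply_res)   # X1 comes first in the rows where X2 = 1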
Overall, for logistic regression, can we accurately say (for example) that feature A drives y_hat toward 1 more than feature B for an individual prediction?
Thanks, all!