In credit scoring models, we use Weight of Evidence to create bins for continuous variables and Information Value to screen variables for importance. \begin{align} \text{WoE:} \qquad &\ln \frac{\text{Distr Good}}{\text{Distr Bad}} \cdot 100 \\[10pt] \text{IV:} \qquad &\sum_{i=1}^n \left( \text{Distr Good}_i - \text{Distr Bad}_i \right) \cdot \ln \frac{\text{Distr Good}_i}{\text{Distr Bad}_i} \end{align} Can someone help explain the intuition behind these formulas?

1 Answer
It can be difficult to find sources giving precise definitions and good explanations of these concepts ... there is one R package at CRAN, woe, with a function woe one can check, and I found this paper which at least gives precise definitions. So, assume we have a binary response $Y$ and a grouped predictor $x$. As this seems to be used in credit scoring, the binary outcome is usually called bad or good, but we will also use 0 and 1. Which label is good and which is bad does not matter for the formulas, because they are invariant under switching of the labels. The formulas express a comparison of two distributions: the distribution of $x$-labels among the goods, denoted $g_i/g$, and the distribution among the bads, $b_i/b$ (where $g=\sum_i g_i$ and $b=\sum_i b_i$).
Then we have
$$ \text{woe}_i = \log\left( \frac{g_i/g}{b_i/b} \right) $$
where $i$ indexes the classes defined by $x$. As $\frac{g_i/g}{b_i/b}$ is a ratio of two probabilities, it is a risk ratio (RR). If $\text{woe}_i$ is large and positive, it means that in group $i$ the goods are more frequent than in the full sample (or population, if we have population data); if it is large and negative, the bads are overrepresented. If it is zero, the group has the same good/bad composition as the full sample$^\dagger$.
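As a quick numeric illustration (the counts are made up for the example): suppose group $i$ contains 30 of the $g=100$ goods but only 10 of the $b=100$ bads. Then
$$ \text{woe}_i = \log\frac{30/100}{10/100} = \log 3 \approx 1.10, $$
so the goods are overrepresented in that group; swapping the two counts gives $\log(1/3)\approx -1.10$, and equal shares give exactly zero.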
Then, for the information value:
$$ \text{IV} = \sum_i \left( \frac{g_i}{g}-\frac{b_i}{b} \right)\cdot \text{woe}_i $$
It is not obvious at first glance how to interpret this. It turns out that this is a symmetrized Kullback-Leibler divergence, called the J-divergence (or Jeffreys divergence). Let us show this. Write $p_i, q_i$ for the two distributions. The Kullback-Leibler divergence (see Intuition on the Kullback-Leibler (KL) Divergence) is given by
$$ \DeclareMathOperator{\KL}{KL} \KL(p \,\|\, q)= \sum_i p_i \log\frac{p_i}{q_i} $$
which is nonnegative, but not symmetric. To symmetrize it, take the sum
\begin{align} \KL(p \,\|\, q)+\KL(q \,\|\, p) &=\sum_i p_i \log\frac{p_i}{q_i}+\sum_i q_i \log\frac{q_i}{p_i}\\[8pt] &= \sum_i p_i \log\frac{p_i}{q_i} - \sum_i q_i \log\frac{p_i}{q_i}\\[8pt] &= \sum_i (p_i-q_i) \log\frac{p_i}{q_i} \end{align}
(where we used that $\log x^{-1} =-\log x$), and this is now easily recognized as the information value $\text{IV}$.
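As a sanity check on this identity, here is a small R snippet (the two toy distributions are made up for illustration) that computes the IV both directly and as the symmetrized Kullback-Leibler divergence; the two numbers coincide.
p <- c(0.40, 0.30, 0.20, 0.10)             # made-up distribution of the goods over 4 groups
q <- c(0.10, 0.20, 0.30, 0.40)             # made-up distribution of the bads over the same groups
kl <- function(a, b) sum(a * log(a / b))   # Kullback-Leibler divergence
iv_direct <- sum((p - q) * log(p / q))     # IV as defined above
iv_symkl  <- kl(p, q) + kl(q, p)           # symmetrized KL (Jeffreys divergence)
all.equal(iv_direct, iv_symkl)             # TRUE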
A warning: These concepts seem to be much used in the context of univariate screening of variables to use in logistic regression models. That is generally not a good idea; for a discussion, see How come variables with low information values may be statistically significant in a logistic regression?.
A prototype implementation in R to experiment with:
library(tidyverse)
myWoE <- function(data) {  # data: data frame with columns x (grouped predictor) and y (0/1 response)
  woetab <- data %>%
    group_by(x) %>%
    summarise(total = n(), good = sum(y), bad = sum(1 - y)) %>%
    mutate(gi  = good / sum(good),     # distribution of the goods over the groups
           bi  = bad  / sum(bad),      # distribution of the bads over the groups
           woe = log(gi / bi),         # weight of evidence per group
           iv  = (gi - bi) * woe)      # per-group contribution to the information value
  woetab
}
Some test data:
test <- data.frame(x = rep(1:5, each = 10),
                   y = rep(rep(0:1, each = 5), 5))    # very uninformative: exactly half goods in every group
test2 <- data.frame(x = rep(1:5, each = 20),
                    y = rbinom(5 * 20, size = 1,
                               prob = rep(seq(from = 1, to = 9, length.out = 5) / 10, each = 20)))  # more informative: the good-rate varies from 0.1 to 0.9 across groups
Then run and compare the outputs (not included here):
library(woe)
myWoE(test)
woe::woe(test, "x", FALSE, "y", Bad=0, Good=1, C_Bin=5)
myWoE(test2)
woe::woe(test2, "x", FALSE, "y", Bad=0, Good=1, C_Bin=5)
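The prototype returns the per-group contributions; the total information value is just the sum of the iv column, for example:
sum(myWoE(test)$iv)    # exactly 0 for the uninformative data
sum(myWoE(test2)$iv)   # positive for the more informative data (can be Inf if a group happens to contain only goods or only bads)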
$\dagger$: This definition differs from the one used in information theory, used for instance in this classical book by IJ Good and discussed by CS Peirce in this classic 1878 paper. There is some discussion of that here.
