
I have data with both categorical and continuous variables, and I need to compute the Information Value of each variable during exploratory data analysis.

Why do we calculate the Information Value for each variable at the beginning of the analysis, and what cutoff value of Information Value should be used when deciding which variables to keep?

user43247
  • Please tell us more specifically what calculation "information value" refers to: there does not seem to be a standardized quantitative meaning for that term that all readers will understand in the same way. When you edit your question, please also provide more context to help us understand what kind of analysis you are discussing and what you are using the "cutoff point" for. – whuber Apr 09 '14 at 14:31
  • @whuber: he probably intends the sense used at https://stats.stackexchange.com/questions/462052/intuition-behind-weight-of-evidence-and-information-value-formula/462445#462445 – kjetil b halvorsen May 20 '21 at 05:22

1 Answer


Generally speaking, Information Value measures how well a variable $X$ is able to distinguish between the two levels of a binary response (e.g. "good" versus "bad") in some target variable $Y$. The idea is that if a variable $X$ has a low Information Value, it may not classify the target variable well enough, and it is therefore removed as an explanatory variable.

To see how this works, let $X$ be grouped into $n$ bins. Each $x \in X$ corresponds to a $y \in Y$ that may take one of two values, say 0 or 1. Then for bins $X_i$, $1 \leq i \leq n$,

$$ IV = \sum_{i=1}^n (g_i - b_i)\ln\!\left(\frac{g_i}{b_i}\right) $$

where

$b_i = \dfrac{\#\text{ of }0\text{'s in }X_i}{\#\text{ of }0\text{'s in }X} =$ the proportion of $0$'s in bin $i$ relative to all bins,

$g_i = \dfrac{\#\text{ of }1\text{'s in }X_i}{\#\text{ of }1\text{'s in }X} =$ the proportion of $1$'s in bin $i$ relative to all bins.

$\ln(g_i/b_i)$ is also known as the Weight of Evidence (for bin $X_i$). Cutoff values vary and their selection is subjective; I often remove variables with $IV < 0.3$ (as does [1] below).
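
For concreteness, here is a minimal sketch of the calculation above in Python. The quantile binning, the 0/1 target encoding, and the synthetic example are my own illustrative choices, not part of any standard routine:

```python
import numpy as np
import pandas as pd

def information_value(x: pd.Series, y: pd.Series, n_bins: int = 10) -> float:
    """IV of predictor x against a binary 0/1 target y, using quantile bins."""
    bins = pd.qcut(x, q=n_bins, duplicates="drop")  # group x into n bins
    grouped = y.groupby(bins, observed=True)
    g = grouped.sum() / y.sum()                     # g_i: share of all 1's in bin i
    b = (grouped.count() - grouped.sum()) / (len(y) - y.sum())  # b_i: share of all 0's
    woe = np.log(g / b)                             # Weight of Evidence per bin
    return float(((g - b) * woe).sum())

# Synthetic example: x genuinely drives y, so the IV should be sizeable.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=5000))
y = pd.Series(rng.binomial(1, 1 / (1 + np.exp(-2 * x))))
print(information_value(x, y))
```

Note that a bin containing only $0$'s or only $1$'s makes the WoE infinite; the comments below discuss corrections for that case.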

In the context of credit scoring, these two resources should help:

[1] http://www.mwsug.org/proceedings/2013/AA/MWSUG-2013-AA14.pdf

[2] http://support.sas.com/resources/papers/proceedings12/141-2012.pdf

dmanuge
  • Do you know of any sort of correction for calculating Information Value when one of the bins is either all good or all bad? My idea is to add 1 to each column of each bin to correct for this situation. I am wondering if this is a common practice or if there are any other theoretical concerns. I am mostly considering this step out of pragmatism. – Zelazny7 Sep 30 '14 at 14:52
  • I've seen some practitioners remove the term with all goods or all bads from the summation, but I wouldn't recommend this because you'd essentially be nullifying a perfect association. Adding a constant (say $c$) is an interesting solution, but the choice of constant and the size of the bin will greatly affect your IV: as $c$ approaches 0 or the bin size approaches infinity, the IV approaches infinity. To obtain a more representative IV, you might want to consider combining adjacent bins that have all goods or all bads. – dmanuge Jan 07 '15 at 00:21
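
For illustration, here is a minimal sketch of the additive correction discussed in these comments, assuming per-bin good/bad counts are already available. The constant $c = 0.5$ and the counts are arbitrary choices, and, as noted above, the resulting IV is sensitive to both:

```python
import numpy as np

def iv_smoothed(goods: np.ndarray, bads: np.ndarray, c: float = 0.5) -> float:
    """IV from per-bin good/bad counts, adding a constant c to every count
    so that bins with only goods or only bads do not produce log(0)."""
    g = (goods + c) / (goods + c).sum()  # smoothed share of goods per bin
    b = (bads + c) / (bads + c).sum()    # smoothed share of bads per bin
    return float(((g - b) * np.log(g / b)).sum())

# The first bin has zero bads; without smoothing the sum would be infinite.
print(iv_smoothed(np.array([50, 30, 20]), np.array([0, 40, 60])))
```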