Hey, I used the `iv` function from the `scorecard` package to calculate the Information Value of my independent variables. What surprised me is that for one of my numerical variables I get an information value of 4, even though the rule of thumb is that an IV higher than 0.5 is suspicious. Is it normal to get such a high IV?

Johhn White
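For reference, here is a minimal sketch of the kind of call the question describes, run on the `germancredit` data that ships with `scorecard` (the dataset and response column are illustrative, not the asker's data):

```r
library(scorecard)

data("germancredit")

# iv() computes the information value of every candidate predictor
# against the binary response column, numeric and categorical alike.
iv_table <- iv(germancredit, y = "creditability")

head(iv_table)   # one row per variable with its info_value
```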
- Is this credit scoring, and information value as used at https://stats.stackexchange.com/questions/462052/intuition-behind-weight-of-evidence-and-information-value-formula/462445#462445? I don't think it is used outside of credit scoring, so please add that tag. – kjetil b halvorsen May 24 '21 at 00:56
2 Answers
I'm going to assume you're trying to tackle a credit risk problem.
If you only have a very small dataset, getting an IV above 0.5 is not unlikely. This is because one variable alone may be enough to (almost) perfectly separate good and bad customers.
The reason 0.5 is used as a cut-off is that you may have what's called leakage. Any variable whose IV is above 0.5 may be a proxy for the response variable (whether or not a customer is good or bad).
The 0.5 is just a rule of thumb, and definitely shouldn't be taken as gospel. Think about whether or not it's possible for the variable to have an IV that high without being a proxy for the response variable.
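To make the leakage point concrete, here is a quick self-contained sketch (IV is computed by hand from ten quantile bins, with a small count correction to avoid log(0), rather than via `scorecard`): a variable that is essentially a noisy copy of the response ends up with an enormous IV, while a merely correlated one does not.

```r
set.seed(1)
n   <- 4000
bad <- rbinom(n, 1, 0.3)          # 1 = bad customer, 0 = good customer

# "leaky" is basically the response plus a little noise (a proxy);
# "honest" is correlated with the response but far from deterministic.
leaky  <- bad + rnorm(n, sd = 0.1)
honest <- 0.5 * bad + rnorm(n, sd = 1)

iv_manual <- function(x, y, n_bins = 10) {
  brks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  bins <- cut(x, breaks = brks, include.lowest = TRUE)
  tab  <- table(bins, y)
  p_good <- (tab[, "0"] + 0.5) / sum(tab[, "0"] + 0.5)   # +0.5 avoids log(0)
  p_bad  <- (tab[, "1"] + 0.5) / sum(tab[, "1"] + 0.5)
  sum((p_good - p_bad) * log(p_good / p_bad))            # IV definition
}

iv_manual(leaky,  bad)   # enormous, far above 0.5 -- flags a likely proxy
iv_manual(honest, bad)   # much smaller
```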

ralph
- Yes, it's a credit risk problem. My dataset is quite large (4k observations). The variables for which I get very high IVs (4.5, 2.5, 2.3, 2.3, 2.1 and 1.8 respectively) are numerical. I don't do binning before calculating IV because the `iv` function from the `scorecard` library also handles numerical variables. So is it still ok to have such a high IV? I only calculate IV to select variables, so I'm not sure whether I should care about it. – Johhn White May 24 '21 at 09:35
- Yeah, it's ok to have variables with IVs that high. Just make sure that if the variable makes it into the final model, it's not 'leaking' information about the response (i.e. not a proxy for the response). In industry, it's not uncommon to have well over 500 candidate variables. Like you alluded to, IV is just a way of clearing out 'less useful' variables early on in model development. It's not the best approach statistically, but given the computational cost of more advanced techniques, it gets used the most. – ralph May 24 '21 at 10:41
- It's also worth noting that the `iv` function will automatically bin continuous (and even categorical) variables before computing the IVs. – ralph May 24 '21 at 10:43
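On the automatic-binning point: if you want to inspect the bins behind a suspiciously high IV, the explicit `woebin()` route in the same package makes them visible, together with each variable's total IV. A minimal sketch on the bundled `germancredit` data (the exact output columns are from memory, so check against your version of the package):

```r
library(scorecard)

data("germancredit")

# woebin() bins every candidate predictor (tree-based binning by default)
# and returns one table of bins per variable.
bins <- woebin(germancredit, y = "creditability")

# Bins, WoE and IV contributions for a single variable,
# with the variable's total IV reported alongside.
bins$duration.in.month
```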
If you are using `varImp` in R, please note that the IV value is somehow rescaled to 100.
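Assuming this refers to `caret::varImp`: by default it rescales whatever importance score it computes so that the top variable reads 100, and `scale = FALSE` returns the raw values. A rough sketch, again on `germancredit` (note that for a plain glm the score caret reports is the absolute t-statistic, not an information value):

```r
library(caret)
library(scorecard)

data("germancredit")

# Fit a single logistic regression without resampling, just to show
# the scaling behaviour of varImp().
fit <- train(creditability ~ ., data = germancredit,
             method = "glm",
             trControl = trainControl(method = "none"))

varImp(fit)                 # rescaled: the top variable shows as 100
varImp(fit, scale = FALSE)  # unscaled importance scores
```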