
I am developing a predictive model for insurance risk. Such models deal with "rare events", as in airline no-show prediction, hardware fault detection, etc. Once I had prepared my data set, I tried classification, but I couldn't obtain useful classifiers because of the high proportion of negative cases.

I don't have much experience in statistics or data modeling beyond a high school statistics course, so I'm a bit confused.

As a first thought, I have been considering an inhomogeneous Poisson process model, built on event data (date, lat, lon), to get a good estimate of the chance of a risk event at a particular time on a particular day in a particular place.

I'd like to know, what are the methodologies/algorithms to predict rare events?
What do you recommend as an approach to tackle this problem?

Nick Stauner
user3378649

1 Answer


The standard approach is "extreme value theory"; there is an excellent book on the subject by Stuart Coles (although the current price seems rather, err ... extreme).
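To make the idea concrete, here is a minimal peaks-over-threshold sketch, one common extreme value technique, using `scipy.stats.genpareto`. It is not taken from the book mentioned above. The helper name `pot_exceedance_prob`, the synthetic exponential data and the 95% threshold are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import genpareto

def pot_exceedance_prob(samples, threshold, level):
    """Estimate P(X > level), for level >= threshold, from a peaks-over-threshold fit.

    Hypothetical helper for illustration; not part of any library.
    """
    samples = np.asarray(samples, dtype=float)
    excesses = samples[samples > threshold] - threshold
    if excesses.size == 0:
        raise ValueError("no observations above the threshold")
    # Fit a generalised Pareto distribution to the excesses, location fixed at 0.
    shape, _, scale = genpareto.fit(excesses, floc=0)
    # P(X > level) = P(X > threshold) * P(X - threshold > level - threshold | X > threshold)
    p_over_threshold = excesses.size / samples.size
    return p_over_threshold * genpareto.sf(level - threshold, shape, loc=0, scale=scale)

rng = np.random.default_rng(0)
data = rng.standard_exponential(20_000)   # synthetic data, for illustration only
u = np.quantile(data, 0.95)               # assumed threshold: the empirical 95% quantile
p_hat = pot_exceedance_prob(data, u, level=6.0)
p_true = np.exp(-6.0)                     # exact tail probability for this synthetic case
print(f"estimated {p_hat:.5f}, exact {p_true:.5f}")
```

In practice the choice of threshold matters a great deal: too low and the generalised Pareto approximation is poor, too high and there are too few excesses to fit.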

The reason you are unlikely to get good results with classification or regression methods is that they typically predict the conditional mean of the data. Extreme events are usually caused by a conjunction of "random" factors all aligning in the same direction, so they lie in the tails of the distribution of plausible outcomes, usually a long way from the conditional mean. What you can do instead is predict the whole conditional distribution, rather than just its mean, and obtain the probability of an extreme event by integrating the tail of that distribution above some threshold. I found this worked well in an application on statistical downscaling of heavy precipitation.
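As a rough illustration of "predict the whole conditional distribution and integrate the tail", the sketch below assumes the simplest possible model, $y \mid x \sim \mathcal{N}(a + bx, \sigma^2)$, fitted by least squares. The Gaussian form, the function names and the synthetic data are assumptions for the example, not the model used in the precipitation application. With real heavy-tailed data a different distributional family would be needed.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_linear(x, y):
    """Least-squares fit of y | x ~ Normal(a + b*x, sigma^2); returns (coef, sigma)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    sigma = residuals.std(ddof=2)  # two fitted parameters: intercept and slope
    return coef, sigma

def tail_probability(x_new, threshold, coef, sigma):
    """P(y > threshold | x_new): integral of the conditional density above the threshold."""
    mean = coef[0] + coef[1] * np.asarray(x_new, dtype=float)
    return norm.sf(threshold, loc=mean, scale=sigma)  # survival function = upper tail

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=5_000)                  # synthetic predictor
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)   # synthetic response
coef, sigma = fit_gaussian_linear(x, y)
probs = tail_probability([1.0, 9.0], threshold=9.0, coef=coef, sigma=sigma)
print(probs)  # tail probability at x = 1 and at x = 9
```

The point of the sketch is that the tail probability changes with the predictors even though the threshold is fixed, which a single conditional-mean prediction cannot express.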

Dikran Marsupial
  • Is there any implementation of this theory in Python? – user3378649 Apr 20 '14 at 21:06
  • Sorry, I don't program in Python (yet) so I can't help there. – Dikran Marsupial Apr 22 '14 at 09:51
  • Sorry, I don't quite understand your reasoning. Say you have a r.v. $y$ and predictors $x_1,\dots,x_n$, and you are interested in predicting when $y>Y_0$, which happens rarely. Why can't you fit some standard classification model, say logistic regression, to estimate the conditional probability $P(y>Y_0|x_1,\dots,x_n)$? If I understand correctly, you are saying that modelling the conditional mean $E(y|x_1,\dots,x_n)$ doesn't give us useful information about the extreme event $y>Y_0$; this is true. But we can still estimate $P(y>Y_0|x_1,\dots,x_n)$ using standard classification without extreme value theory, no? – Kochede Dec 02 '14 at 09:07
  • Yes, you can do that; however, the cost function you are minimising is not focussed on getting the tails of the distribution right, so if that is what you are interested in, it is better to try to model the events in the tails more explicitly. – Dikran Marsupial Dec 02 '14 at 13:32