13

I'm modelling an event with two outcomes, a and b. I have created a model which estimates the probability of each outcome (e.g., the model might estimate that a will happen with a 40% chance and b with a 60% chance).

I have a large record of trial outcomes together with the model's estimates for each trial. I would like to use this data to quantify how accurate the model is. Is this possible, and if so, how?

amoeba
Peter
  • I might be wrong but I think you're interested in the training- and/or test-error of your model. See, for example: http://www.cs.ucla.edu/~falaki/pub/classification.pdf – Stijn Jan 03 '12 at 12:52
  • @Stijn He's predicting the probability though rather than directly classifying as a or b, so I don't think those metrics are what he's asking for. – Michael McGowan Jan 03 '12 at 14:57
  • Are the trials assumed independent? – cardinal Jan 03 '12 at 15:02
  • The trials are independent, yep. – Peter Jan 03 '12 at 15:40
  • Are you more interested in how well the model will eventually perform for classification (in which case ROC and AUC type of analysis seems most relevant: http://en.wikipedia.org/wiki/Receiver_operating_characteristic)? Or are you more interested in understanding how "calibrated" the probability predictions are (i.e. does P(Outcome = A) = 60% really mean 60%, or just that outcome = A is more likely than the other outcomes... – DavidR Jan 03 '12 at 18:03
  • It sounds like you want to know about [probability scoring](http://scholar.google.com/scholar?q=probability+scoring&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&um=1&ie=UTF-8&hl=en&sa=N&tab=ws). – whuber Jan 03 '12 at 18:24
  • @DavidR, I want to know if P(outcome = a) really is 60% (or whatever) – Peter Jan 03 '12 at 20:08
  • @whuber Could you give a more precise reference for probability scoring? I'm not sure where to look for the good papers. – Elvis Jan 03 '12 at 20:13
  • Some related questions include [this](http://stats.stackexchange.com/q/2275/2485) and [this](http://stats.stackexchange.com/q/1875/2485). – Michael McGowan Jan 03 '12 at 20:19
  • Elvis, an article in the current issue of [Decision Analysis](http://da.journal.informs.org/content/8/4/256) drew my attention to probability scoring. It appears to build on substantial literature on the topic. (I don't have access to any more than the abstract, though, so I cannot comment on the article itself.) A cover paper by the journal's editors (which is [freely available](http://da.journal.informs.org/cgi/reprint/8/4/251)) mentions a number of previous papers on the same topic. – whuber Jan 03 '12 at 20:20
  • Thanks, Michael. Your [first reference](http://stats.stackexchange.com/questions/2275) seems to be an exact duplicate of this question. Perhaps we should merge the two? – whuber Jan 03 '12 at 20:24
  • @whuber I don't know what exactly happens in a merge, but I don't want the wording of either to disappear. When one doesn't know of a term (like "scoring rule"), this type of thing can be particularly difficult to search for. As such, letting both exist in some form might be useful to those searching the site. I don't know if merging will achieve that goal or not. – Michael McGowan Jan 03 '12 at 20:28
  • In a merge, one of the *questions* will disappear, but all the replies and comments will be collected beneath the remaining question (and will retain their original time stamps, editing histories, votes, etc.). To enhance searching we can further edit the remaining question or even paste key parts of the other question into it. The important criterion is whether both questions have been interpreted the same way by their respondents. If so, a merge is in order; if not, we ought to edit one (or both) to clarify their differences. – whuber Jan 03 '12 at 20:32

1 Answer

18

Suppose your model does indeed predict A has a 40% chance and B has a 60% chance. In some circumstances you might wish to convert this into a classification that B will happen (since it is more likely than A). Once converted into a classification, every prediction is either right or wrong, and there are a number of interesting ways to tally those right and wrong answers. One is straight accuracy (the percentage of right answers). Others include precision and recall or F-measure. As others have mentioned, you may wish to look at the ROC curve. Furthermore, your context may supply a specific cost matrix that rewards true positives differently from true negatives and/or penalizes false positives differently from false negatives.
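
For concreteness, here is a minimal sketch of that classification route, assuming scikit-learn is available; the outcome and forecast arrays below are made-up illustrations, not anything from your model:

```python
# Thresholding probability predictions at 0.5 and tallying classification metrics.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# o: observed outcomes (1 = B happened, 0 = A happened)
# f: model's predicted probability that B happens
o = np.array([1, 0, 1, 1, 0, 1, 0, 0])
f = np.array([0.9, 0.4, 0.6, 0.8, 0.3, 0.55, 0.2, 0.65])

pred = (f >= 0.5).astype(int)  # classify as B whenever P(B) >= 0.5

print("accuracy :", accuracy_score(o, pred))
print("precision:", precision_score(o, pred))
print("recall   :", recall_score(o, pred))
print("F1       :", f1_score(o, pred))
print("ROC AUC  :", roc_auc_score(o, f))  # uses the probabilities, not the hard labels
```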

However, I don't think that's what you are really looking for. If you said B has a 60% chance of happening and I said it had a 99% chance of happening, we have very different predictions even though they would both get mapped to B in a simple classification system. If A happens instead, you are just kind of wrong while I am very wrong, so I'd hope that I would receive a stiffer penalty than you. When your model actually produces probabilities, a scoring rule is a measure of the performance of your probability predictions. Specifically, you probably want a proper scoring rule, meaning one whose expected score is optimized when you report well-calibrated probabilities.

A common example of a scoring rule is the Brier score: $$BS = \frac{1}{N}\sum\limits _{t=1}^{N}(f_t-o_t)^2$$ where $f_t$ is the forecasted probability of the event happening and $o_t$ is 1 if the event did happen and 0 if it did not.
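
Computing it from your records is just the mean squared difference between the forecasts and the 0/1 outcomes. A minimal sketch (the example numbers are invented, and the scikit-learn one-liner at the end is optional):

```python
# Brier score as defined above: f_t are forecast probabilities, o_t the 0/1 outcomes.
import numpy as np

def brier_score(f, o):
    """Mean squared difference between forecast probabilities and observed outcomes."""
    f = np.asarray(f, dtype=float)
    o = np.asarray(o, dtype=float)
    return np.mean((f - o) ** 2)

o = [1, 0, 1, 1, 0]
print(brier_score([0.8, 0.2, 0.7, 0.9, 0.1], o))    # lower (better) score
print(brier_score([0.99, 0.01, 0.2, 0.99, 0.9], o))  # higher (worse) score

# Equivalent via scikit-learn, if available:
# from sklearn.metrics import brier_score_loss
# brier_score_loss(o, f)
```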

Of course the type of scoring rule you choose might depend on what type of event you are trying to predict. However, this should give you some ideas to research further.

I'll add a caveat: regardless of what you do, when assessing your model this way I suggest you look at your metric on out-of-sample data (that is, data not used to build your model). This can be done through cross-validation. Perhaps more simply, you can build your model on one dataset and then assess it on another (being careful not to let inferences from the out-of-sample data spill into the in-sample modeling).
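
For illustration only, here is a sketch of a 5-fold cross-validated Brier score; the logistic regression and the synthetic data are stand-ins for your model and your trial records:

```python
# Out-of-sample Brier score via k-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    prob_b = model.predict_proba(X[test_idx])[:, 1]  # predicted P(outcome = B)
    scores.append(brier_score_loss(y[test_idx], prob_b))  # scored on held-out folds only

print("mean out-of-sample Brier score:", np.mean(scores))
```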

Michael McGowan