
According to the Isolation Forest papers (references are given in the scikit-learn documentation), the score produced by Isolation Forest should be between 0 and 1.
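(For reference, the score in the original paper by Liu et al. is defined as $$s(x, n) = 2^{-E(h(x))/c(n)},$$ where $E(h(x))$ is the average path length of point $x$ across the trees and $c(n)$ is the average path length of an unsuccessful search in a binary search tree on $n$ points, so $s$ lies in $(0, 1]$: values near 1 indicate anomalies, while values around 0.5 or below indicate normal points.)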

The implementation in scikit-learn negates the scores (so a high score corresponds to an inlier) and also seems to shift them by some amount. I've tried to figure out how to reverse this but haven't succeeded so far. The code has some methods and attributes, like score_samples() and self.offset_, that are not accessible from a fitted object, and the documentation and the code comments on the use of self.contamination seem contradictory...

I have version 0.19.1 of scikit-learn (I can't tell whether there have been significant changes to the IsolationForest implementation since then).

Any ideas/suggestions would be appreciated!


3 Answers


So the code that corresponds to IsolationForest in 0.19.1 can be found here. This makes your problem a lot more manageable and a lot less confusing since what currently lives on sklearn's master branch is quite different from the 0.19.1 release.

In this version, we can recover the underlying scores directly, since decision_function gives them to us like this:

# ... (path lengths across the ensemble are computed above) ...
scores = 2 ** (-depths.mean(axis=1) / _average_path_length(self.max_samples_))
return 0.5 - scores

scores is calculated exactly as you'd expect from the original paper. To recover what we want, we simply have to do the following:

import sklearn.ensemble

model = sklearn.ensemble.IsolationForest()
model.fit(data)

# decision_function returns 0.5 - score, so undo the shift and sign flip
sklearn_score_anomalies = model.decision_function(data_to_predict)
original_paper_score = 0.5 - sklearn_score_anomalies
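As a sanity check, here's a minimal end-to-end sketch on synthetic data (the data, variable names, and seed are illustrative, not from the question; it assumes the 0.19.1 behavior above, or the default offset in later releases):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.randn(200, 2)                    # synthetic "normal" points
X_outliers = rng.uniform(-6, 6, size=(20, 2))  # synthetic anomalies

model = IsolationForest(random_state=42)
model.fit(X_train)

# Recover the paper-style score: around 0.5 or below for normal points,
# approaching 1 for anomalies.
paper_scores = 0.5 - model.decision_function(X_outliers)
assert np.all((paper_scores >= 0) & (paper_scores <= 1))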

A very important note going forward: this will not be the default behavior of decision_function in future releases of scikit-learn, so check the docs for later releases to see what you have to do to recover the original score from your model!

Hope this helped!

  • Thank you! This is certainly a simpler situation. I guess I didn't realize that master is not the latest stable version (which is 0.19.1) – DAF Jun 12 '18 at 20:22

A shortcut for tm1212's answer could be:

import sklearn.ensemble

model = sklearn.ensemble.IsolationForest()
model.fit(data)
# score_samples (scikit-learn 0.20+) returns the negated paper score
sklearn_score_anomalies = abs(model.score_samples(data_to_predict))
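
This should agree with the 0.5 - decision_function() recipe from the accepted answer whenever offset_ has its default value of -0.5; a quick check (a sketch under that assumption, with illustrative data):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(100, 2)

model = IsolationForest(random_state=0).fit(X)

via_abs = np.abs(model.score_samples(X))      # score_samples is always <= 0
via_shift = 0.5 - model.decision_function(X)  # decision_function = score_samples - offset_
assert np.allclose(via_abs, via_shift)        # holds when offset_ == -0.5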

The larger the score, the more likely the point is an inlier, so you can normalize the scores to a probability-like value in [0, 1] with min-max scaling:

predict = model.score_samples(X)
# min-max scale to [0, 1], then flip so that 1 means "more anomalous"
proba = (predict - predict.min()) / (predict.max() - predict.min())
proba = 1 - proba
  • The min-max normalization scheme will give a value between 0 and 1 for any input as long as you have at least 2 different values. But is this the same scale referred to in the Isolation Forest papers, as OP asks? If so, can you edit your answer to explain how this will produce the same values as articulated in the paper? – Sycorax Jan 25 '21 at 02:26