
According to the Isolation Forest papers (references are given in the scikit-learn documentation), the score produced by Isolation Forest should be between 0 and 1.
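(For reference, the score in the original paper by Liu et al. is defined as $$s(x, n) = 2^{-E(h(x))/c(n)},$$ where $E(h(x))$ is the average path length of point $x$ across the trees and $c(n)$ is the average path length of an unsuccessful search in a binary search tree on $n$ points, so $s$ lies in $(0, 1]$: values near 1 indicate anomalies, while values around 0.5 or below indicate normal points.)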

The implementation in scikit-learn negates the scores (so a high score corresponds to an inlier) and also seems to shift them by some amount. I've tried to figure out how to reverse this but haven't succeeded so far. The code has some methods and attributes, like score_samples() and self.offset_, that are not accessible from a fitted object, and the documentation and the code comments on the use of self.contamination seem contradictory...

I have version 0.19.1 of scikit-learn (I can't tell whether there have been significant changes to the IsolationForest implementation since then).

Any ideas/suggestions would be appreciated!


3 Answers


So the code that corresponds to IsolationForest in 0.19.1 can be found here. This makes your problem a lot more manageable and a lot less confusing since what currently lives on sklearn's master branch is quite different from the 0.19.1 release.

In this version, we can recover the underlying scores directly, since decision_function gives them to us like this:

# ... (path lengths across the ensemble are computed above) ...
scores = 2 ** (-depths.mean(axis=1) / _average_path_length(self.max_samples_))
return 0.5 - scores

scores is calculated exactly as you'd expect from the original paper. To recover what we want, we simply have to do the following:

import sklearn.ensemble

model = sklearn.ensemble.IsolationForest()
model.fit(data)

# decision_function returns 0.5 - score, so undo the shift and sign flip
sklearn_score_anomalies = model.decision_function(data_to_predict)
original_paper_score = 0.5 - sklearn_score_anomalies
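As a sanity check, here's a minimal end-to-end sketch on synthetic data (the data, variable names, and seed are illustrative, not from the question; it assumes the 0.19.1 behavior above, or the default offset in later releases):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.randn(200, 2)                    # synthetic "normal" points
X_outliers = rng.uniform(-6, 6, size=(20, 2))  # synthetic anomalies

model = IsolationForest(random_state=42)
model.fit(X_train)

# Recover the paper-style score: around 0.5 or below for normal points,
# approaching 1 for anomalies.
paper_scores = 0.5 - model.decision_function(X_outliers)
assert np.all((paper_scores >= 0) & (paper_scores <= 1))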

A very important note going forward: this will not be the default behavior of decision_function in future releases of scikit-learn, so check the docs for later releases to see what you have to do to recover the original score from your model!

Hope this helped!

  • Thank you! This is certainly a simpler situation. I guess I didn't realize that master is not the latest stable version (which is 0.19.1) – DAF Jun 12 '18 at 20:22

A shortcut for tm1212's answer could be:

import sklearn.ensemble

model = sklearn.ensemble.IsolationForest()
model.fit(data)
# score_samples (scikit-learn 0.20+) returns the negated paper score
sklearn_score_anomalies = abs(model.score_samples(data_to_predict))
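
This should agree with the 0.5 - decision_function() recipe from the accepted answer whenever offset_ has its default value of -0.5; a quick check (a sketch under that assumption, with illustrative data):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(100, 2)

model = IsolationForest(random_state=0).fit(X)

via_abs = np.abs(model.score_samples(X))      # score_samples is always <= 0
via_shift = 0.5 - model.decision_function(X)  # decision_function = score_samples - offset_
assert np.allclose(via_abs, via_shift)        # holds when offset_ == -0.5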

The larger the score, the more likely the point is an inlier, so you can normalize the scores to a probability-like value in [0, 1] with min-max scaling:

predict = model.score_samples(X)
# min-max scale to [0, 1], then flip so that 1 means "more anomalous"
proba = (predict - predict.min()) / (predict.max() - predict.min())
proba = 1 - proba
  • The min-max normalization scheme will give a value between 0 and 1 for any input as long as you have at least 2 different values. But is this the same scale referred to in the Isolation Forest papers, as OP asks? If so, can you edit your answer to explain how this will produce the same values as articulated in the paper? – Sycorax Jan 25 '21 at 02:26