I am analyzing data (which I am unable to share) and have built several four-class classification models using the randomForest() function. They are fairly successful: in this example, when evaluated on the test set, the overall accuracy is above 0.88, and the accuracy for each class is above 0.86.

When I call plot() on these models, I always get graphs similar to the one pictured below: at some point in the graph, one of the curves always reaches an error rate of 1.

[Figure: output of plot(randomForest)]

I thought the graph might actually show accuracy rates, since the model has an accuracy of 0.95 for 'FVP'. But that would imply 'Normal' has an accuracy of about 0.35, which is not even close.

How do I interpret this graph? If the plot function is buggy, what else can I use to visualize a randomForest() object?

Karolis Koncevičius
Ilya K
  • It's written "Error" in the vertical axis, so it's not accuracy. – Firebug Aug 10 '18 at 16:50
  • Otherwise, I suggest you read `?randomForest:::plot.randomForest` – Firebug Aug 10 '18 at 16:59
  • Yes, it says "Error", but I was considering the possibility of mis-labeling. I read the documentation, it claims these are error rates, which is impossible. – Ilya K Aug 10 '18 at 18:18
  • Extremely relevant: https://stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest/348246#348246 – Sycorax Aug 10 '18 at 19:39

1 Answer

When the model is fitted without the xtest and ytest arguments, this plot shows out-of-bag (OOB) error rates, which can differ dramatically from error rates on a held-out test set.
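
As a minimal sketch of the difference, assuming the built-in iris data as a stand-in for the unshared data set (the split, the seed, and the object names rf_oob and rf_test are illustrative and not from the original post), the two error matrices can be read and plotted directly rather than relying on what plot() chooses to draw:

```r
library(randomForest)

set.seed(42)
# Illustrative data only: iris stands in for the asker's unshared data.
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Formula interface, no test set: only out-of-bag (OOB) error is recorded.
rf_oob <- randomForest(Species ~ ., data = train, ntree = 200)

# x/y interface with an explicit test set: test-set error is recorded too.
rf_test <- randomForest(x = train[, 1:4], y = train$Species,
                        xtest = test[, 1:4], ytest = test$Species,
                        ntree = 200)

# One row per number of trees; first column is the overall error,
# the remaining columns are the per-class errors.
oob_err  <- rf_oob$err.rate        # columns: "OOB", then one per class
test_err <- rf_test$test$err.rate  # columns: "Test", then one per class

tail(oob_err, 1)   # OOB error of the full forest
tail(test_err, 1)  # test-set error of the full forest

# Draw both sets of curves explicitly.
matplot(seq_len(nrow(oob_err)), oob_err, type = "l",
        xlab = "trees", ylab = "Error", main = "OOB error")
matplot(seq_len(nrow(test_err)), test_err, type = "l",
        xlab = "trees", ylab = "Error", main = "Test-set error")
```

Row i of each matrix is the error of the forest made of the first i trees, so the earliest rows rest on very few trees and can sit far above the final row.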

Ilya K
  • Could you elaborate? – rolando2 Aug 11 '18 at 12:33
  • When first creating the model, I passed the formula 'Y~.' (where Y is the column I wanted to predict) as the first argument, listed as "x, formula" in the documentation. Random forests are built by bagging, so when I called plot(), it showed correct error rates, but they were computed on the out-of-bag samples that randomForest() sets aside internally, not on my test set. I then realized there is another way to supply the data: the arguments x, y, xtest, and ytest are the training predictors, training response, test predictors, and test response, respectively. Using these inputs, plot() now uses my defined test set. – Ilya K Aug 11 '18 at 23:52
  • @IlyaK It would be best to expand your answer with the material in this comment. – Sycorax Dec 30 '20 at 14:33