
I am using a random forest to predict MRR (material removal rate), but the predictions have been quite off the mark; even linear regression gave a much better result. I don't know where I'm going wrong. Below is my code in R:

data <- structure(list(
  A = c(50L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 60L, 60L,
        70L, 70L, 70L, 70L, 70L, 80L, 80L, 80L, 80L, 80L,
        90L, 90L, 90L, 90L, 90L),
  B = c(3L, 5L, 7L, 9L, 11L, 3L, 5L, 7L, 9L, 11L,
        3L, 5L, 7L, 9L, 11L, 3L, 5L, 7L, 9L, 11L,
        3L, 5L, 7L, 9L, 11L),
  C = c(100L, 200L, 300L, 400L, 500L, 200L, 300L, 400L, 500L, 100L,
        300L, 400L, 500L, 100L, 200L, 400L, 500L, 100L, 200L, 300L,
        500L, 100L, 200L, 300L, 400L),
  D = c(65L, 70L, 75L, 80L, 85L, 75L, 80L, 85L, 65L, 70L,
        85L, 65L, 70L, 75L, 80L, 70L, 75L, 80L, 85L, 65L,
        80L, 85L, 65L, 70L, 75L),
  E = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.5, 0.6, 0.2, 0.3, 0.4,
        0.3, 0.4, 0.5, 0.6, 0.2, 0.6, 0.2, 0.3, 0.4, 0.5,
        0.4, 0.5, 0.6, 0.2, 0.3),
  MRR = c(8.926014, 14.10501, 38.40095, 48.49642, 88.21002,
          4.892601, 15.179, 26.92124, 38.78282, 89.16468,
          5.298329, 10.04773, 18.30549, 49.21241, 79.57041,
          2.362768, 4.868735, 22.52983, 44.8926, 49.06921,
          1.312649, 7.207637, 18.61575, 25.1074, 48.01909)
), class = "data.frame", row.names = c(NA, -25L))

#Splitting the data

library(caTools)

set.seed(123)
split <- sample.split(data$MRR, SplitRatio = 0.7)
training_set <- subset(data, split == TRUE)
test_set <- subset(data, split == FALSE)

#Building the model and making predictions
library(randomForest)
set.seed(123)
rforest <- randomForest(x = training_set[-6],  # predictors only, MRR (column 6) dropped
                        y = training_set$MRR,
                        ntree = 500)

pred_rforest <- predict(rforest, test_set[,1:5])

#Also building a Decision tree model for the prediction
library(rpart)
dtree <- rpart(formula = MRR ~ .,
               data = training_set,
               control = rpart.control(minsplit = 1))

pred_dtree <- predict(dtree, test_set[,1:5])

#Checking the accuracy
library(MLmetrics)
MAPE(pred_dtree, test_set[,6])
MAPE(pred_rforest, test_set[,6])

Both results were very bad.

Any help would mean a lot.


1 Answer


Your problem is not with the random forest, but with the MAPE.

> test_set[,6]
[1] 48.496420 88.210020 26.921240  5.298329  2.362768 49.069210  1.312649 25.107400
> pred_rforest
       4        5        8       11       16       20       21       24 
29.96306 34.66228 28.90458 26.67681 19.65220 36.34482 22.29752 36.44267

Several of your forecasts overshoot the corresponding actuals by a factor of two or more. For instance, the second-to-last pair has an actual of 1.312649 and a forecast of 22.29752, for an absolute percentage error of

> (22.29752-1.312649)/1.312649
[1] 15.98666

That is, 1599%. This single misforecast (along with a number of other similar but less blatant cases) will completely dominate all the other errors in the averaging step.
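
You can see this directly by listing all the absolute percentage errors, using the objects already defined in your code:

# per-observation absolute percentage errors for the random forest
abs(pred_rforest - test_set[, 6]) / test_set[, 6]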


Your random forest, and other methods, are attempting to give you unbiased forecasts of the conditional expectation. The key thing to keep in mind is that the MAPE does not reward expectation forecasts. It is optimized by a completely different functional of the (unknown) future density, one which is almost always lower than the expectation. What your MAPEs above 100% are telling you is that you would be "better off" with a flat zero forecast, which would yield a MAPE of exactly 100% (see the quick check below). If this is not what you want, then the MAPE is probably not the tool you need.
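
As a quick sanity check of that last claim, using MLmetrics::MAPE as in your code: a constant zero forecast scores exactly 1, i.e. 100%, because |y - 0| / y = 1 for every positive actual.

# a flat zero forecast yields a MAPE of exactly 100%
MAPE(rep(0, nrow(test_set)), test_set$MRR)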

If you truly want MAPE-minimal forecasts, you need to do some post-processing. Ideally, you would calculate predictive densities, then extract the MAPE-minimizing point forecast from each density via numerical minimization or simulation. (To be honest, I have never seen a situation where the actual underlying business problem was best addressed with a MAPE-minimal forecast.)
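
To illustrate the mechanics only (this is a sketch, not a proper predictive density): one could abuse the spread of the per-tree predictions of your random forest as a crude sample of plausible outcomes, then, for each test case, numerically pick the point forecast that minimizes the mean absolute percentage error against that sample. Bear in mind that the per-tree spread understates the true predictive uncertainty.

# crude stand-in for a predictive sample: the individual per-tree predictions
tree_preds <- predict(rforest, test_set[, 1:5], predict.all = TRUE)$individual

# for each test case, find the point forecast f minimizing mean(|y - f| / y)
pred_mape_min <- apply(tree_preds, 1, function(y_sample) {
  optimize(function(f) mean(abs(y_sample - f) / y_sample),
           interval = range(y_sample))$minimum
})

MAPE(pred_mape_min, test_set[, 6])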

If you are interested in unbiased expectation forecasts, you should use a different metric, like the (Root) Mean Squared Error.
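
With MLmetrics already loaded, you can score the very same forecasts on RMSE directly:

# same forecasts, evaluated on root mean squared error instead of MAPE
RMSE(pred_rforest, test_set[, 6])
RMSE(pred_dtree, test_set[, 6])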

More information can be found in the thread What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?, as well as in Kolassa (2020, IJF). Your particular case of a high coefficient of variation is discussed in the dice-rolling experiment in the linked thread, which in turn is taken from Kolassa & Martin (2011, Foresight).

(I apologize for the shameless self-promotion.)

Stephan Kolassa
  • Thank you very much for your time and answer, and thank you for the link; I will check it out. I will try other error measures too, but my main problem is the wrong predictions the model is making, like the one you mentioned in your post, or, say, a predicted value of 34.66 when the actual is 88.21. I am wondering how to improve them. – Single Handed Sailor Mar 27 '21 at 10:08
  • That may indeed also be worth investigating. If so, you should use the MSE to assess your forecast quality. Regarding this particular example, note that the 88.21 observation is the largest one in the entire sample. The largest observations will always be underpredicted by an expectation prediction, just as the smallest observations will always be overpredicted (see https://stats.stackexchange.com/q/390210/1352). This may also be useful: https://stats.stackexchange.com/q/222179/1352 – Stephan Kolassa Mar 27 '21 at 10:15