
I am trying to mimic partial dependence plots as explained in R's pdp package (https://journal.r-project.org/archive/2017/RJ-2017-016/RJ-2017-016.pdf). The second page (page 422) of the document describes a loop to calculate partial dependence for a specific variable.

I tried to mimic this on the PimaIndiansDiabetes dataset in R, focusing on the `pregnant` variable, which ranges from 0 to 17. Here's my code:

library(mlbench)
library(pdp)
library(caret)
library(randomForest)

data("PimaIndiansDiabetes")

#Create 0/1 for target variable
dmy <- dummyVars(~., data=PimaIndiansDiabetes, fullRank = TRUE)
pid <- as.data.frame(predict(dmy, PimaIndiansDiabetes))
pid$diabetes.pos <- as.factor(pid$diabetes.pos)

#Build random forest model
set.seed(4)
model <- randomForest(diabetes.pos~., data=pid, ntree=10, mtry=3)

#Create partial dependence plot for 'pregnant' using pdp package
pdp.pregnant <- partial(model, pred.var="pregnant", prob=TRUE)

#Create partial dependence plot for 'pregnant' using pdp document methodology
pid2 <- pid
part.pregnant <- data.frame(pregnant=numeric(18), yhat=numeric(18))
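for(i in 0:17){
  #Fix 'pregnant' at i for every row
  pid2$pregnant <- i
  #predict() returns class labels ("0"/"1") here, so the mean below is the share of rows predicted as class 1
  part.pred <- as.numeric(as.character(predict(model, pid2)))
  part.pregnant[i+1, 1] <- i
  part.pregnant[i+1, 2] <- mean(part.pred)
}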

#Plot both - they are different
plot(pdp.pregnant$pregnant, pdp.pregnant$yhat, type='l')
plot(part.pregnant$pregnant, part.pregnant$yhat, type='l')

[Two plots of yhat against pregnant: one from the pdp package, one from the manual loop]

As you can see, the plots are different. So my questions are:

1) Why are they different? Did I miss something in my implementation of the methodology?
2) How am I supposed to interpret the effect of `pregnant` on `diabetes.pos` according to the plot from the pdp package?

Gaurav Bansal
  • It's a classification problem. By default, `partial` uses `prob = FALSE`, which puts the response on the centered logit scale. Setting `prob = TRUE` uses the probability scale instead. – Michael M Jan 19 '18 at 15:36
  • Thanks, I changed the code above to include `prob=T` and put in the new plot. The plots are still different; in fact, they are almost mirror images of each other. Does this mean the `partial` function is just treating the 0/1 prediction in the opposite way I am? Is the correct interpretation of the partial dependence of the `pregnant` variable that as `pregnant` goes up, the chance of `diabetes.pos` goes up? – Gaurav Bansal Jan 19 '18 at 15:55
  • I just added an implementation that matches `pdp` for regression problems at https://stats.stackexchange.com/q/50560/99938 if you're interested. – alexpghayes Aug 27 '18 at 02:14
  • You can change the class using `which.class = 2`, which sets the predicted class to the second factor level rather than the first. Levels are usually ordered alphabetically. – pedram Apr 06 '19 at 19:34
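
The points raised in the comments about `prob` and `which.class` can be checked with a short sketch. It is not taken from the pdp documentation. It assumes that `partial(..., prob = TRUE)` averages the predicted probability of the class selected by `which.class` (the first level, `0`, by default), whereas the loop in the question averages hard 0/1 class labels. The grid `0:17` is passed through `pred.grid` so both approaches evaluate the same values, and `pid` is rebuilt without caret only to keep the example short (assumed equivalent to the `dummyVars()` step).

```r
library(mlbench)
library(pdp)
library(randomForest)

data("PimaIndiansDiabetes")

# Rebuild the 0/1 target without caret (assumption: same result as the
# dummyVars() step in the question, "pos" -> 1 and "neg" -> 0).
pid <- PimaIndiansDiabetes
pid$diabetes.pos <- factor(as.integer(pid$diabetes == "pos"))
pid$diabetes <- NULL

set.seed(4)
model <- randomForest(diabetes.pos ~ ., data = pid, ntree = 10, mtry = 3)

# Manual partial dependence: hold `pregnant` fixed at each grid value and
# average the predicted probability of class "1" over all rows.
grid <- data.frame(pregnant = as.numeric(0:17))
manual <- data.frame(pregnant = grid$pregnant, yhat = NA_real_)
pid2 <- pid
for (k in seq_len(nrow(grid))) {
  pid2$pregnant <- grid$pregnant[k]
  prob <- predict(model, newdata = pid2, type = "prob")
  manual$yhat[k] <- mean(prob[, "1"])
}

# pdp on the probability scale for the second class ("1"), on the same grid.
pd <- partial(model, pred.var = "pregnant", pred.grid = grid,
              prob = TRUE, which.class = 2L, train = pid)

# Largest absolute gap between the two curves.
max_gap <- max(abs(manual$yhat - pd$yhat))
print(max_gap)
```

If the assumption holds, `max_gap` should be essentially zero. With only two classes, `1 - manual$yhat` is then the curve for the default `which.class = 1`, which would account for the mirror-image shape. The loop in the question averages class labels rather than probabilities, so its curve would be expected to resemble this one in direction but not match it point for point.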
