Assume I build a binary classification model to predict p(y=1) from {x1, x2, ... x10}
For now, assume that model could be a GBM, RandomForest, or Logistic Regression.
Also assume that all of the independent variables are independent of one another, with the exception that x2 and x3 are strongly positively correlated with each other.
After the model is built, I run Partial Dependence Plots (PDPs) to try to gain some understanding of how the variables relate to the model outcome.
What I've observed is this: the lines on the PDP charts for x2 and x3 go in opposite directions. If I train a model without using x3, then the PDP line for x2 is increasing. If I train a model without using x2, then the PDP line for x3 is also increasing. However, when I train a model using both x2 and x3, the PDP line for x2 is increasing, while the line for x3 is decreasing.
1. Is this "normal" when 2 variables are strongly correlated with each other?
2. Does this depend on the underlying model having an assumption of independence, like Logistic Regression? Or can I assume that this effect can happen with algorithms like RandomForest and GBM as well?
I feel like the assumptions behind partial dependence are what is creating this problem - not the underlying model itself.
- If the answer to #1 is "yes", and the answer to #2 is that the underlying algorithm won't matter... then are there other options for understanding the effects (other than just excluding one of the variables)?