Quantifying uncertainty when fitting a statistical model to partial effects/dependencies from a random forest (or other machine learning model)

Question

I have estimated the partial dependence of $y$ on one predictor from a fitted random forest (RF), and I now want to fit a parametric model to this partial dependence. How can I quantify the uncertainty in the parameters of that parametric model, given that it is fitted to the RF's estimated partial dependence rather than to the raw data?

To flesh this out with an example: suppose that plant height is influenced by light, rainfall and pH, all in a nonlinear manner. I fit an RF (or other machine learning model) with height being predicted by all three. If I want to understand how light alone affects height, I can estimate its partial effect (or equivalently, the partial dependence of height on light) from the fitted RF.
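As far as I understand it, the partial dependence at a given light value is obtained by setting light to that value for every observation, predicting with the fitted model, and averaging those predictions over the observed rainfall and pH values. A minimal, model-agnostic sketch of that calculation is below. The names `partial_dependence` and `predict` are hypothetical, and the toy `predict` function is only a stand-in for a fitted RF's prediction method.

```python
from statistics import fmean

def partial_dependence(predict, rows, feature, grid):
    """Average prediction at each grid value of `feature`.

    predict: callable mapping a dict of predictors to a predicted response
    rows:    list of dicts holding the observed predictor values
    """
    curve = []
    for value in grid:
        # Overwrite the focal feature in every row; keep the others as observed
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(fmean(preds))
    return curve

# Toy stand-in for a fitted model (hypothetical, NOT a random forest)
def predict(row):
    return 2.0 * row["light"] + row["rainfall"] * row["pH"]

rows = [
    {"light": 1.0, "rainfall": 10.0, "pH": 6.0},
    {"light": 3.0, "rainfall": 20.0, "pH": 7.0},
]
pd_curve = partial_dependence(predict, rows, "light", grid=[0.0, 1.0, 2.0])
```

Note that the resulting curve includes the average contribution of the other predictors, so any equation fitted to it probably needs an offset term.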

Suppose that I know what this shape should look like and have an equation to describe it. I would like to fit this equation to the partial dependence estimated from the RF. Loosely speaking, I am trying to 'filter' the RF's estimated partial dependence through the equation, which represents our prior understanding based on many earlier studies. I am using an RF instead of a fully parametric model because I do not know precisely how the other variables (rainfall, pH) affect height.

How can I go about estimating these parameter values in a way that captures the uncertainty in (i) the data and (ii) the fitted random forest?

I encountered a version of this idea in a post on Andrew Gelman's stats blog. According to Gelman, who focusses on predictions from the whole model rather than on partial effects/dependencies, the idea has not really been developed.

I suspect that there is a bootstrap-based solution to this, but I am unsure. There may be simpler solutions that work more directly from the fitted random forest, but I am unaware of them because of an incomplete understanding of how partial effects are calculated. I'd appreciate any suggestions.
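To make the bootstrap idea concrete, here is a rough sketch of what I have in mind, not a validated method. It resamples rows, refits the RF with a fresh seed each time, recomputes the partial dependence, fits the equation to it, and takes percentile intervals over the resulting parameter estimates. The saturating form `c + a*x/(b+x)`, the simulated data, and all function names are assumptions made purely for illustration. I use scikit-learn's `RandomForestRegressor` and SciPy's `curve_fit`.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.ensemble import RandomForestRegressor

def saturating(x, a, b, c):
    # Hypothetical equation for the light effect; c absorbs the mean
    # contribution of the other predictors to the partial dependence
    return c + a * x / (b + x)

def pd_curve(model, X, col, grid):
    # Fix column `col` at each grid value and average the predictions
    out = np.empty(len(grid))
    for i, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, col] = v
        out[i] = model.predict(Xv).mean()
    return out

def bootstrap_pd_params(X, y, col, grid, curve, p0, n_boot=200,
                        n_estimators=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    params = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        rf = RandomForestRegressor(
            n_estimators=n_estimators,
            random_state=int(rng.integers(2**31 - 1)),  # new RF seed per refit
        )
        rf.fit(X[idx], y[idx])
        pd_vals = pd_curve(rf, X[idx], col, grid)
        try:
            # Keep b positive so the denominator cannot hit zero on the grid
            popt, _ = curve_fit(curve, grid, pd_vals, p0=p0,
                                bounds=([-np.inf, 1e-6, -np.inf], np.inf))
        except RuntimeError:
            continue  # fit did not converge; such failures should be reported
        params.append(popt)
    return np.array(params)

# Simulated data (assumed, for illustration only): columns are light, rainfall, pH
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0.5, 10, 200),
                     rng.uniform(0, 10, 200),
                     rng.uniform(4, 8, 200)])
y = (5 * X[:, 0] / (2 + X[:, 0]) + 0.5 * np.sin(X[:, 1])
     + 0.3 * (X[:, 2] - 6) ** 2 + rng.normal(0, 0.3, 200))

grid = np.linspace(1, 9, 12)
# n_boot is kept tiny here for speed; a real analysis would need far more
params = bootstrap_pd_params(X, y, 0, grid, saturating, p0=[5.0, 2.0, 0.0],
                             n_boot=20, n_estimators=50)
lo, hi = np.percentile(params, [2.5, 97.5], axis=0)  # percentile intervals
```

I discard the covariance matrix returned by `curve_fit` because the grid points of a partial dependence curve are not independent observations, so the spread across bootstrap replicates is the only uncertainty estimate used here. Whether these intervals have sensible coverage is exactly what I am unsure about.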

This is a quite complex mixture of three separate things. The first is how to deal with the "partial" input variance. If it were a function, you could just integrate over the other variables -- but this is probably not possible with your forest description. The other two are also complicated, the uncertainties in data and the description. Are you trying to catch systematic and statistical uncertainties? The statistical uncertainties from the data you could estimate from the input distributions. IDK about the forest uncertainties -- this is method dependent. — cherub, Jun 22 '18 at 12:08
@cherub Thanks for the thoughts. Yes, I'm trying to understand how to quantify the overall uncertainty in the partial effects that arises from both sources together i.e. it's not important to me to separate the two. — mkt, Jun 25 '18 at 11:21
I agree that the questions are interesting, but a complete answer requires more time than I currently have. The part about the marginalization is definitely a different question than the one about the propagation of uncertainty. — cherub, Jun 26 '18 at 13:43
I think a core issue is that the PDPs themselves, especially at the edges of their support, are rather bad approximations. So the bootstrap might still sample something that is itself biased. — usεr11852, Feb 07 '20 at 17:01

0 Answers