Predicting Probabilities of non-binary outcomes with multivariate adaptive regression splines (MARS) in R

Question

I have three non-binary, discrete variables X0.1, X0.12 and X0.15. I want to model X0.1 as a MARS model of X0.12 and X0.15, then use the model to answer queries of the form P(X0.1=x | X0.12=y, X0.15=z). This is what I have tried so far:

> mars1 <- earth (X0.1 ~ X0.15 + X0.12, data=d)
> mars1$coefficients
                  X0.1
(Intercept)  1.2392880
h(1-X0.12)  -0.8291468
h(X0.15-1)  -0.3891442
> predict(mars1, data.frame(X0.15=0, X0.12=1), type="response")
         X0.1
[1,] 1.239288

This is clearly not a probability. I have also tried converting my variables into factors.

> a <- factor(d$X0.1)
> b <- factor(d$X0.12)
> c <- factor(d$X0.15)
> mars1 <- earth (a~b+c , glm=list(family=poisson), data=d)
> mars1$coefficients
                     0           1          2
(Intercept)  0.7890499 0.013295241  0.1976548
b1          -0.4273462 0.001813694  0.4255325
b2          -0.3652277 0.091872210  0.2733555
c2           0.1230729 0.105687002 -0.2287599

> predict(mars1, data.frame(b=factor("2"),c=factor("2"), levels=levels(c)), type="response")
             0         1         2
[1,] 0.5348205 0.3321178 0.2452333
[2,] 0.5348205 0.3321178 0.2452333
[3,] 0.5348205 0.3321178 0.2452333

The row sums of the prediction are not 1, so these can't be the probabilities either. What exactly do these numbers mean?

Where am I going wrong? Is MARS the wrong tool for this? I want a multinomial logistic regression setting such that interaction terms are automatically detected (I intend to run this on bigger sets of variables).

score 0 · Accepted Answer · answered Jul 29 '20 at 01:55

First a bit of background. Earth expands your three-level factor response into three indicator columns. (To see this, use trace=1 in the call to earth.)
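As a rough illustration of what "indicator columns" means, the base R sketch below expands a small made-up factor the same way. It uses model.matrix only to show the shape of the result; it is not earth's own code, and the vector y is hypothetical.

```r
# Hypothetical three-level response, standing in for factor(d$X0.1)
y <- factor(c(0, 1, 2, 1, 0, 2))

# One 0/1 indicator column per factor level
ind <- model.matrix(~ 0 + y)
colnames(ind) <- levels(y)
print(ind)

# Each observation belongs to exactly one level, so every row sums to 1
stopifnot(all(rowSums(ind) == 1))
```

The observed indicator rows always sum to 1; the question is whether the fitted values for those columns do too.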

The earth vignette has a similar example and discussion in Chapter 4, Generalized Linear Models (classification models):

   multinom.mod <- earth(pclass~., data=etitanic,
                         glm=list(family=binomial), trace=1)

"The row sums of the prediction are not 1, so these can't be the probabilities either. What exactly do these numbers mean?"

For this multinomial model, earth first builds a standard three-response MARS model. Then, using the basis functions from that model, earth builds three completely independent GLM models (binomial or Poisson, depending on the family argument), one for each indicator column. (There's a for loop in the earth code that calls glm for each column of the response.)

When building these GLM models, earth doesn't know that there is a relationship between the columns; it doesn't know that the predicted row sums should be 1.
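The effect can be reproduced by hand in base R. The sketch below is only an approximation of the procedure described above: the data are simulated, and the raw predictors x1 and x2 stand in for the MARS basis functions that earth would actually pass to glm.

```r
set.seed(42)

# Simulated data (hypothetical): two discrete predictors, three-level response
n  <- 500
x1 <- sample(0:2, n, replace = TRUE)
x2 <- sample(0:2, n, replace = TRUE)
y  <- factor(pmin(rpois(n, lambda = 0.4 + 0.5 * x1), 2))

# One indicator column per response level
ind <- model.matrix(~ 0 + y)

# Fit a separate binomial GLM to each indicator column; the fits share
# the same predictors but know nothing about each other
fits <- lapply(seq_len(ncol(ind)), function(j) {
  glm(ind[, j] ~ x1 + x2, family = binomial)
})

# Column j holds the fitted P(y == level j) from the j-th model alone
p <- sapply(fits, fitted)
colnames(p) <- levels(y)

# Nothing in the fitting step constrains these row sums to equal 1
summary(rowSums(p))
```

Because each model is fitted on its own, the only thing tying the three columns together is the data they were fitted to.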

So each value in the predicted response is indeed an estimated probability, but one estimated solely by the single GLM model for that factor level's indicator column.

"Where am I going wrong? Is MARS the wrong tool for this? I want a multinomial logistic regression setting such that interaction terms are automatically detected (I intend to run this on bigger sets of variables)."

MARS/earth is certainly suitable for this kind of problem. Have a look at the above-mentioned chapter in the earth vignette.
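A query of the form P(class | predictors) could then look something like the sketch below, which reuses the etitanic example from the vignette. Two things here are my own choices rather than anything earth prescribes: degree=2 (earth's argument for the maximum degree of interaction) is set so that two-way interactions can be picked up, and the final rescaling of each row to sum to 1 is an ad hoc post-processing step.

```r
library(earth)   # provides earth() and the etitanic data set

data(etitanic)

# degree = 2 lets earth consider two-way interaction terms;
# glm = list(family = binomial) fits one logistic GLM per class indicator
mod <- earth(pclass ~ ., data = etitanic, degree = 2,
             glm = list(family = binomial))

# Per-class probabilities for a few passengers, one column per class
newdata <- etitanic[1:3, ]
p <- predict(mod, newdata = newdata, type = "response")

# Ad hoc renormalization (an assumption of this sketch, not something
# earth does): rescale each row so the class probabilities sum to 1
p_norm <- p / rowSums(p)

# Most probable class for each row
colnames(p_norm)[max.col(p_norm)]
```

Whether the rescaled values are good enough depends on how the probabilities will be used; if only the most probable class matters, the rescaling does not change the answer.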

Also, think about the underlying type of each variable (on both the lhs and rhs of the formula), and convert your variables to ordered or unordered factors where appropriate before calling a modeling function like earth. This will help the modeling function use each variable correctly.
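A minimal sketch of that conversion, using a made-up data frame d with the column names from the question. Which column should be unordered, ordered, or numeric is purely an assumption here; the right choice depends on what the variables actually measure.

```r
# Hypothetical data frame with the column names from the question
d <- data.frame(
  X0.1  = c(0, 1, 2, 1, 0, 2),
  X0.12 = c(1, 0, 2, 2, 1, 0),
  X0.15 = c(0, 2, 1, 1, 2, 0)
)

# Unordered factor: the levels are just labels (e.g. categories)
d$X0.1 <- factor(d$X0.1)

# Ordered factor: the levels have a natural ranking (assumed here for X0.12)
d$X0.12 <- factor(d$X0.12, ordered = TRUE)

# X0.15 is left numeric, as would suit a genuine count or measurement

str(d)
```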
