When fitting a GLM (generalized linear model), one usually wants an estimate of the variance-covariance matrix of the fitted coefficients, which has a closed-form expression with a peculiar structure.
If we denote the features/covariates as $\mathbf{X}_{m \times n}$ and the obtained coefficients as $\mathbf{\beta}_{n}$, then we get: $$ \text{Cov}(\mathbf{\beta}) = ( \mathbf{X}^T \mathbf{D} \mathbf{X} )^{-1} $$
Where $\mathbf{D}_{m \times m}$ is a diagonal matrix of per-observation weights implied by the model's variance assumptions. For example, for a plain Gaussian model we assume that the residuals have a constant variance, so $\mathbf{D}$ is the identity matrix scaled by a constant and the expression reduces to the familiar $\hat{\sigma}^2 (\mathbf{X}^T \mathbf{X})^{-1}$.
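For concreteness, here's a small numpy sketch of the Gaussian case (simulated data, all variable names my own), where $\mathbf{D}$ is a scaled identity and the expression collapses to the usual OLS covariance:

```python
import numpy as np

# Simulated linear-regression data
rng = np.random.default_rng(123)
m, n = 500, 4
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=1.5, size=m)

# Fit by least squares and estimate the residual variance
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (m - n)

# Constant weights: D = I / sigma^2, so (X' D X)^{-1} = sigma^2 (X' X)^{-1}
D = np.full(m, 1.0 / sigma2_hat)
cov_beta = np.linalg.inv(X.T @ (D[:, None] * X))
```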
For binary-outcome models such as logistic and probit regressions, since the prediction is a probability, each entry in $\mathbf{D}$ can be approximated by: $$ D_{ii} = P_i (1 - P_i) $$
Where $P_i = f(\mathbf{x}_i \mathbf{\beta})$, with $f(\cdot)$ being the inverse link (mean) function. In the case of a binary probit model, the inverse link is the CDF (cumulative distribution function) of the standard normal distribution, thus: $$ D_{ii} = P_i (1 - P_i) = \mathbf{\Phi}(\mathbf{x}_i \mathbf{\beta}) (1 - \mathbf{\Phi}(\mathbf{x}_i \mathbf{\beta})) $$
And we can thus easily obtain the variance-covariance matrix of $\mathbf{\beta}$ for a binary probit model.
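In code, assuming $\mathbf{\beta}$ has already been fitted somehow, the calculation I have in mind is just this (a sketch, using the $P_i (1 - P_i)$ weights exactly as written above):

```python
import numpy as np
from scipy.stats import norm

def binary_probit_cov(X, beta):
    # P_i = Phi(x_i beta), the predicted probability for each observation
    P = norm.cdf(X @ beta)
    # D_ii = P_i (1 - P_i), per the approximation described above
    # (not the exact probit Fisher weights, which also involve the normal density)
    D = P * (1.0 - P)
    # (X' D X)^{-1}
    return np.linalg.inv(X.T @ (D[:, None] * X))
```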
Suppose now that we have a multinomial probit model, in which there are probabilities for more than one class and a more complex link function. In the multinomial probit setting, we have coefficients $\{\mathbf{\beta}_k\}$ for more than one class (let's denote by $\mathbf{B}$ the matrix with those coefficient vectors as its columns), and the inverse link now gives, for each class, the probability that its latent score is the largest entry of a vector drawn from a multivariate normal distribution whose mean is the product of covariates and coefficients. In latent-variable form, we have: $$ \mathbf{Y}_{m \times k} \sim N(\mathbf{X}_{m \times n} \mathbf{B}_{n \times k}, \mathbf{\Sigma}_{k \times k}) $$
and for a given observation $i$ with latent values $y_{i,1}, \dots, y_{i,k}$ for each of the $k$ classes, we have: $$ P_{i,k} = P(y_{i,k} > y_{i,1} \:\wedge\: y_{i,k} > y_{i,2} \:\wedge\: \:...\: \:\wedge\: y_{i,k} > y_{i,k-1} \:\wedge\: y_{i,k} > y_{i,k+1} \:\wedge\: \:...) $$
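(Just to make that link concrete: if $\mathbf{B}$ and $\mathbf{\Sigma}$ were known, these probabilities could be approximated by brute-force simulation, along the lines of the sketch below; the names and the Monte Carlo approach are my own, not how one would do it in practice.)

```python
import numpy as np

def multinomial_probit_probs(X, B, Sigma, n_draws=20000, seed=0):
    """Monte Carlo approximation of P_{i,k}: the probability that class k's
    latent score is the largest, with latent scores ~ N(x_i B, Sigma)."""
    rng = np.random.default_rng(seed)
    means = X @ B                                    # (m, k) latent means
    m, k = means.shape
    noise = rng.multivariate_normal(np.zeros(k), Sigma, size=(n_draws, m))
    latent = means[None, :, :] + noise               # (n_draws, m, k)
    winners = latent.argmax(axis=2)                  # (n_draws, m)
    return np.stack([(winners == j).mean(axis=0) for j in range(k)], axis=1)
```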
Let's leave aside the issue of calculating $P_{i,k}$ and the identifiability and estimation of $\mathbf{B}$ and $\mathbf{\Sigma}$, and assume we have already calculated $\mathbf{P}_{m \times k}$ for all classes for each observation.
From a logical point of view, in the binary case, changing the value of one coefficient affects the expected value of every other coefficient through some pairwise relationship between them. But in the multinomial probit model there is an additional effect: changing a coefficient in one class also changes the expected values of the coefficients of the other classes, which in turn change the expected values of the remaining coefficients in the original class, so that many relationships between pairs of coefficients are involved at once, not just those within a given class.
Now the question: if we take a given class $k$ with coefficients $\mathbf{\beta}_k$ (or coefficients of differences with respect to a base class), would it be correct to conclude that the variance-covariance matrix of that subset of coefficients is given by the same $(\mathbf{X}^T \mathbf{D} \mathbf{X})^{-1}$ structure, with $\mathbf{D}$ calculated as $D_{ii} = P_{i,k} (1 - P_{i,k})$ from the probabilities for that class? Or is that now incorrect due to the correlations with the coefficients of the other classes?
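For reference, this is the per-class construction I am asking about (a sketch only, assuming the column of $\mathbf{P}$ for class $k$ has already been computed):

```python
import numpy as np

def naive_per_class_cov(X, P_k):
    # Treat class k in isolation: D_ii = P_{i,k} (1 - P_{i,k})
    D = P_k * (1.0 - P_k)
    # The question: is (X' D X)^{-1} still a valid covariance for beta_k,
    # given the cross-class dependence described above?
    return np.linalg.inv(X.T @ (D[:, None] * X))
```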