
I am struggling to understand how to define the prior distribution for a multinomial regression.

  • In what units should the prior be expressed, given that the response doesn't really have units (just categories)? Should it be a probability, or perhaps a log-odds?
  • Does the dimension of the prior depend on the number of levels in the categorical response variable?
  • What is the meaning of the variance-covariance matrix for a categorical response variable?

Here is a subset of my data (where the variable names and outcomes have been renamed). A working MCMCglmm example on these data would be awesome!

df=read.table(text="y     x1       x2
    1  yellow 106.00 6.190476
    2  yellow 120.00 5.254762
    3  yellow  57.00 6.202381
    4  yellow 115.33 5.652381
    5  yellow 175.00 6.154762
    6  yellow  74.00 8.285714
    7  yellow 104.67 3.766667
    8  yellow  95.50 7.976190
    9  yellow 108.00 8.792857
    10 yellow 121.33 7.935714
    11 yellow  66.67 6.969048
    12 yellow  30.00 7.333333
    13 yellow  45.00 6.811905
    14 yellow  70.00 7.550000
    15 yellow  48.00 7.316667
    16 yellow 211.00 4.650000
    17 yellow  69.00 8.369048
    18 yellow 110.50 6.621429
    19 yellow 203.00 6.095238
    20 yellow  75.33 8.211905
    21 yellow 207.33 6.211905
    22 yellow  54.00 7.961905
    23 yellow  74.00 7.019048
    24 yellow 113.00 4.221429
    25 yellow  23.00 7.942857
    26 yellow  80.00 7.511905
    27 yellow 257.00 7.878571
    28 yellow 211.00 7.754762
    29 yellow  99.00 8.016667
    30 yellow 120.00 7.728571
    31 yellow 222.50 5.840476
    32 yellow  44.00 4.209524
    33 yellow  63.00 6.614286
    34 yellow  57.00 8.669048
    35 yellow 223.33 7.033333
    36 yellow 128.00 6.754762
    37 yellow 128.00 5.561905
    38 yellow 121.00 7.471429
    39 yellow  70.00 7.445238
    40 yellow  85.67 5.261905
    41 yellow 113.33 8.509524
    42 yellow  82.00 6.697619
    43    red 207.33 4.180952
    44    red 167.67 5.302381
    45    red 366.50 7.102381
    46    red 230.00 4.942857
    47    red 201.00 5.754762
    48    red 226.00 9.076190
    49    red 193.33 7.066667
    50    red 170.00 7.314286
    51    red 361.33 7.502381
    52   blue 154.00 4.342857
    53    red 199.33 6.361905
    54   blue  97.00 7.750000
    55   blue  82.33 6.209524
    56   blue  55.67 5.321429
    57   blue  47.50 5.911905
    58   blue  15.67 7.185714
    59   blue  96.50 6.452381
    60   blue 202.33 8.576190
    61   blue 157.00 6.669048
    62   blue 117.33 5.828571
    63   blue 105.67 8.485714
    64   blue 108.67 5.714286
    65   blue 296.67 5.852381
    66   blue 206.50 6.826190
    67   blue  88.50 6.178571
    68   blue 163.00 7.833333
    69   blue 151.50 8.983333")

And here is an MCMCglmm call for which the default priors lead to an error message:

library(MCMCglmm)  # the call below needs the MCMCglmm package

set.seed(12)
m <- MCMCglmm(y ~ -1 + trait:x1 + trait:x2, rcov = ~ us(trait):units,
              data = df, family = "categorical", verbose = TRUE,
              burnin = 8000, nitt = 40000, thin = 50)

ill-conditioned G/R structure (CN = 24007848728601288.000000):
use proper priors if you haven't or rescale data if you have
Remi.b
  • Normally (pun intended) I'm inclined to start with Gaussian priors for all parameters, including categorical. This just means that your uncertainty around the initial guess is governed by a normal distribution with a specific variance (indicating the degree of belief). Unless you have specific indicators that the distribution of values for that model parameter follow a specific distribution (long tailed, one-sided, etc.) that's a good place to start. – T3am5hark Oct 24 '16 at 17:34
  • Note that use of Bayesian priors is a form of regularization. If your prior covariance matrix is well-conditioned, then you won't run into conditioning problems in the solver (which tend to happen when the Hessian of the log-posterior distribution is ill-conditioned). – T3am5hark Oct 24 '16 at 18:08
  • I appreciate your comments but they are already too advanced for me :) When talking about a prior for a categorical response I don't even understand what unit goes on the x-axis of a prior. I am also confused about whether the size of the variance-covariance matrix should depend on the number of levels in the categorical response variable. I think a worked example on the dummy data I posted would help – Remi.b Oct 24 '16 at 18:57
  • Bayesian priors reflect your belief, a priori, about the parameters you're estimating (before you've looked at the specific data). That belief is conveyed as a distribution on the parameter values. That is true for both categorical and continuous variables. The x-axis reflects the value of the parameter to be estimated (associated with the categorical variable) and the y-axis is the degree of belief (probability) – T3am5hark Oct 24 '16 at 19:42
  • More simply: your categorical variable in the model is a regressor or explanatory variable. The associated estimated parameter (the coefficient) is the thing that gets the prior. The prior distribution reflects your beliefs about what the parameter should look like (the x-axis is the value of the parameter, the y-axis is the probability of the parameter taking on that value). – T3am5hark Oct 24 '16 at 19:46
  • @T3am5hark So, is the x-axis some kind of k-way log-odds when the response is categorical (where k is the number of levels in the categorical variable)? Or maybe just the probability of obtaining a given outcome? And should the variance-covariance matrix of the prior for the residual variance then have the same dimension as the number of levels in the categorical response variable, is that right? – Remi.b Oct 24 '16 at 19:49
  • @T3am5hark Actually, I do not understand your first sentence. I would have thought of the categorical variable (`color`) as being the response in my model. What is the "associated estimated parameter"? Are you referring to `x1` and `x2`? – Remi.b Oct 24 '16 at 20:36
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/47346/discussion-between-t3am5hark-and-remi-b). – T3am5hark Oct 24 '16 at 20:46

1 Answer


Good question! I've been trying to get my head around the same thing for quite some time, so I'll just share my experience. I'm still a novice in this area, so I hope I haven't made any mistakes in notation.

Always, and I mean always, standardise your continuous variables to have mean = 0 and sd = 1 (or divide by two standard deviations, as Gelman suggests). Look into some of Andrew Gelman's blog posts and articles about this; just google "Andrew Gelman standardization" and you will find a lot of good papers and posts.
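
For these data that could look like the following (a minimal sketch in base R; the new column names are just illustrative):

# Centre and scale the two continuous predictors to mean 0, sd 1.
df$x1s <- as.numeric(scale(df$x1))
df$x2s <- as.numeric(scale(df$x2))
# Gelman's variant: divide by two standard deviations instead, so that
# continuous and binary inputs end up on a comparable scale.
df$x1g <- (df$x1 - mean(df$x1)) / (2 * sd(df$x1))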

Think of your coefficients as log odds ratios (relative to a reference category; explanation follows). For an in-depth discussion, see this answer. Andrew Gelman also has some recommendations on priors, such as the Cauchy or the Normal(0, 1). His papers are about logistic regression, but I find these recommendations also extend to multi-outcome regression.
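
For intuition, you can plot those two recommended priors on the log-odds scale (the Cauchy scale of 2.5 is the one suggested in Gelman et al.'s paper on weakly informative priors):

# Compare the two priors mentioned above on the log-odds scale.
curve(dnorm(x, 0, 1), from = -10, to = 10, ylab = "prior density")
curve(dcauchy(x, 0, 2.5), add = TRUE, lty = 2)
legend("topright", legend = c("Normal(0, 1)", "Cauchy(0, 2.5)"), lty = 1:2)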

The dimension of the prior indeed depends on the number of outcomes. If you have three outcomes, you have these three linear predictors:

$\eta_1 = \beta_{0,1} + \sum_i \beta_{i,1} x_i$ and $\eta_2 = \beta_{0,2} + \sum_i \beta_{i,2} x_i$ and $\eta_3 = \beta_{0,3} + \sum_i \beta_{i,3} x_i$

The second subscript denotes the outcome. Normally you would put a prior on each of these coefficients. I'll illustrate with the intercept, $\beta_{0,k}$,

$\beta_{0,1} \sim \mathcal{N}(0,1)$ and $\beta_{0,2} \sim \mathcal{N}(0,1)$ and $\beta_{0,3} \sim \mathcal{N}(0,1)$

Just to complete the math, the probability of outcome $k$ is,

$p_k = \frac{\exp(\eta_k)}{\sum_{k'} \exp(\eta_{k'})}$ and the likelihood is $y \sim \text{Multinomial}(p)$

Please note that you would often force one of the outcomes to be the "reference" outcome due to identifiability issues; Wikipedia has a detailed description of why. Practically, this means you fix the coefficients of the reference category at 0, so its numerator in the expression above is $\exp(0) = 1$.
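
To make the reference-category trick concrete, here is a small numeric sketch in R (the coefficient values are made up purely for illustration):

# Three outcomes, two predictors; outcome 1 is the reference,
# so all of its coefficients are fixed at 0.
beta0 <- c(0, -0.5, 1.0)   # intercepts beta_{0,k}
beta1 <- c(0,  0.8, -0.3)  # slopes for x1
beta2 <- c(0, -0.2,  0.4)  # slopes for x2

x1 <- 0.5                  # one (standardised) observation
x2 <- -1.2
eta <- beta0 + beta1 * x1 + beta2 * x2  # linear predictors eta_k
p <- exp(eta) / sum(exp(eta))           # softmax; exp(0) = 1 for the reference
p                                       # the three probabilities sum to 1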

I have never used MCMCglmm so I cannot answer anything specific to that, I'm afraid.
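
That said, the package's course notes (the CourseNotes vignette) describe a standard recipe for categorical models: fix the residual covariance, which is not identified for this family, and put a proper normal prior on the fixed effects. Below is a rough, untested sketch of what that might look like for the data above; the IJ residual structure and the $1 + \pi^2/3$ scaling are borrowed from logit-link examples and should be treated as assumptions, not a vetted analysis.

library(MCMCglmm)

df$y  <- factor(df$y)                   # 3 categories -> 2 latent "traits"
df$x1 <- as.numeric(scale(df$x1))       # standardise, as recommended above
df$x2 <- as.numeric(scale(df$x2))

J  <- length(levels(df$y))              # J = 3 outcome categories
IJ <- (1/J) * (diag(J - 1) + matrix(1, J - 1, J - 1))

prior <- list(
  R = list(V = IJ, fix = 1),            # fix the non-identified residual covariance
  B = list(mu = rep(0, 4),              # 2 traits x 2 predictors = 4 coefficients
           V = diag(4) * (1 + pi^2/3))  # proper, fairly vague prior on the latent scale
)

set.seed(12)
m <- MCMCglmm(y ~ -1 + trait:x1 + trait:x2, rcov = ~ us(trait):units,
              data = df, family = "categorical", prior = prior,
              burnin = 8000, nitt = 40000, thin = 50, verbose = FALSE)
summary(m)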

demodw