I have recently begun to read about Bayesian statistics and I am playing around with the R2WinBUGS package. I'm trying to fit a logistic regression to the spam data (available on the webpage of The Elements of Statistical Learning) using R and WinBUGS. My approach was to first divide the data into an 80% training set and a 20% test set. I can fit the model to the 80% set, but I don't know how to write WinBUGS code that predicts on new observations (say, the 20% test set), and I wonder whether this approach to assessing model/classification accuracy makes sense in a Bayesian setting.
1 Answer
Predicting with Bayesian models, and especially with BUGS, is very easy. Just set the response in the test set to NA. You also need to specify initial values for the response; set those to NA for the training set and to a reasonable value for the test data.
BUGS will then sample from the posterior predictive distribution for the response values you set to NA. Note that these distributions incorporate the uncertainty about the regression coefficients. You can take the median of these samples if you want point estimates, but the standard deviation of the samples is also quite informative.
Here is a rather minimal example:
model
{
  for (i in 1:N)
  {
    y[i] ~ dnorm(mu, 1)
  }
  mu ~ dunif(-1000, 1000)
}

#data
list(N = 10, y = c(-1, 0, 1, -1, 0.5, -0.5, 2, -1.5, NA, NA))

#inits
list(mu = 0, y = c(NA, NA, NA, NA, NA, NA, NA, NA, 0, 0))
You can then obtain posterior predictive distributions for $y_9$ and $y_{10}$. This example has no predictors, but the approach also works with them. Note that you would not set the predictors to NA; they remain at their observed values.
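To see what BUGS is doing here, the posterior predictive draws for the NA responses can be sketched by hand. This is a Monte Carlo illustration in Python (not R2WinBUGS code), under the simplifying assumption that the dunif(-1000, 1000) prior is effectively flat, so the posterior for mu given unit-variance data is approximately Normal(ȳ, 1/n):

```python
import random
import statistics

# Observed training responses from the BUGS example above
y_train = [-1, 0, 1, -1, 0.5, -0.5, 2, -1.5]
n = len(y_train)
ybar = statistics.mean(y_train)

random.seed(42)

# Assumption: with the essentially flat dunif(-1000, 1000) prior and known
# unit variance, the posterior for mu is approximately Normal(ybar, 1/n).
# Sample mu from its posterior, then y_new | mu ~ Normal(mu, 1) -- this
# two-step draw is what BUGS does for the responses set to NA.
draws = []
for _ in range(20000):
    mu = random.gauss(ybar, (1 / n) ** 0.5)  # posterior draw of mu
    y_new = random.gauss(mu, 1.0)            # posterior predictive draw
    draws.append(y_new)

# The predictive spread exceeds the data noise (sd 1) because it also
# carries the uncertainty about mu: sd is roughly sqrt(1 + 1/n).
print(statistics.mean(draws))
print(statistics.stdev(draws))
```

The extra width of the predictive distribution relative to the residual standard deviation is exactly the coefficient uncertainty mentioned above.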
Edit after comment:
You can also do this differently and separate the test and training data in the model above. That would look like this:
model
{
  for (i in 1:N.train)
  {
    y.train[i] ~ dnorm(mu, 1)
  }
  for (i in 1:N.test)
  {
    y.test[i] ~ dnorm(mu, 1)
  }
  mu ~ dunif(-1000, 1000)
}
#data
list(N.train = 8, N.test = 2, y.train = c(-1, 0, 1, -1, 0.5, -0.5, 2, -1.5))
#inits
list(mu = 0, y.test = c(0, 0))
This might look somewhat easier, but note that you will also need to split any predictors in the model (my example has none); you might then have vectors like sex.train and sex.test. Personally, I prefer the first way because it is more terse.
And yes, I think this is a reasonable starting point. While some kinds of overfitting show up in a Bayesian model as very large standard deviations for the coefficients, you still impose a model structure that might not fit the data well, which can also make your predictions poor. You should also consider (for example) full cross-validation, where you repeat this procedure with different splits of the original data.
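The split-refit-predict loop of cross-validation can be sketched as follows. This is an illustrative Python helper (hypothetical, not part of R2WinBUGS) that only builds the index splits; for each split you would refit the BUGS model on the training indices and set the test responses to NA, as described above:

```python
import random

def kfold_indices(n, k, seed=0):
    """Return k (train, test) index pairs; each observation is
    held out exactly once, unlike a single 80/20 split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for j in idx if j not in set(test)]
        splits.append((train, test))
    return splits

# For 10 observations and 5 folds: 5 splits, each holding out 2 points.
splits = kfold_indices(10, 5)
print(len(splits))  # 5
```

Because every observation serves as test data once, the pooled predictive summaries use the data more efficiently than a single hold-out split.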

Just beware of data splitting in general, as it is very inefficient. – Frank Harrell Feb 11 '13 at 14:56
@FrankHarrell I agree. But fitting on one part of the data and predicting on the other is obviously also part of better methods like cross-validation. I will add a note to that effect. I sort of focused on the technical aspect at first. – Erik Feb 11 '13 at 15:12
Why is data splitting very inefficient? And is there any way to do this separately, as in, load two separate data sets (the "train" and "test" sets)? #data list(N=8, y = c(-1,0,1,-1,0.5,-0.5,2,-1.5)) #data list(N=2, y = c(NA, NA)) Or something like that? – JEquihua Feb 11 '13 at 16:16
Simple data splitting is inefficient if you want to estimate the performance of your model; CV and bootstrapping usually yield better estimates. This is a general fact, not specific to Bayesian modelling and/or WinBUGS. When cross-validating you would still do what you propose, but with different splits of the data. I will add a point to address your second question in my answer. – Erik Feb 11 '13 at 16:21
"Inefficient" is in the statistical sense (i.e., low precision/high standard error), not in the computational sense. 100 repeats of 10-fold cross-validation, like the bootstrap, can be quite efficient. – Frank Harrell Feb 12 '13 at 04:11