
I am currently experimenting with PyMC and working through a simple example to learn how things work (I am also a Python beginner, but an experienced machine learner).

I have set myself the following very simple model selection task:

Imagine we have a dataset that has been entirely generated either from model 0 or model 1 (not a mixture).

  • Model 0 is a normal distribution with mean m0 and standard deviation s0.

  • Likewise, model 1 is also a normal distribution with mean m1 and standard deviation s1.

  • We assume that we know all parameters m0,s0,m1,s1.

  • We postulate a hidden binary random variable V which indicates the model that really generated the dataset.

What we would like to do is infer the posterior of V.
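Since all the parameters are known, the posterior of V is actually available in closed form by Bayes' rule, which makes a handy sanity check for whatever the sampler returns. A minimal sketch in plain NumPy (the function name `posterior_v` is mine, just for illustration):

```python
import numpy as np

def posterior_v(data, mu0, s0, mu1, s1, prior0=0.5):
    """Exact posterior P(V = 0 | data) for the two-model toy problem."""
    def loglik(x, mu, s):
        # sum of Gaussian log-densities over the dataset
        return np.sum(-0.5 * np.log(2 * np.pi * s**2)
                      - (x - mu)**2 / (2 * s**2))
    l0 = np.log(prior0) + loglik(data, mu0, s0)
    l1 = np.log(1.0 - prior0) + loglik(data, mu1, s1)
    m = max(l0, l1)  # log-sum-exp trick for numerical stability
    return np.exp(l0 - m) / (np.exp(l0 - m) + np.exp(l1 - m))

data = np.random.randn(100) + 5.0              # generated from model 0
print(posterior_v(data, 5.0, 1.0, -5.0, 1.0))  # essentially 1: models far apart
```

With the nearly identical models discussed in the comments below (means 0.05 and -0.05, a single data point), the same function returns a value close to 0.5, which is what a correct sampler should reproduce.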

I have made an attempt in the code below, but unfortunately after trying for a couple of days, I capitulate and throw myself at your feet for assistance.

The example is of course very simple and may appear contrived. What I would like to do in the future is use PyMC for model selection, so correctly handling a hidden indicator variable like V is important to me. Unfortunately, my code keeps giving weird results for the posterior of V.

I hope the code below makes sense. I thank you all for your time.

from pymc import *
import numpy as np
from pylab import hist, figure, show


#-----------------------------------------------
# Generate some data
#-----------------------------------------------

# Define means for gaussians
mu0 =  5.0
mu1 = -5.0

# Define standard deviations for gaussians 
s0 = 1.0
s1 = 1.0


# This variable chooses from which model we generate the observed data.
# It can be either 0 or 1
true_model = 0

# number of data items
numData = 100

if true_model==0:
    # Generate data from first model
    data = np.random.randn(numData,1)*s0 + mu0
elif true_model==1:
    # Generate data from second model
    data = np.random.randn(numData,1)*s1 + mu1


#-----------------------------------------------
# Define variables
#-----------------------------------------------
PR = Dirichlet("priorOnModels", theta=(0.5,0.5)) # both models are equally likely
V  = Categorical('V', p=PR)


#-----------------------------------------------
# Define model log-likelihood
#-----------------------------------------------
@potential
def mymodel_loglikel(sw=V):
    # Note: normal_like takes the precision tau = 1/s**2, not the s.d.
    if sw == 0:
        loglikel = distributions.normal_like(data, mu0, 1.0/s0**2)
    elif sw == 1:
        loglikel = distributions.normal_like(data, mu1, 1.0/s1**2)
    return loglikel


#-----------------------------------------------
# Run sampler
#-----------------------------------------------
simplemodel = Model([mymodel_loglikel, PR, V])
mc = MCMC(simplemodel)
mc.sample(iter=12000,burn=2000)

figure()
hist(data)
Matplot.plot(mc)
print("expectation of V variable is %.3f" % np.mean(mc.trace('V')[:]))

show()

EDIT 2: Updated after Tom's suggestions

In response to a comment below: the weird behaviour I get with V is that its posterior mean is always 0.0. Judging from the trace, V does not seem to be properly sampled. I suspect a minor programming error somewhere, but I can't pinpoint it.

I attach below a figure of the trace plots of V.



1 Answer


When defining a potential, the function should return the log-probability. So instead of

return np.exp(loglikel)

you want

return loglikel
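One way to see why returning `np.exp(loglikel)` breaks the sampler: with 100 data points, the likelihood is astronomically small, so after exponentiation both models collapse to (nearly) zero and Metropolis cannot tell the two states of V apart. A quick check in plain NumPy (not PyMC; the seed is just for reproducibility):

```python
import numpy as np

np.random.seed(0)
data = np.random.randn(100) + 5.0   # 100 points from N(5, 1)

def loglik(x, mu, s):
    # sum of Gaussian log-densities
    return np.sum(-0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2))

right = loglik(data, 5.0, 1.0)      # roughly -140
wrong = loglik(data, -5.0, 1.0)     # roughly -5100

# On the log scale the two models are trivially distinguishable...
print(right - wrong)                # a huge positive number

# ...but after exp() both values are indistinguishable from zero,
# so a sampler handed them as "log-probabilities" sees two nearly
# identical states and V never moves:
print(np.exp(right))                # tiny (far below 1e-50)
print(np.exp(wrong))                # exactly 0.0: underflow
```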

Another thing to watch out for is the expression `1/s0**2`. Make sure that `s0` is a float in this expression, or change it to `1.0/s0**2`.
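To illustrate the trap (the value `s0 = 2` below is just for demonstration): under Python 2, `1/s0**2` with an integer `s0` is integer division, and the result would be silently passed to `normal_like` as a precision of zero.

```python
s0 = 2              # imagine s0 had been defined as an int
# Python 2: 1/s0**2 == 0     (integer division -> precision of 0)
# Python 3: 1/s0**2 == 0.25  (true division)
# Writing the numerator as a float literal is safe in both:
tau = 1.0 / s0**2
print(tau)          # 0.25
```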

PyMC also has a bug in its choice of proposals for binary variables (documented in section 5.7.3), which I first noticed in another answer. The fix is to add the line:

mc.use_step_method(DiscreteMetropolis, V, proposal_distribution='Prior')
Tom Minka
  • Thanks for taking a look at the code, Tom. I will try out your suggestions and get back to you. Thanks, Nikos – ngiann Oct 30 '14 at 18:50
  • I tried out both suggestions and I updated the code above in the original post. However, I still get weird behaviour for variable V. I have accordingly updated the figure in the original post above. The weird thing is that variable V seems to be somehow stuck? In the trace plot above, it seems to always receive the value 0 during sampling? I imagine that a minor programming error is at fault somewhere here... Thanks, N. – ngiann Oct 31 '14 at 09:40
  • Try using a different step method for `V`, e.g. `mc.use_step_method(DiscreteMetropolis, V)` before `mc.sample` – Abraham D Flaxman Oct 31 '14 at 14:48
  • Thanks for the suggestion, but unfortunately it didn't change a thing.... – ngiann Oct 31 '14 at 15:06
  • I don't understand your updated question. When `true_model = 0`, it estimates `V=0`. When `true_model = 1`, it estimates `V=1`. Isn't that what you want? – Tom Minka Oct 31 '14 at 15:16
  • Indeed, I might not have been entirely clear. What I would like to get is the uncertainty over which model is the true model, i.e. the model that generated the data. If the two models 0 and 1 are similar and we only have a few data points, then the posterior distribution of V should be fairly wide. If the two models 0 and 1 are dissimilar (e.g. their means differ significantly), then it should be fairly obvious which is the true model and the posterior of V should be narrow. Here, unfortunately, I keep getting the same uniform posterior with the mean always stuck close to 0 or 1. – ngiann Oct 31 '14 at 15:30
  • The code you have posted is for the 'dissimilar' case where it works fine. You need to post what you consider to be the 'similar' case and why you think the answer from PyMC is wrong. At this point, it's starting to feel like a different question than your original one, and should be asked separately. – Tom Minka Oct 31 '14 at 15:34
  • Sorry about this. If in the code I use mu0=0.05 and mu1=-0.05 and set numData = 1, I still get an expected value for V that is too close to 1. I would expect it to be closer to 0.5. Hence, I suspect there is something wrong with the code. – ngiann Oct 31 '14 at 15:40
  • Yes, there is another problem. I've updated my answer. – Tom Minka Oct 31 '14 at 15:51
  • Thanks Tom, you just saved my weekend;) Thank you all for your time, Nikos – ngiann Oct 31 '14 at 15:55