
Since I'm a software engineer trying to learn more stats, you'll have to forgive me before I even start: this is serious newbie territory...

I've been learning PyMC and working through some really (really) simple examples. One problem I can't get to work (and can't find any related examples for) is fitting a model to data generated from two normal distributions.

Say I have 1000 values; 500 generated from a Normal(mean=100, stddev=20) and another 500 generated from a Normal(mean=200, stddev=20).

I want to fit a model to them with PyMC, i.e. recover the two means and the single standard deviation. I know it's something along the lines of ...

from pymc import *

mean1 = Uniform('mean1', lower=0.0, upper=200.0)
mean2 = Uniform('mean2', lower=0.0, upper=200.0)
precision = Gamma('precision', alpha=0.1, beta=0.1)

data = read_data_from_file_or_whatever()

@deterministic(plot=False)
def mean(m1=mean1, m2=mean2):
    # but what goes here?

process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

i.e., the generating process is Normal, but mu is one of two values. I just don't know how to represent the "decision" between whether a value comes from m1 or m2.
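For concreteness, this is how I'm generating the data; plain NumPy, nothing PyMC-specific, and the indicator array `z` is just my own bookkeeping for which component each value came from:

```python
import numpy as np

rng = np.random.default_rng(0)

# indicator: which of the two components each of the 1000 values comes from
z = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

means = np.array([100.0, 200.0])
data = rng.normal(loc=means[z], scale=20.0)  # stddev of 20 shared by both

print(data.shape)          # (1000,)
print(data[:500].mean())   # close to 100
print(data[500:].mean())   # close to 200
```

It's exactly that "look up the mean via an indicator" step that I don't know how to express as a PyMC model.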

Perhaps I'm just taking completely the wrong approach to modelling this? Can anyone point me at an example? I can read BUGS and JAGS, so anything is OK really.

Cam.Davidson.Pilon
mat kelcey

2 Answers


Are you absolutely certain that half came from one distribution and the other half from the other? If not, we can model the proportion as a random variable (which is a very Bayesian thing to do).

The following is what I would do; some tips are embedded in the comments.

from pymc import *
import numpy as np

size = 10
p = Uniform("p", 0, 1)  # the fraction that comes from mean1 vs mean2

ber = Bernoulli("ber", p=p, size=size)  # produces 1 with probability p

precision = Gamma('precision', alpha=0.1, beta=0.1)

# better to use Normals than Uniforms (unless you are certain the value
# is truncated at 0 and 200)
mean1 = Normal("mean1", 0, 0.001)
mean2 = Normal("mean2", 0, 0.001)

@deterministic
def mean(ber=ber, mean1=mean1, mean2=mean2):
    return ber * mean1 + (1 - ber) * mean2


# generate some artificial data
v = np.random.randint(0, 2, size)
data = v * (10 + np.random.randn(size)) + (1 - v) * (-10 + np.random.randn(size))


obs = Normal("obs", mean, precision, value=data, observed=True)

model = Model({"p": p, "ber": ber, "precision": precision,
               "mean1": mean1, "mean2": mean2, "obs": obs})
Cam.Davidson.Pilon
    Shameless promotion: I just wrote a blog article about Bayes and pyMC *literally* 1 minute before you posted this, so I invite you to check it out. [The Awesome Power of Bayes - Part 1](http://camdp.com/blogs/awesome-power-ibayesian-methodsi-what-they-didnt-t) – Cam.Davidson.Pilon Dec 27 '12 at 23:23
  • awesome! this approach to the mixing of the two means is exactly what i was trying to get my head around. – mat kelcey Dec 27 '12 at 23:29
  • Not sure I fully understand the true modelling benefit of saying mean1 & mean2 are Normally distributed instead of Uniform (Same goes really for the precision to be honest, I've been using Gamma since "someone else did"). I've got a lot to learn :) – mat kelcey Dec 27 '12 at 23:31
  • Using a Uniform, as in your original example, implies that you know *with absolute certainty* that the mean does not exceed some value. This is somewhat pathological. It is better to use a normal, as it allows all real numbers to be considered. – Cam.Davidson.Pilon Dec 27 '12 at 23:34
  • The choice of gamma has a mathematical reason. The gamma is the *conjugate prior* of the precision, see table [here](http://en.wikipedia.org/wiki/Conjugate_prior#Continuous_likelihood_distributions) – Cam.Davidson.Pilon Dec 27 '12 at 23:36
  • With a fresh github clone of pymc, this example fails for me with "TypeError: No model on context stack, which is needed to use the Normal('x', 0,1) syntax. Add a 'with model:' block". If I remove the "model =" line, and put everything in a "with Model() as model:" block, I get a "TypeError: __init__() got an unexpected keyword argument 'size'" I'd really appreciate some help to get this working. I'm just learning pymc too, and I've been looking for a working example of representing a mixture model but everything I'm finding on the web is failing for me in one way or another. – Alex Coventry Feb 21 '14 at 20:58
  • @AlexCoventry likely this error is a result of using PyMC3. The syntax above is PyMC2 – Cam.Davidson.Pilon Feb 22 '14 at 03:31
  • @Cam.Davidson.Pilon dead link? – Daniel says Reinstate Monica Feb 10 '21 at 17:36

A couple of points, related to the discussion above:

  1. The choice of diffuse normal vs. uniform is pretty academic unless (a) you are worried about conjugacy, in which case you would use the normal or (b) there is some reasonable chance that the true value could be outside the endpoints of the uniform. With PyMC, there is no reason to worry about conjugacy, unless you specifically want to use a Gibbs sampler.

  2. A gamma is actually not a great choice for an uninformative prior on a variance/precision parameter. It can end up being more informative than you think. A better choice is to put a uniform prior on the standard deviation, then transform it by an inverse square. See Gelman 2006 for details.
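The transform in point 2 is just `tau = sigma**-2`; in PyMC you would put a Uniform prior on the standard deviation and wrap the transform in a `@deterministic`. A plain-Python sketch of the transform itself (the function name is mine, for illustration):

```python
def precision_from_sd(std_dev):
    # tau = 1 / sigma**2: the parameterisation PyMC's Normal expects
    return std_dev ** -2

# sanity check against the stddev of 20 used in the question:
# sigma = 20 gives tau = 1/400 = 0.0025
print(precision_from_sd(20.0))
```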

fonnesbeck
  • ah, fonnesbeck is one of the core developers of pymc! Can you show us an example of how to code point 2? – Cam.Davidson.Pilon Dec 28 '12 at 16:29
  • thanks fonnesbeck and, yes please! to a quick eg of point 2 :) – mat kelcey Dec 29 '12 at 04:41
  • in fact I'm guessing you mean something along the lines of ... https://gist.github.com/4404631 ? – mat kelcey Dec 29 '12 at 04:51
  • Yes, exactly. You can do the transform a little more concisely: `tau = std_dev**-2` – fonnesbeck Jan 03 '13 at 18:40
  • what would be the right place to read about where this relation between precision and std_dev comes from? – user979 May 22 '13 at 09:37
  • sorry for starting a discussion with myself, but I found it here: http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter2_MorePyMC/MorePyMC.ipynb in the Normal Distribution section – user979 May 22 '13 at 13:07