
I am interested in estimating the density of a continuous random variable $X$. One way of doing this that I have learnt is kernel density estimation (KDE).

But now I am interested in a Bayesian approach along the following lines. I initially believe that $X$ follows a distribution $F$. I take $n$ readings of $X$. Is there some approach to update $F$ based on my new readings?

I know I sound like I am contradicting myself: If I believe solely in $F$ as my prior distribution, then no data should convince me otherwise. However, suppose $F$ were $Unif[0,1]$ and my data points were like $(0.3, 0.5, 0.9, 1.7)$. Seeing $1.7$, I obviously cannot stick to my prior, but how should I update it?

Update: Based on the suggestions in the comments, I have started looking at the Dirichlet process. Let me use the following notation:

$$G \sim \mathrm{DP}(\alpha, H)\\ \theta_i \mid G \sim G\\ x_i \mid \theta_i \sim N(\theta_i, \sigma^2)$$

After framing my original problem in this language, I guess I am interested in the following: $\theta_{n+1} \mid x_1,\ldots,x_n$. How does one obtain this distribution?

In this set of notes (page 2), the author works through an example of $\theta_{n+1} \mid \theta_1,\ldots,\theta_n$ (the Pólya urn scheme). I am not sure if this is relevant.

Update 2: I also wish to ask (after seeing the notes): how do people choose $\alpha$ for the DP? It seems like an arbitrary choice. In addition, how do people choose the base measure $H$ for the DP? Should I just use my prior for $\theta$ as $H$?
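For intuition about $\alpha$, here is a quick simulation sketch (mine, not from any reference) of the Pólya urn representation of $DP(\alpha, H)$, with $H$ taken to be a standard normal purely for illustration. The expected number of distinct values among $n$ draws grows roughly like $\alpha \log(1 + n/\alpha)$:

```python
import numpy as np

def polya_urn(n, alpha, rng):
    """theta_1..theta_n from the Polya urn view of DP(alpha, H),
    with H = N(0, 1) purely for illustration."""
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            thetas.append(rng.normal())             # fresh draw from H
        else:
            thetas.append(thetas[rng.integers(i)])  # reuse an earlier value
    return np.array(thetas)

rng = np.random.default_rng(0)
for alpha in (0.5, 5.0, 50.0):
    draws = polya_urn(500, alpha, rng)
    print(alpha, len(np.unique(draws)))  # distinct values grow with alpha
```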

renrenthehamster
  • "If I believe solely in F as my prior distribution, then no data should convince me otherwise." This is the antithesis of Bayesian inference, which is more along the lines of *take what you believe in one hand and the world in the other hand, and mush them together and see what comes out.* Wash, rinse, repeat. – Alexis Jul 18 '14 at 15:45
  • Do you know something about the Dirichlet process? – niandra82 Jul 18 '14 at 17:25
  • Ignoring your last paragraph: there are two common approaches to this problem. One is a finite mixture of normals (you can choose how many normals based on likelihood in cross-validation); the other is an infinite mixture of normals, as @niandra82 is suggesting. These can be done with something like Gibbs sampling or variational inference. Are you familiar with any of these methods? –  Jul 18 '14 at 18:06
  • I should also ask, how do you intend to use this KDE? The method chosen and the size (infinite, finite) might depend on your aim. –  Jul 18 '14 at 18:18
  • This sounds like either a model choice problem or a philosophical one. In reality, our choice of which likelihood to use in Bayesian inference imposes prior beliefs too ... – Zoë Clark Jul 20 '14 at 05:16
  • Thanks for the various suggestions. I'm reading up more on them and thinking about them. I'll post a reply in a day or so. – renrenthehamster Jul 21 '14 at 13:43
  • @niandra82 I have started looking at the Dirichlet process and posted an update. – renrenthehamster Jul 21 '14 at 15:45
  • @Matthew Ditto as above! In response to your KDE question, I actually have no idea what one can do with the KDE (that's why I'm changing approach). My initial problem is somewhat as follows (not exactly): $X$ is a continuous non-negative R.V., and I want to test whether the sup of the support is 1. I was hoping that the KDE would be a good approximation of the actual distribution of $X$, so integrating the KDE for $x>1$ would give me some indication. – renrenthehamster Jul 21 '14 at 15:47
  • Try to have a look here: http://bayesian.org/sections/BNP/bnp-tutorials-and-videolectures http://www.stats.ox.ac.uk/~teh/npbayes.html http://stat.columbia.edu/~porbanz/talks/npb-tutorial.html http://stat.duke.edu/people/theses/RodriguezA.pdf – niandra82 Jul 21 '14 at 16:04
  • @niandra82 I tried looking at the links but one thing which I am extremely confused about is how $\theta_i$ is treated as though it is known. Could you please write a more guided approach on how I could obtain a posterior distribution on $\theta$ when I only observe $x_1,...,x_n$ but none of $\theta_1,...,\theta_n$? – renrenthehamster Jul 22 '14 at 17:45
  • Today I am really busy, so I will try to write an answer tomorrow. Just some questions: do you know something about the Gibbs sampler and Metropolis-Hastings? Have you, at least once, estimated a Bayesian model? – niandra82 Jul 23 '14 at 07:02
  • Just a remark: you are confusing some aspects. $F$ is the prior distribution for the density of the observed variable, but it is also a density. Now if it is $Unif[0,1]$ and you observe 1.7, since 1.7 is outside the interval, the posterior distribution is 0. You have to specify a prior with the same domain as the variable you put the prior on. – niandra82 Jul 23 '14 at 08:12
  • @niandra82 Nope, actually this is my first time ever hearing these terms, or even thinking about Bayesian modelling (previously, I only knew about the Bayesian way of estimating parameters, like say $p$ of a Binomial$(n,p)$). – renrenthehamster Jul 23 '14 at 13:51
  • @niandra82 I went through the notes further, and I can offer the following approach which works in principle... We have $f(\theta_{n+1}|x_1,...,x_n) = \int f(\theta_{n+1}|x_1,...,x_n,\theta_1,...,\theta_n)f(\theta_1,...,\theta_n|x_1,...,x_n) d\theta_1 ... d\theta_n \propto \int f(\theta_{n+1} | \theta_1,...,\theta_n) f(x_1,...,x_n | \theta_1,...,\theta_n)f(\theta_1,...,\theta_n) d\theta_1 ... d\theta_n $. In principle, we know $f(x_i | \theta_i)$ and $f(\theta_1,...,\theta_n)$ and $f(\theta_{n+1} | \theta_1,...,\theta_n)$. So just somehow compute the integral using some form of Monte Carlo... – renrenthehamster Jul 23 '14 at 15:30
  • ^ There might be a simpler way, because the amount of Monte Carlo I have to do seems quite massive... especially to get $f(\theta_1,...,\theta_n)=\prod f(\theta_i)$ and to do the multidimensional integration ($n=30$ for me...). I am familiar with iPython and R - not sure if there's an easier way to accomplish my task using some package there? – renrenthehamster Jul 23 '14 at 15:33
  • Final question: Made an edit to the post about my query on the choice of $\alpha$ and the prior $H$. – renrenthehamster Jul 23 '14 at 15:35
  • @renrenthehamster Here you are confusing too many things. Before going to the Dirichlet process and nonparametric Bayesian methods, you probably need to learn how to estimate a Bayesian model with Metropolis-Hastings and the Gibbs sampler in a simple setting, say a regression. An R package for Dirichlet models is DPpackage... – niandra82 Jul 23 '14 at 17:01
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/15913/discussion-between-niandra82-and-renrenthehamster). – niandra82 Jul 23 '14 at 17:13
  • I couldn't comment on this, but I feel like this may be of interest to you: http://en.wikipedia.org/wiki/Conjugate_prior – user52220 Jul 23 '14 at 21:20
  • Ah! I looked up the terms that you gave in the chat. I think I got everything I need now. Thanks! – renrenthehamster Jul 24 '14 at 19:38

3 Answers


Since you want a Bayesian approach, you need to assume some prior knowledge about the thing you want to estimate, and that knowledge takes the form of a distribution.

Here, that would be a distribution over distributions. This poses no problem, however, if you assume that the candidate distributions come from some parameterized class.

For example, if you assume the data are Gaussian with unknown mean but known variance, then all you need is a prior over the mean.

MAP estimation of the unknown parameter (call it $\theta$) could proceed by assuming that all the observations/data points are conditionally independent given the unknown parameter. Then the MAP estimate is

$\hat{\theta} = \arg \max_\theta ( \text{Pr}[x_1,x_2,...,x_n,\theta] )$,

where

$ \text{Pr}[x_1,x_2,...,x_n,\theta] = \text{Pr}[x_1,x_2,...,x_n | \theta] \text{Pr}[\theta] = \text{Pr}[\theta] \prod_{i=1}^n \text{Pr}[x_i | \theta]$.

It should be noted that there are particular combinations of the prior probability $\text{Pr}[\theta]$ and the candidate distributions $\text{Pr}[x | \theta]$ that give rise to easy (closed form) updates as more data points are received.
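As a minimal sketch of one such conjugate pair, under the assumptions of this answer (normal data with known variance $\sigma^2$, a $N(\mu_0, \tau_0^2)$ prior on the mean; the numeric hyperparameter values below are arbitrary illustrations):

```python
import numpy as np

def posterior_mean_known_variance(x, mu0, tau0_sq, sigma_sq):
    """Conjugate update: N(mu0, tau0_sq) prior on the mean theta,
    N(theta, sigma_sq) likelihood with known sigma_sq."""
    post_prec = 1.0 / tau0_sq + len(x) / sigma_sq   # precisions add
    post_var = 1.0 / post_prec
    post_mean = post_var * (mu0 / tau0_sq + np.sum(x) / sigma_sq)
    return post_mean, post_var

x = np.array([0.3, 0.5, 0.9, 1.7])                  # the data from the question
m, v = posterior_mean_known_variance(x, mu0=0.0, tau0_sq=1.0, sigma_sq=0.25)
print(m, v)  # posterior is N(0.8, 1/17); it is normal, so MAP = mean = 0.8
```

Because the prior and posterior are in the same family, the same update can be applied again as each new point arrives, which is exactly the "easy closed form" property mentioned above.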

bean

For density estimation purposes, what you need is not

$\theta_{n+1}|x_{1},\ldots,x_{n}$.

The formula in the notes, $\theta_{n+1}\mid\theta_{1},\ldots,\theta_{n}$, refers to the predictive distribution of the Dirichlet process (the Pólya urn scheme).

For density estimation you actually have to sample from the predictive distribution $$ \pi(dx_{n+1}|x_{1},\ldots,x_{n}) $$

Sampling from the above distribution can be done either with conditional methods or with marginal methods. For the conditional methods, take a look at the paper by Stephen Walker [1]. For the marginal methods, you should check Radford Neal's paper [2].
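To make the marginal route concrete, here is a minimal, unoptimized sketch in the spirit of Neal's Algorithm 3 for a DP mixture of normals, assuming a known kernel variance $\sigma^2$ and a conjugate base measure $H = N(\mu_0, \tau_0^2)$ so the $\theta_i$'s can be integrated out; all hyperparameter values are illustrative assumptions, not recommendations:

```python
import numpy as np

def norm_pdf(x, mean, var):
    """Normal density, vectorized over x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def cluster_posterior(xs, mu0, tau0_sq, sigma_sq):
    """Conjugate normal update for a cluster's mean given its points xs."""
    prec = 1.0 / tau0_sq + len(xs) / sigma_sq
    var = 1.0 / prec
    mean = var * (mu0 / tau0_sq + np.sum(xs) / sigma_sq)
    return mean, var

def dp_predictive_density(x, grid, alpha, mu0, tau0_sq, sigma_sq,
                          n_sweeps=2000, burn=500, seed=0):
    """Average pi(x_{n+1} | x_1..x_n, z) over Gibbs draws of the
    cluster assignments z (the theta's are integrated out)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.zeros(n, dtype=int)           # start with one big cluster
    density = np.zeros_like(grid)
    kept = 0
    for sweep in range(n_sweeps):
        for i in range(n):
            mask = np.arange(n) != i     # everyone except point i
            labels, counts = np.unique(z[mask], return_counts=True)
            probs = []
            for lab, cnt in zip(labels, counts):
                m, v = cluster_posterior(x[mask][z[mask] == lab],
                                         mu0, tau0_sq, sigma_sq)
                probs.append(cnt * norm_pdf(x[i], m, v + sigma_sq))
            # probability of opening a new cluster, theta integrated over H
            probs.append(alpha * norm_pdf(x[i], mu0, tau0_sq + sigma_sq))
            probs = np.array(probs)
            k = rng.choice(len(probs), p=probs / probs.sum())
            z[i] = labels[k] if k < len(labels) else labels.max() + 1
        if sweep >= burn:
            # accumulate the predictive density given this configuration z
            labels, counts = np.unique(z, return_counts=True)
            d = alpha * norm_pdf(grid, mu0, tau0_sq + sigma_sq)
            for lab, cnt in zip(labels, counts):
                m, v = cluster_posterior(x[z == lab], mu0, tau0_sq, sigma_sq)
                d += cnt * norm_pdf(grid, m, v + sigma_sq)
            density += d / (n + alpha)
            kept += 1
    return density / kept

x = np.array([0.3, 0.5, 0.9, 1.7])
grid = np.linspace(-1.0, 3.0, 200)
dens = dp_predictive_density(x, grid, alpha=1.0, mu0=0.0,
                             tau0_sq=1.0, sigma_sq=0.1)
```

A real analysis would use many more data points, proper MCMC diagnostics, and a package such as DPpackage in R (mentioned in the comments) rather than hand-rolled code.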

For the concentration parameter $\alpha$, Mike West [3] proposes a method for inference within the MCMC procedure, including a full conditional distribution for $\alpha$. If you decide not to update $\alpha$ in the MCMC procedure, keep in mind that a large value of $\alpha$ yields more distinct values drawn from the Dirichlet process than a small value does.

[1] Walker, S. G. (2006). Sampling the Dirichlet mixture model with slices. Communications in Statistics (Simulation and Computation).

[2] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249-265.

[3] West, M. (1992). Hyperparameter estimation in Dirichlet process mixture models. Technical report, Duke University.

Christos

*Is there some approach to update F based on my new readings?*

There is something precisely for that. It's pretty much the main idea of Bayesian inference.

$p(\theta | y) \propto p(y|\theta)p(\theta)$

The $p(\theta)$ is your prior, what you call $F$. The $p(y\mid\theta)$ is what Bayesians call the "likelihood": the probability of observing your data given some value of $\theta$. You just multiply them together and get what's called the "posterior" distribution of $\theta$. This is your "updated $F$". Check out chapter 1 of any introductory Bayesian statistics book.

You don't have to get rid of $p(\theta)$ (your prior), you just have to realize that it's not your best guess anymore, now that you have data to refine it.
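A toy numerical sketch of this update, assuming (purely for illustration) a Laplace prior on $\theta$ and a normal likelihood with known variance, chosen deliberately so that no conjugate closed form applies:

```python
import numpy as np

# p(theta|y) ∝ p(y|theta) p(theta), evaluated on a grid.
# Illustrative assumptions: Laplace prior exp(-|theta|) on theta,
# N(theta, 0.25) likelihood (known variance) for each observation.
theta = np.linspace(-3.0, 3.0, 601)
prior = np.exp(-np.abs(theta))                        # unnormalized prior
y = np.array([0.3, 0.5, 0.9, 1.7])                    # data from the question
loglik = -0.5 * ((y[:, None] - theta[None, :]) ** 2 / 0.25).sum(axis=0)
posterior = prior * np.exp(loglik)                    # multiply, then normalize
posterior /= posterior.sum() * (theta[1] - theta[0])
print(theta[np.argmax(posterior)])                    # posterior mode, ~0.79
```

The same three steps (evaluate the prior, multiply by the likelihood, normalize) work for any prior you can evaluate on a grid, which is why this is the generic answer to "how should I update it?".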

rcorty
  • This is not answering what the question is asking. OP is asking how one can put a prior on $F$ when $X_1, \ldots, X_n \stackrel{iid}{\sim} F$. Assuming our prior on $F$ puts probability one on distributions with a density, the likelihood is $L(F) = \prod_{i=1}^N \left.\frac{dF}{dx}\right|_{x = x_i}$. So we need to construct a prior on the space of distribution functions $F$ which are differentiable (which is infinite dimensional), and OP is asking how to do this. – guy Sep 13 '14 at 21:02