
There is plenty of reading out there about Bayesian (beta-binomial) multi-armed bandits for 0/1 data, but I would like to extend this slightly.

To give some context, suppose I have two webpages, A and B. I want to test which webpage gets more people to call in, so I begin by randomly serving A or B to incoming visitors.

This is equivalent to starting with the same Beta prior on A and B: either $\text{prior}_A = \text{prior}_B = \mathrm{Beta}(0,0)$ (intended as uninformative) or some common $\mathrm{Beta}(k_1, k_2)$ for both.

As the binary data comes in (either people call or they don't), I update according to Bayes' Rule and so I end up with $\text{posterior}_A = \mathrm{Beta}(k_1 + \text{wins}_A, k_2 + \text{trials}_A - \text{wins}_A)$ and similarly for B.

To be clear, a trial is serving a visitor a webpage and a success is a call.
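For reference, the binary-data procedure above can be sketched in a few lines of Python/NumPy. This is only illustrative: it assumes a $\mathrm{Beta}(1,1)$ prior and uses Thompson sampling to pick the page, which is one common choice rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior pseudo-counts per page, starting from an (assumed) Beta(1, 1) prior.
posterior = {"A": [1.0, 1.0], "B": [1.0, 1.0]}  # [alpha, beta]

def choose_page():
    # Thompson sampling: draw one conversion rate from each Beta posterior,
    # then serve the page whose draw is highest.
    draws = {page: rng.beta(a, b) for page, (a, b) in posterior.items()}
    return max(draws, key=draws.get)

def update(page, called):
    # Bayes' rule for the beta-binomial model: a call adds to alpha,
    # no call adds to beta.
    posterior[page][0 if called else 1] += 1

# Example: serve a visitor and record that they did not call.
page = choose_page()
update(page, called=False)
```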

My Question

What is the continuous analogue of this binary data model? That is, if my data coming in is now of the form $0.5, 1.6, 8.95$, etc., what can I use to decide which of A or B is better? I have looked into Gaussian processes a little, but I'm not sure that is what I want. Thanks for any help.

An extension to this question may be found here.

Moderat
  • Can you tell us more about these numbers? Can they be negative? Zero? Even better, what do they represent? Call duration? Session duration? – jlimahaverford Mar 19 '15 at 03:33
  • @jlimahaverford they can be negative or zero or positive. They represent the profit of a given visit: call duration is factored in as well as how much it cost me to get someone to click on my web page – Moderat Mar 19 '15 at 14:21
  • I have extended this question here http://stats.stackexchange.com/questions/168718/multi-armed-bandit-for-continuous-rewards-extended-question - editing the question was rejected, thus the duplicate – Regenschein Aug 25 '15 at 15:11

2 Answers


Disclosure: I know almost nothing about bandits. Still, my suggestion seems like a natural generalization of the case you presented. It does not consider the experimental design step in detail (since I don't know what people usually consider as a loss function in this scenario), so it might fail in this respect.

Let me ignore the fact that we have two web pages. You have a web page, and your prior for "profit per visit" (denoted $\theta$) is (say) Gaussian: $p(\theta) = (2\pi)^{-\frac{1}{2}} \exp\left(-\frac{\theta^2}{2}\right)$. Let's also say that you believe that, if you knew the average profit per visit, your gains would be distributed according to another normal, $X \sim \mathcal{N}(\theta, 1)$. Then you have a valid (and very simple) Bayesian model.

Now, use the same model for both web pages. After observing $n$ (potential) clients' reactions, you have the distributions for the posterior of $A$ and $B$ which are both Gaussian. Based on whatever target you choose for the experimental design step, you can choose which web page to present next. Calculating and differentiating the target should not be too hard, since we know how to integrate Gaussians.
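To make this concrete, here is a minimal sketch in Python/NumPy of the Normal-Normal model above. The $\mathcal{N}(0,1)$ prior and unit observation variance follow the answer; using posterior sampling (Thompson-style) to pick the next page is an extra assumption, just one possible choice for the experimental design step.

```python
import numpy as np

rng = np.random.default_rng(1)

OBS_VAR = 1.0  # assumed known variance of a single observed profit
# Gaussian posterior over the mean profit of each page, starting from N(0, 1).
posterior = {"A": {"mean": 0.0, "var": 1.0}, "B": {"mean": 0.0, "var": 1.0}}

def choose_page():
    # Draw a plausible mean profit from each posterior and serve the argmax.
    draws = {p: rng.normal(s["mean"], np.sqrt(s["var"])) for p, s in posterior.items()}
    return max(draws, key=draws.get)

def update(page, profit):
    # Conjugate Normal-Normal update with known observation variance
    # (the "complete the squares" step, done in precision form).
    s = posterior[page]
    precision = 1.0 / s["var"] + 1.0 / OBS_VAR
    s["mean"] = (s["mean"] / s["var"] + profit / OBS_VAR) / precision
    s["var"] = 1.0 / precision

# Example: serve a page and observe a profit of 1.6 for that visit.
page = choose_page()
update(page, profit=1.6)
```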

Yair Daon
  • Thanks for your answer. Here are some questions to consider: what would be a good prior for the Gaussian distribution? How do we update the prior with the data seen (conjugate prior)? For example, in the binomial case the Beta is the conjugate prior for the binomial distribution. How do we estimate the mean and the variance given the data? What does "Gaussian process" mean and how is this term related to your answer? How do we avoid randomness: say the first three data points give a reward of 1000 to arm 1, how do we assure that exploration will still occur? – Regenschein Aug 24 '15 at 15:01
  • Prior - I used a Gaussian prior for the Gaussian. If we know the variance (assumed 1 in both prior and likelihood) then the Gaussian is its own conjugate prior - one only needs to complete the squares (the worked update appears after these comments). I did not use any Gaussian process, just a prior and a likelihood. I can think of a few ways to avoid randomness (e.g. choose the other page every once in a while regardless of what is best, or use some information-theoretic criterion), but if you do believe the Bayesian model, then randomness is suboptimal. Please let me know if you want me to write an extended answer. – Yair Daon Aug 24 '15 at 15:21
  • I appreciate your answer; my first approach was quite similar. A simulation of this approach has shown several weaknesses. For example, consider a reward for the first arm of 10, 15, 20, 15, which results in a mean of 15 and a variance of ~16.67, and for the second arm 50, 150, 10, 10, having a mean of 55 and a variance of ~4366.7. Now, although the second bandit obviously got a much higher reward, if you sample from these two Gaussian distributions the first arm is explored "too often"; the two bells overlap too much. There must be something 'better' in terms of regret bounds out there :-) – Regenschein Aug 24 '15 at 16:08
  • Why do you say the first arm is explored too often? If the variance is high it means you can't predict the result well, so you'd like to keep exploring. – Yair Daon Aug 24 '15 at 16:36
  • But we also want to _minimize the regret_. Therefore we need to minimize the exploration if one bandit is better than the other. Data like this appears in the wild. If it were as simple as just fitting N Gaussians then I'd be surprised - although this approach seems to be totally valid, I'd like to see whether there are some refinements :-) A simple greedy approach seems to work better on constructed examples like the one above, so I wonder if there's something in between. For binomial data, Thompson sampling works quite well, better than greedy strategies. – Regenschein Aug 24 '15 at 20:21
  • As I mentioned, I don't know much about bandits. Since we're all here to learn, can you add a few lines to the original post (or an answer of your own or to my answer) explaining the main known issues and solutions (e.g. Thompson sampling). Also, just to make sure - I suggest using (essentially) one Gaussian per web page (==arm). – Yair Daon Aug 24 '15 at 20:49
  • I updated the question (currently being peer-reviewed) – Regenschein Aug 25 '15 at 13:16
  • My edit has been rejected although I didn't change the original intent of the question :-( – Regenschein Aug 25 '15 at 14:23
  • @Regenschein In case you didn't notice it, I added an answer here that I didn't put under your question. It's a good paper if you're interested in Bayesian reinforcement learning. – Neil G Aug 26 '15 at 04:53
  • This answer is consistent with the answer by Neil G in http://stats.stackexchange.com/questions/168718/multi-armed-bandit-for-continuous-rewards-extended-question and accepted for the bounty. – Regenschein Aug 26 '15 at 15:23
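For completeness, the "complete the squares" update mentioned in the comments is the standard Normal-Normal conjugate result (not spelled out in the thread): with prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$, known observation variance $\sigma^2$, and observations $x_1, \dots, x_n$,

$$\theta \mid x_{1:n} \sim \mathcal{N}\!\left(\frac{\mu_0/\sigma_0^2 + \sum_{i=1}^n x_i/\sigma^2}{1/\sigma_0^2 + n/\sigma^2},\ \frac{1}{1/\sigma_0^2 + n/\sigma^2}\right),$$

which for the answer's choice $\mu_0 = 0$, $\sigma_0^2 = \sigma^2 = 1$ reduces to $\mathcal{N}\!\left(\frac{\sum_i x_i}{n+1},\ \frac{1}{n+1}\right)$.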

This problem is tackled by the Dearden paper on Bayesian Q-Learning. He considers a normal model for "returns" (future rewards) like @yair's solution. He also considers another term called the "value of information": "the expected improvement in future decision quality that might arise from the information acquired by exploration."

Also, FYI Beta(0, 0) is not "uninformative". The Jeffreys prior is Beta(0.5, 0.5).

Dearden, Richard, Nir Friedman, and Stuart Russell. "Bayesian Q-learning." AAAI/IAAI. 1998.

Neil G