I'm working on a multi-armed bandit problem where we do not have any information about the reward distribution.
I have found many papers that guarantee regret bounds either for distributions with a known bound or for general distributions with support in [0,1].
I would like to know whether there is a way to perform well in an environment where nothing is guaranteed about the support of the reward distribution. My idea is to compute a nonparametric tolerance limit and use that number to rescale the rewards into [0,1], so that I can apply Algorithm 2 from this paper (http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf). Does anyone think this approach will work?
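For concreteness, here is a minimal sketch of what I have in mind. The `lo`/`hi` limits stand in for the tolerance limits I would estimate beforehand (they are my assumption, not part of the paper); the Bernoulli-trial update is the device Algorithm 2 uses to handle non-binary rewards in [0,1] with a Beta posterior:

```python
import random

def thompson_sampling_scaled(pull, n_arms, horizon, lo, hi):
    """Thompson Sampling in the style of Agrawal-Goyal Algorithm 2,
    applied to rewards rescaled from an assumed support [lo, hi].
    `lo` and `hi` are placeholders for the tolerance limits I would
    estimate nonparametrically; they are an assumption here."""
    successes = [0] * n_arms  # Beta posterior counts S_i
    failures = [0] * n_arms   # Beta posterior counts F_i
    total = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta(S_i + 1, F_i + 1) posterior
        # and play the arm with the largest sample.
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        r = pull(arm)
        total += r
        # Rescale the raw reward into [0, 1] using the assumed support,
        # clipping anything that falls outside the estimated limits.
        r_scaled = min(max((r - lo) / (hi - lo), 0.0), 1.0)
        # Bernoulli trick from the paper: treat the scaled reward as a
        # success probability so the Beta update remains valid.
        if random.random() < r_scaled:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total

# Toy usage: two Gaussian arms whose support is unbounded, scaled with
# guessed limits. With well-separated means the better arm dominates.
random.seed(0)
means = [2.0, 5.0]
def pull(i):
    return random.gauss(means[i], 1.0)

total_reward = thompson_sampling_scaled(pull, n_arms=2, horizon=2000,
                                        lo=-2.0, hi=9.0)
print(total_reward / 2000)
```

My worry is the clipping step: if the tolerance limit underestimates the true range, rewards get truncated and the posterior is updated on a distorted signal, which is exactly where I am unsure the regret guarantees still apply.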
If not, can anyone point me in the right direction?
Thanks a bunch!