Questions tagged [multiarmed-bandit]

A problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation.

See https://en.wikipedia.org/wiki/Multi-armed_bandit.

An excerpt from there: this is a classic reinforcement learning problem that exemplifies the exploration-exploitation trade-off dilemma, since each choice's properties may become better understood as time passes or by allocating resources to it. The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.
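For orientation, here is a minimal sketch of that exploration-exploitation trade-off in code: an ε-greedy agent playing a Bernoulli bandit, pulling a random arm with probability ε and the best-looking arm otherwise. The arm payout probabilities, ε value, and round count are illustrative assumptions, not taken from any question below.

```python
import random

def epsilon_greedy_bandit(true_probs, epsilon=0.1, n_rounds=10_000, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy policy.

    true_probs : success probability per arm (unknown to the agent).
    epsilon    : probability of exploring (pulling a random arm) each round.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k        # pulls per arm
    estimates = [0.0] * k   # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                                 # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])        # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return estimates, counts, total_reward

# Example: three "slot machines" with different hidden payout rates.
est, counts, reward = epsilon_greedy_bandit([0.3, 0.5, 0.7])
print(est, counts, reward)
```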

164 questions
31 votes • 3 answers

Best bandit algorithm?

The most well-known bandit algorithm is the upper confidence bound (UCB) algorithm, which popularized this class of algorithms. Since then I presume better algorithms have been developed. What is the current best algorithm (in terms of either empirical performance or…
17 votes • 4 answers

In what kind of real-life situations can we use a multi-arm bandit algorithm?

Multi-armed bandits work well in situations where you have choices and you are not sure which one will maximize your well-being. You can use the algorithm for some real-life situations. As an example, learning can be a good field: if a kid is…
Andy K • 173
17 votes • 2 answers

What is Thompson Sampling in layman's terms?

I am unable to understand Thompson Sampling and how it works. I was reading about the multi-armed bandit problem, and after reading about the Upper Confidence Bound algorithm, many texts suggested that Thompson Sampling performs better than UCB. What is Thompson Sampling, in…
dejavu • 271
14 votes • 1 answer

Cost functions for contextual bandits

I'm using vowpal wabbit to solve a contextual-bandit problem. I'm showing ads to users, and I have a fair bit of information about the context in which the ad is shown (e.g. who the user is, what site they're on, etc.). This seems to be a pretty…
13 votes • 1 answer

Optimal algorithm for solving n-armed bandit problems?

I've read about a number of algorithms for solving n-armed bandit problems like $\epsilon$-greedy, softmax, and UCB1, but I'm having some trouble sorting through what approach is best for minimizing regret. Is there a known optimal algorithm for…
12 votes • 1 answer

Multi armed bandit for general reward distribution

I'm working on a multi-armed bandit problem where we do not have any information about the reward distribution. I have found many papers that guarantee regret bounds for distributions with a known bound, and for general distributions with support…
guest • 141
12 votes • 2 answers

Etymology of multi-armed bandit

I'm studying Reinforcement Learning, and have come across multi-armed bandits. Why are these called bandits? And why are they armed?
Tom Hale • 2,231
11 votes • 3 answers

Multi-armed bandit algorithms vs Uplift modeling

Multi-Armed Bandit: http://en.wikipedia.org/wiki/Multi-armed_bandit Uplift Modeling: http://en.wikipedia.org/wiki/Uplift_modelling How are these two approaches different? How are they similar? Is one better than the other? Edit: If an example…
AMathew • 1,000
10 votes • 2 answers

How is the Upper Confidence Bound derived?

I came across the formula for obtaining the upper confidence bounds on the k-armed bandit problem: $$c\sqrt{\frac{\ln N_i}{n_i}}$$ where $n_i$ is the number of samples we have for this particular bandit and $N_i$ is the total number of samples…
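The formula in this question is (up to the constant $c$) the UCB1 exploration bonus. Below is a minimal sketch of how such a bonus is typically used to select an arm, assuming every arm has already been pulled at least once and taking the total pull count to be the sum of the per-arm counts; the numbers in the example are illustrative.

```python
import math

def ucb1_select(estimates, counts, c=math.sqrt(2)):
    """Pick the arm maximizing estimate + c * sqrt(ln N / n_i).

    estimates : current mean reward per arm
    counts    : number of pulls per arm (assumed > 0 for every arm)
    """
    total = sum(counts)  # N, total number of pulls so far
    scores = [
        est + c * math.sqrt(math.log(total) / n)
        for est, n in zip(estimates, counts)
    ]
    return max(range(len(scores)), key=lambda i: scores[i])

# Example: arm 1 has the highest estimate, but arm 2 is badly under-explored,
# so its exploration bonus can still win the comparison.
print(ucb1_select([0.40, 0.55, 0.50], [120, 110, 5]))
```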
7 votes • 1 answer

Gradient Bandit Algorithm baseline

I am reading Sutton's latest draft of "Reinforcement Learning: An Introduction" and I came to the Gradient Bandit Algorithm (page 29). I am having a bit of trouble understanding how the baseline should be computed. The book says: "$\bar{R}_t \in…
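For context, here is a minimal sketch of a gradient bandit update with softmax action preferences and a baseline $\bar{R}_t$ maintained as the incremental average of rewards received so far; whether the current reward is folded into the baseline before or after the preference update is exactly the ambiguity the question is about, and this sketch picks one common choice. The step size and arm means are illustrative assumptions.

```python
import math, random

def gradient_bandit(arm_means, alpha=0.1, n_rounds=5000, seed=0):
    """Gradient bandit with softmax preferences and an average-reward baseline."""
    rng = random.Random(seed)
    k = len(arm_means)
    prefs = [0.0] * k   # H_t(a), action preferences
    baseline = 0.0      # R̄_t, running average of rewards
    for t in range(1, n_rounds + 1):
        # Softmax policy pi_t(a) over the preferences.
        exps = [math.exp(h) for h in prefs]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Sample an action and a Gaussian reward.
        a = rng.choices(range(k), weights=probs)[0]
        reward = rng.gauss(arm_means[a], 1.0)
        # Preference update: push the chosen arm up/down relative to the baseline.
        # Here the baseline does NOT yet include the current reward.
        for b in range(k):
            if b == a:
                prefs[b] += alpha * (reward - baseline) * (1 - probs[b])
            else:
                prefs[b] -= alpha * (reward - baseline) * probs[b]
        # Fold the new reward into the running average.
        baseline += (reward - baseline) / t
    return prefs

print(gradient_bandit([0.1, 0.8, 0.5]))
```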
6 votes • 0 answers

Confidence Interval for least squares estimator

There was a paper by Yasin Abbasi-Yadkori (https://arxiv.org/pdf/1102.2670.pdf) titled Online Least Squares Estimation with Self-Normalized Processes. I am trying to give a brief context before asking my question. Let $\{x_t, y_t\}$ be data not…
6 votes • 1 answer

Thompson Sampling

I read on Wikipedia that Thompson sampling consists of playing the action $a \in \mathcal{A}$ according to the probability that this action maximizes the expected reward. This probability seems to be: $\int {\mathbb …
Josh • 3,408
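For a Bernoulli bandit with Beta priors, "playing the action according to the probability that it maximizes the expected reward" reduces to drawing one sample from each arm's posterior and pulling the arm whose draw is largest. A minimal sketch under that Beta-Bernoulli assumption (the arm probabilities and round count are illustrative):

```python
import random

def thompson_bernoulli(true_probs, n_rounds=10_000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [1] * k   # alpha parameters (uniform prior)
    failures = [1] * k    # beta parameters
    for _ in range(n_rounds):
        # One posterior draw per arm; play the arm with the largest draw.
        draws = [rng.betavariate(successes[a], failures[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: draws[a])
        # Observe a Bernoulli reward and update that arm's posterior.
        if rng.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

print(thompson_bernoulli([0.3, 0.5, 0.7]))
```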
6 votes • 2 answers

Linear Regret for epsilon-greedy algorithm in Multi-Armed Bandit problem

I am reading about the $\epsilon$-greedy algorithm in the Multi-Armed Bandit (or $K$-armed bandit) problem, as can be seen here: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies. For the sake of completeness, I am stating the …
Sayan Pal • 171
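The linear regret referred to in this question follows from a short argument: with a constant $\epsilon$, a constant fraction of rounds are spent pulling uniformly random arms. Writing $K$ for the number of arms, $\mu^*$ for the best mean, and $\Delta_a = \mu^* - \mu_a$ for the suboptimality gaps,

$$\mathbb{E}[R_T] = \sum_{t=1}^{T}\mathbb{E}\left[\mu^* - \mu_{A_t}\right] \;\ge\; \sum_{t=1}^{T}\epsilon\cdot\frac{1}{K}\sum_{a=1}^{K}\Delta_a \;=\; \epsilon\,T\cdot\frac{1}{K}\sum_{a=1}^{K}\Delta_a \;=\; \Theta(T),$$

provided at least one gap is positive: the exploration rounds alone contribute a constant expected regret per round, no matter how accurate the value estimates become.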
6 votes • 2 answers

A continuous generalization of the binary bandit

There is plenty of reading out there about Bayesian (beta-binomial) multi-armed bandits for 0/1 data, but I would like to extend this slightly. To give some context, suppose I have two webpages, A and B. Now I want to test which webpage gets more…
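One common way to generalize the beta-binomial scheme from 0/1 data to continuous rewards (e.g. time on page rather than click/no-click) is to swap in another conjugate model. The sketch below uses Thompson sampling with a Normal likelihood of known variance and a Normal prior on each page's mean reward; the priors, observation variance, and page means are illustrative assumptions, not taken from the question.

```python
import random

def thompson_gaussian(true_means, obs_sd=1.0, n_rounds=5000, seed=0):
    """Thompson sampling with a Normal-Normal conjugate model per arm.

    Assumes continuous rewards with known observation sd `obs_sd`;
    each arm's unknown mean gets a N(0, 1) prior.
    """
    rng = random.Random(seed)
    k = len(true_means)
    obs_prec = 1.0 / obs_sd ** 2
    prior_mean, prior_prec = 0.0, 1.0
    n = [0] * k        # observations per arm
    total = [0.0] * k  # sum of rewards per arm
    for _ in range(n_rounds):
        draws = []
        for a in range(k):
            # Conjugate posterior for the arm's mean given its data so far.
            post_prec = prior_prec + n[a] * obs_prec
            post_mean = (prior_prec * prior_mean + obs_prec * total[a]) / post_prec
            draws.append(rng.gauss(post_mean, post_prec ** -0.5))
        arm = max(range(k), key=lambda a: draws[a])
        reward = rng.gauss(true_means[arm], obs_sd)  # e.g. time-on-page for that page
        n[arm] += 1
        total[arm] += reward
    return n

# Two "webpages" with different mean continuous rewards.
print(thompson_gaussian([1.0, 1.3]))
```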
6 votes • 1 answer

Bandits with mixed reward processes?

I am trying to model a sequential exploration-exploitation problem with learning as a multi-armed bandit, where the reward process mixes a Markovian and a stochastic reward. I understand how to model a bandit problem with Markovian rewards. In a Markovian…
dr___mario • 61