The strategic trade-off in multi-armed bandit problems: In multi-armed bandit problems the gambler plays one "arm" (one reward distribution) each round and attempts to maximise his total expected return over a given number of rounds. In each round of play (except the last), this involves a strategic trade-off between two objectives:
Immediate rewards: In each round he would like to choose a distribution that gives him a high expected reward on this round, which entails a preference for distributions he (presently) infers to have a high mean reward;
Future rewards (affected by information gain): On the other hand, he wants to refine his knowledge of the true expected rewards by gaining more information about the distributions (especially those he has not played as often as the others), so that he can improve his choices in future rounds.
The relative importance of these two objectives determines the trade-off, and this relative importance is affected by a number of factors. For example, if only a small number of rounds remain then information that improves future choices is relatively less valuable, whereas if a large number of rounds remain then it is relatively more valuable.
As for the "signal-to-noise" ratio (i.e., the variability of the rewards in each distribution), this affects the speed at which the gambler can gain reliable information about the expected reward from each distribution. If the signal-to-noise ratio is high (i.e., there is low variability in the rewards) then the gambler gains reliable information on the expected reward with fewer plays of that distribution. If the signal-to-noise ratio is low (i.e., there is high variability in the rewards) then the gambler needs more plays of that distribution to gain reliable information on the expected reward.
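As a rough illustration of this point, here is a minimal sketch (not the model in your question) assuming normally distributed rewards with a known standard deviation and a conjugate normal prior on the unknown mean. It shows how quickly the posterior uncertainty about a mean reward shrinks for different levels of reward variability; the function name and prior settings are purely illustrative.

```python
import numpy as np

def posterior_sd_of_mean(n_obs, reward_sd, prior_sd=10.0):
    """Posterior standard deviation of an unknown mean reward after n_obs
    observations, under a conjugate normal model with known reward standard
    deviation reward_sd and a N(0, prior_sd**2) prior on the mean."""
    precision = 1.0 / prior_sd**2 + n_obs / reward_sd**2
    return np.sqrt(1.0 / precision)

# Low variability ("high signal-to-noise"): the posterior tightens quickly.
# High variability ("low signal-to-noise"): many more plays are needed.
for reward_sd in (0.5, 5.0):
    sds = [posterior_sd_of_mean(n, reward_sd) for n in (1, 5, 20)]
    print(f"reward sd = {reward_sd}: posterior sd after 1/5/20 plays = "
          + ", ".join(f"{s:.3f}" for s in sds))
```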
Quantifying this strategic trade-off: For problems involving a finite number of rounds, the optimal decisions can be found via backwards induction using Bayesian decision theory. Consider a general problem with $N$ rounds of play using $K$ reward distributions $F_1, ..., F_K$ with respective means $\mu_1, ..., \mu_K$.
We will denote the choice of reward distribution in a round by the action variable $a \in \{ 1, ..., K \}$ and we note that this is allowed to depend on observations in previous rounds. We denote the corresponding (random) gains by $G_a$, where we have $G_a \sim F_{a}$.
Assume that when there are $n$ remaining rounds, the gambler has some belief function $\pi_n$ which is a distribution over the vector of reward distributions (or their parameters if this is a parametric problem). This function represents the gambler's belief about the reward distributions with $n$ remaining rounds, and it obeys the rules of Bayesian updating. (This is a prior distribution if $n = N$, or a posterior distribution based on the previous rounds of play if $n < N$. We will depart from standard Bayesian notation for this function by using the subscript to denote the number of remaining rounds, rather than the number of rounds that have already been played.) Under this belief we denote the corresponding predictive distributions for the rewards as $\tilde{F}_{n,1}, ..., \tilde{F}_{n,K}$.
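For example, in the Bernoulli-reward setting of your question, a standard (though not obligatory) choice is to give each arm an independent conjugate Beta prior. If $\pi_n$ assigns the mean reward of arm $a$ a $\text{Beta}(\alpha_{n,a}, \beta_{n,a})$ distribution, then the predictive distribution and the Bayesian update after observing $G_a$ are:

$$\tilde{F}_{n,a} = \text{Bern} \Big( \frac{\alpha_{n,a}}{\alpha_{n,a} + \beta_{n,a}} \Big) \quad \quad \quad \pi_{n-1}: \ \mu_a \sim \text{Beta}(\alpha_{n,a} + G_a, \ \beta_{n,a} + 1 - G_a),$$

with the beliefs for the arms that were not played carried forward unchanged.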
Consider the situation faced by the gambler with present belief $\pi_n$ facing $n$ remaining rounds of play. We will denote the expected future gains under present action $a$ by $g_n (a |\pi_n)$ and denote the expected future gains under optimal play by $\upsilon_n (\pi_n) \equiv \max_a g_n (a |\pi_n)$. Using a simple backward induction argument, we can express the optimisation problem recursively from the last round of play. Under optimal play these functions must satisfy the following recurrence relation:
$$\begin{equation} \begin{aligned}
g_n (a |\pi_n)
&= \mathbb{E}(G_a | \pi_n) + \mathbb{E} (\upsilon_{n-1} (\pi_{n-1}(a, G_a)) | G_a \sim \tilde{F}_{n,a} ) \\[6pt]
&= \underbrace{\mathbb{E}(G_a | \pi_n)}_{\text{Current}} + \underbrace{\upsilon_{n-1} (\pi_{n})}_{\text{Future}} + \underbrace{\mathbb{E} (\upsilon_{n-1} (\pi_{n-1}(a, G_a)) - \upsilon_{n-1} (\pi_{n}) | G_a \sim \tilde{F}_{n,a} )}_{\text{Value of Information Gain}}
\end{aligned} \end{equation}$$
where $\pi_{n-1}$ is the updated posterior distribution given the additional observation $G_a$ of the reward from distribution $a$ (so it depends on the chosen action $a$ and the resulting random reward $G_a$), and $\tilde{F}_{n,a}$ is the predictive distribution for $G_a$ under the belief $\pi_n$. This recursive formula decomposes the expected future gain under a particular action into the sum of: (1) the expected gain in the current round; (2) the expected gain in future rounds under our current information $\pi_{n}$; and (3) the change in the expected future gains due to the information obtained in the present round (i.e., the value of gaining this additional information and updating our beliefs from $\pi_{n}$ to $\pi_{n-1}$).
In the case where $n = 1$ (i.e., we are at the last round) we have $\upsilon_0 = 0$, so there is no longer any gain in future rounds, nor any gain from further information. By combining this recursive optimisation rule with the rules for Bayesian updating of the belief function, we can solve the problem by backwards induction: first find the optimal action in the final round (as a function of the belief at that time) and then solve recursively to find the optimal action in each earlier round (each one being a function of the belief at that time).
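To make the backward induction concrete, here is a minimal sketch for the Bernoulli-reward case with the conjugate Beta beliefs described above. The function names (`value`, `gain`, `value_of_information`) and the tuple representation of the belief are purely illustrative, and the naive enumeration is only suitable for small horizons and small numbers of arms.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(n, belief):
    """v_n(pi_n): expected total gain over the remaining n rounds under
    optimal play.  The belief is a tuple of (alpha, beta) pairs, one Beta
    posterior per arm, for Bernoulli rewards with conjugate Beta priors."""
    if n == 0:
        return 0.0
    return max(gain(n, a, belief) for a in range(len(belief)))

def updated(belief, a, g):
    """pi_{n-1}(a, G_a): Bayesian update of arm a's Beta belief after
    observing reward g (0 or 1); the other arms are unchanged."""
    alpha, beta = belief[a]
    return belief[:a] + ((alpha + g, beta + 1 - g),) + belief[a + 1:]

def gain(n, a, belief):
    """g_n(a | pi_n): expected reward now plus the expected optimal future
    value, averaged over the predictive distribution of G_a."""
    alpha, beta = belief[a]
    p = alpha / (alpha + beta)                 # predictive P(G_a = 1)
    return (p * (1.0 + value(n - 1, updated(belief, a, 1)))
            + (1 - p) * value(n - 1, updated(belief, a, 0)))

def value_of_information(n, a, belief):
    """Third term of the decomposition: the expected improvement in future
    value from updating pi_n to pi_{n-1} after observing G_a."""
    alpha, beta = belief[a]
    p = alpha / (alpha + beta)
    expected_future = (p * value(n - 1, updated(belief, a, 1))
                       + (1 - p) * value(n - 1, updated(belief, a, 0)))
    return expected_future - value(n - 1, belief)

# Two arms, 5 rounds remaining: one arm has been played before (3 successes,
# 1 failure on top of a uniform prior), the other is unexplored.
belief = ((4, 2), (1, 1))
print([gain(5, a, belief) for a in range(2)])
print([value_of_information(5, a, belief) for a in range(2)])
```

Memoising on (rounds remaining, belief) keeps the recursion tractable here, since many different action/observation sequences lead to the same Beta pseudo-counts.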
The relevance of reward variability ("signal-to-noise"): The variability of the rewards affects how well the gambler can make inferences about the mean reward of each distribution, and this in turn affects the value of additional information (the last term in the recursive value decomposition). If the variability of a reward distribution is low then the gambler learns rapidly about that distribution from only a few observations, which means that the value of the information gain starts off high and then rapidly decreases. On the other hand, if the variability of a reward distribution is high then the gambler learns slowly and requires many observations to make good inferences, which means that the information gain accrues more slowly but does not decrease as rapidly.
Given a prior belief $\pi_N$ and a model used by the gambler, it is possible to find the posterior beliefs in each round of play (as a function of the chosen actions and observations in previous rounds), and from this, it is possible to obtain an expression for the term quantifying the value of the information gain from each allowable action. This is complicated, since it involves the interplay of Bayesian updating and backward induction techniques for decision theory. Nonetheless, if it is successfully applied, then the "value of information" term will be a function of the chosen action $a$ and the present belief $\pi_n$ (which is itself a function of the prior belief $\pi_N$ and all previous actions and observations). Since the reward variability affects the posterior updating, any parameters describing this variability should show up as terms in this "value of information" function.
How to model the effect of reward variability: If you would like to characterise this kind of problem in terms of the "signal-to-noise" ratios for each reward (which I take to be some measure of its variability) then you would need to formulate a complete model, solve the resulting posterior form and the optimal strategy, and then use these to calculate the "value of information" function as a function of your "signal-to-noise" measures.
In the model you pose in your question you are using Bernoulli rewards (a one-parameter distribution), so the mean reward of each distribution fixes its variance. This one-parameter formulation means that your model cannot really accommodate variable signal-to-noise. If you would like to explore this issue, I suggest you change to a simple two-parameter distribution where the scale parameter is a known control variable. A simple example would be to use normally distributed rewards with unknown means and known variance parameters (with these variance parameters being your measures of reward variability). You could then formulate a strategy using a simple conjugate prior, solve the resulting model to get the "value of information" function, and examine how this function is affected by changes in the known variance parameters.
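To give a flavour of that last step, here is a minimal Monte Carlo sketch for the simplest non-trivial case: two rounds remaining, normal rewards with known standard deviation, and independent conjugate normal priors on the means. With two rounds left, the optimal value in the final round is just the best posterior mean, so the value-of-information term can be estimated directly by simulating one play. The function `voi_two_rounds` and its arguments are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def voi_two_rounds(prior_means, prior_sds, reward_sd, arm, n_draws=200_000):
    """Monte Carlo estimate of the value-of-information term for playing
    `arm` when two rounds remain, under normal rewards with known standard
    deviation reward_sd and independent normal priors on the arm means."""
    m, s = prior_means[arm], prior_sds[arm]
    best_other = max(mu for i, mu in enumerate(prior_means) if i != arm)

    # Draw rewards from the predictive distribution of the chosen arm.
    g = rng.normal(m, np.sqrt(s**2 + reward_sd**2), size=n_draws)

    # Conjugate normal update of the chosen arm's posterior mean.
    post_prec = 1.0 / s**2 + 1.0 / reward_sd**2
    post_mean = (m / s**2 + g / reward_sd**2) / post_prec

    # E[v_1(pi_1)] minus v_1(pi_2): the expected improvement in the final
    # round's value from having seen one extra reward.
    updated_value = np.maximum(post_mean, best_other).mean()
    current_value = max(prior_means)
    return updated_value - current_value

# Two arms with identical prior beliefs: vary the known reward variability.
for reward_sd in (0.5, 2.0, 10.0):
    print(reward_sd, voi_two_rounds([0.0, 0.0], [1.0, 1.0], reward_sd, arm=0))
```

In this sketch, lowering the known reward standard deviation increases the value of information obtained from a single play, which is the "signal-to-noise" effect described above; raising it makes a single observation nearly worthless for improving the final-round choice.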