I am studying the Bandit Algorithms book by Tor Lattimore and Csaba Szepesvári, and I have read the chapter on the adversarial bandit problem. However, I do not understand the mechanism of the adversarial bandit. The book says that the regret of any deterministic policy is linear because, in the adversarial setting, we assume the policy is revealed to the adversary, who then chooses the rewards based on it; the regret is therefore linear. In this setting, does "deterministic" mean, for example, that we always play arm 2, or just that the policy is any deterministic function of the history? I would be thankful if someone could explain the mechanism of the adversarial bandit, and also its contextual form.
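To make my reading of the mechanism concrete, here is a small simulation I sketched myself (it is not from the book): since the adversary knows the deterministic rule, it can simulate it, predict the arm about to be played, and assign that arm reward 0 and every other arm reward 1. This forces regret of order $n$ no matter which deterministic rule is used. The specific policy below is just an example of a deterministic rule; any other deterministic rule suffers the same fate.

    import numpy as np

    K, n = 3, 1000

    def deterministic_policy(history):
        # Example deterministic rule: play the arm with the highest
        # observed cumulative reward so far (ties broken by index).
        totals = np.zeros(K)
        for arm, reward in history:
            totals[arm] += reward
        return int(np.argmax(totals))

    history = []
    learner_total = 0.0
    arm_totals = np.zeros(K)  # hindsight reward of each fixed arm

    for t in range(n):
        arm = deterministic_policy(history)  # the adversary can simulate this
        x = np.ones(K)                       # adversary's reward vector x_t
        x[arm] = 0.0                         # punish the predicted arm
        learner_total += x[arm]
        arm_totals += x
        history.append((arm, x[arm]))

    regret = arm_totals.max() - learner_total
    print(f"regret after n={n} rounds: {regret:.0f}")  # grows like n, i.e. linear

Running this, the learner collects reward 0 in every round, while the best fixed arm in hindsight collects close to $n$, so the regret is linear in $n$.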
Moreover, I am wondering why the regret for the contextual bandit is defined as follows:
$$R_{n}= \mathbf{E} \left[ \sum_{c \in C} \max_{i \in [K]} \sum_{t \in [n]: \: c_t=c} (x_{it}-X_t) \right]$$
Does it mean that the benchmark fixes one arm per context, applied across all the periods in which that context appears? I thought the regret should instead be defined as follows:
$$ R_n= \mathbf{E} \left[ \max_{i \in [K]} \sum_{t \in [n]} \max_{c_t \in C} (x_{it}-X_t) \right] $$
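To make my confusion concrete, here is a toy numerical comparison I wrote (the numbers are my own, and I am reading my alternative formula as a single-best-fixed-arm benchmark over all rounds): with two contexts in which different arms are optimal, the book's per-context benchmark is strictly larger than a single-fixed-arm benchmark.

    # Toy illustration (my own numbers, not from the book). Two contexts A and B,
    # two arms; in context A arm 0 always pays 1 and arm 1 pays 0, in context B
    # it is the other way around. Suppose the learner collected 0 reward overall
    # and each context occurred in 50 of the n = 100 rounds.

    n_A = n_B = 50
    learner_total = 0.0

    # Book's benchmark: pick the best arm separately for each context.
    benchmark_per_context = max(1 * n_A, 0 * n_A) + max(0 * n_B, 1 * n_B)  # = 100

    # Alternative benchmark: one fixed arm for all rounds, which is how I
    # read my proposed definition. The single best arm wins in only one context.
    benchmark_single_arm = max(1 * n_A + 0 * n_B, 0 * n_A + 1 * n_B)       # = 50

    print("per-context regret:", benchmark_per_context - learner_total)   # 100.0
    print("single-arm regret :", benchmark_single_arm - learner_total)    # 50.0

So the two definitions measure against different benchmarks, and the book's version is the stronger one. Is that the intended reason for defining the contextual regret with the sum over contexts outside the max?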