I have recently read several papers on contextual bandits, especially for the case of binary rewards. However, one very basic aspect is still not entirely clear to me:
In some papers (e.g. https://arxiv.org/pdf/1812.06227.pdf), it is explicitly stated that for each arm $a_1, \dots, a_K$ the expected reward given some context vector $x_t$ is estimated by a separate linear model, i.e. $E[r_{a,x_t}] = \mu(\theta_a^T x_t)$ (with a logistic link function $\mu(\cdot)$ when rewards are binary).
In other papers (e.g. https://arxiv.org/pdf/1703.00048.pdf or https://arxiv.org/pdf/1805.07458.pdf), there seems to be only one parameter vector $\theta$ shared by all arms, and the context $x_{t,a}$ contains arm-specific features as well. In the former case we would therefore estimate $K$ models (one per arm), whereas in the latter case it is a single model for all arms.
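To make the distinction concrete, here is a small numpy sketch of both parameterizations as I understand them (variable names are mine, not from the papers). It also illustrates one connection between them: if the arm-specific features $x_{t,a}$ are constructed as a one-hot encoding of the arm crossed with the context, the shared-$\theta$ formulation reproduces the per-arm predictions exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 3, 4                 # number of arms, context dimension
x_t = rng.normal(size=d)    # shared context observed at time t

def sigmoid(z):
    # logistic link mu(.)
    return 1.0 / (1.0 + np.exp(-z))

# --- Formulation 1: a separate parameter vector theta_a per arm ---
thetas = rng.normal(size=(K, d))        # K disjoint models
p_disjoint = sigmoid(thetas @ x_t)      # E[r | a, x_t] for each arm a

# --- Formulation 2: one shared theta, arm-specific contexts x_{t,a} ---
# One common construction: x_{t,a} = one-hot(a) (x) x_t (Kronecker product),
# so the shared theta of length K*d just stacks the K per-arm vectors.
theta_shared = thetas.reshape(-1)       # concatenate the K models
p_shared = np.array([
    sigmoid(theta_shared @ np.kron(np.eye(K)[a], x_t))   # x_{t,a}
    for a in range(K)
])

assert np.allclose(p_disjoint, p_shared)
```

With this particular encoding the two views coincide, but the shared-$\theta$ formulation seems more general, since $x_{t,a}$ can also contain genuinely arm-dependent features whose coefficients are pooled across arms.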
My questions are:
1) Do I understand correctly that both are valid formulations of the contextual bandit problem, or have I misunderstood the models? To me they appear conceptually quite different.
2) If both are valid, is there any systematic comparison between the two approaches?