I have recently read several papers on contextual bandits, especially for the case of binary rewards. However, one very basic aspect is still not entirely clear to me:
In some papers (e.g. https://arxiv.org/pdf/1812.06227.pdf), it is explicitly stated that for each arm $a_1, \dots, a_K$ the expected reward given some context vector $x_t$ is estimated by a separate linear model, i.e. $E[r_{a,x_t}] = \mu(\theta_a^T x_t)$ (with a logistic link function $\mu(\cdot)$ when rewards are binary).
In other papers (e.g. https://arxiv.org/pdf/1703.00048.pdf or https://arxiv.org/pdf/1805.07458.pdf), there seems to be only one parameter vector $\theta$ shared by all arms, and the context $x_{t,a}$ contains arm-specific features as well. In the former case we would therefore estimate $K$ models (one per arm), whereas in the latter case it is a single model for all arms.
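To make the distinction concrete, here is a small numpy sketch of both parameterizations as I understand them (variable names are mine, not from the papers). It also illustrates one connection between them: if the arm-specific features $x_{t,a}$ are constructed as a one-hot encoding of the arm crossed with the context, the shared-$\theta$ formulation reproduces the per-arm predictions exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 3, 4                 # number of arms, context dimension
x_t = rng.normal(size=d)    # shared context observed at time t

def sigmoid(z):
    # logistic link mu(.)
    return 1.0 / (1.0 + np.exp(-z))

# --- Formulation 1: a separate parameter vector theta_a per arm ---
thetas = rng.normal(size=(K, d))        # K disjoint models
p_disjoint = sigmoid(thetas @ x_t)      # E[r | a, x_t] for each arm a

# --- Formulation 2: one shared theta, arm-specific contexts x_{t,a} ---
# One common construction: x_{t,a} = one-hot(a) (x) x_t (Kronecker product),
# so the shared theta of length K*d just stacks the K per-arm vectors.
theta_shared = thetas.reshape(-1)       # concatenate the K models
p_shared = np.array([
    sigmoid(theta_shared @ np.kron(np.eye(K)[a], x_t))   # x_{t,a}
    for a in range(K)
])

assert np.allclose(p_disjoint, p_shared)
```

With this particular encoding the two views coincide, but the shared-$\theta$ formulation seems more general, since $x_{t,a}$ can also contain genuinely arm-dependent features whose coefficients are pooled across arms.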
My questions are:
1) Do I understand correctly that both are valid formulations of the contextual bandit problem, or have I misunderstood the models? To me they appear conceptually quite different.
2) If both are valid, is there any systematic comparison between the two approaches?