6

I read on Wikipedia that Thompson sampling consists in playing the action ${\displaystyle a \in {\mathcal {A}}}$ according to the probability that this action maximizes the expected reward.

This probability seems to be:

$\int {\mathbb {I}}[{\mathbb {E}}(r \;\vert \;a,\theta )=\max _{{a'}}{\mathbb {E}}(r \; | \; a',\theta )]\; P(\theta |{\mathcal {D}})\,d\theta$

How does one derive this Eq? That is, why is the value of the Eq. above the probability of the action maximizing expected reward)?

This Eq. can also be found in papers on Thompson sampling, e.g. first Eq. here.

Josh
  • 3,408
  • 4
  • 22
  • 46

1 Answers1

4

This formula suffers from heavy notation which perhaps makes it a bit difficult to digest.

Let $A$ be the random event that the action $a^*\in\mathcal{A}$ maximizes the expected reward $$\bar{r}(a,\theta)=\mathbb{E}(r|a,\theta).$$

Let $r^*(\theta)$ be the maximum expected reward for given $\theta$, $$ \bar{r}^*(\theta)=\max_{a'}\bar{r}(a',\theta). $$ The event $A$ we are interested in can be written then as follows: $$ A=\{\theta: \bar{r}(a^*,\theta)=\bar{r}^*(\theta)\}. $$ The probability of this event is: $$ \mathbb{P}(A)=\int_A P(\theta|\mathcal{D})d\theta=\int I_A(\theta)P(\theta|\mathcal{D})d\theta. $$ This is exactly the Wikipedia formula (in new notation).

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
Kostia
  • 1,567
  • 1
  • 8
  • 9
  • 1
    Thanks! Why is the event $A$ defined as a set in the **parameter space** $\theta$ (that is, why is it that $A=\{\theta: \bar{r}(a^*,\theta)=\bar{r}^*(\theta)\}$ ?). Since we are talking about the probability that we are choosing the optimal **action** (i.e. the one that optimizes expected reward), shouldn't A be defined as a set of **actions** in $\mathcal{A}$? – Josh Jul 11 '16 at 00:35
  • 1
    For a fixed action $a^*$, that action may or may not be optimal. This depends on $\theta$, which is random. This makes $A$ also random. Perhaps, my notation is also bad and instead of $A$, it is better to use something like $\Theta_{a^*}$. One more remark: if $A$ is a set of actions, that the indicator function does not depend on $\theta$ and can be take out of integration. – Kostia Jul 11 '16 at 07:37
  • That makes perfect sense. No worries at all about the notation, everything is clearer now. Thanks! – Josh Jul 12 '16 at 17:06