
I am currently reading the paper "Learning from Logged Implicit Exploration Data" (https://arxiv.org/pdf/1003.0120.pdf). I believe my questions can be answered without reading the whole paper, so I would greatly appreciate it if you could help me.

The paper considers a contextual bandit problem and wants to do offline evaluation of a new policy when the logging policy is unknown, and thus could be deterministic, so that we cannot simply apply inverse propensity scoring.

I have 2 questions about the part below. [Screenshot of the relevant passage from the paper: an example where actions $a$ and $b$ are each shown deterministically on different days, with the same number of IID events per day.]

  1. Why can the situation explained there be treated as randomizing between actions $a$ and $b$? I understand it is assumed that the number of events is the same each day and that the events are IID. But does that mean it counts as randomizing because the action called $a$ could just as well have been called $b$ and been shown on day 2 instead of day 1? The whole situation is deterministic, so I don't understand why it can be viewed as randomizing.

  2. So is the definition of $\hat{\pi}(a \mid x)$ simply (the number of times action $a$ was displayed given the feature $x$) / (the total number of actions displayed)? (See the small sketch below for what I mean.)
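To make question 2 concrete, something like this toy sketch (entirely my own; the data and names are made up, not from the paper) is what I have in mind:

```python
from collections import Counter, defaultdict

# Toy logged data: (context, action) pairs, made up for illustration.
logs = [("x1", "a"), ("x1", "b"), ("x1", "a"), ("x2", "b")]

counts = defaultdict(Counter)
for x, a in logs:
    counts[x][a] += 1

def pi_hat(a, x):
    """(# times action a was shown in context x) / (# logged events with context x)."""
    total = sum(counts[x].values())
    return counts[x][a] / total if total else 0.0

print(pi_hat("a", "x1"))  # 2/3 with this toy log
```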

Thank you!

Hunnam

1 Answer


1) As far as I can tell, the point of (1) is to give a reason why you can treat the selection as a random process. They give an example where the day might not be part of the context you want to condition on (i.e., the algorithm had a chance to choose differently, but didn't). In general, if everything is deterministic, then in each context you just get a fixed action, and you won't be able to correct for the selection bias.

2) $\hat{\pi}(a \mid x)$ doesn't have to be the specific estimator you described. The approach works with any estimator of $\pi(a \mid x)$ (for example, multinomial regression, if you believe that was how the actions were chosen). Note that a bad estimator of $\pi$ will yield a bad estimator of the policy's performance.
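For concreteness, here is a rough sketch of how an estimated $\hat{\pi}$ would be plugged into the importance-weighted value estimate of a new policy. This is my own sketch: the function names and the threshold value are made up, and it only roughly follows the estimator form I remember from the paper, which clips the estimated propensity from below by a threshold $\tau$:

```python
def offline_value_estimate(logs, pi_hat, new_policy, tau=0.05):
    """Importance-weighted value estimate of a new (deterministic) policy.

    logs       : iterable of (x, a, r) tuples -- context, logged action, observed reward
    pi_hat     : callable (a, x) -> estimated probability that the logger chose a in x
    new_policy : callable (x) -> the action the new policy would take in context x
    tau        : lower threshold on the estimated propensity (value here is arbitrary)
    """
    total, n = 0.0, 0
    for x, a, r in logs:
        n += 1
        if new_policy(x) == a:  # the logged reward only counts when the policies agree
            total += r / max(pi_hat(a, x), tau)
    return total / n if n else 0.0
```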

tmrlvi
  • Thanks! I just need a little bit more clarification. 1) You said "They give an example where the day might not be part of the context they want to consider". Does that mean that, without considering the day, the situation could be either (a, b) or (b, a), so the probability is 1/2? – Hunnam Jul 30 '19 at 00:21
  • Yes, but they sort of deduce it from the data - they saw $a$ once and $b$ once, and therefore attribute a 0.5 chance to either. That's the reason for the end of the comment - `when the number of events is the same each day, and the events are IID` – tmrlvi Jul 30 '19 at 00:26
  • 2) I am an undergrad with no statistics background (currently self-studying), so I don't know what multinomial regression is (of course I can google it). Do you mean that there are ways to estimate the probability of a deterministic policy? I don't understand how that can be possible, because the probability of a deterministic policy choosing an action is either 0 or 1... it seems like this must be explained by 1), but my poor understanding is getting in the way! – Hunnam Jul 30 '19 at 00:27
  • Ok, so when the assumptions are satisfied, we can treat the data as randomized. Is that right? And we apply this reasoning when we work with the estimator of $\pi$, no matter which estimator we use. – Hunnam Jul 30 '19 at 00:30
  • 2) In essence, there are stochastic policies that are "close enough" to the deterministic one. Rigorously, this is treated by Corollary 3.1. Note that the "close enough" is captured by both the regret and the threshold $\tau$. – tmrlvi Jul 30 '19 at 00:36
  • Thanks so much. I accepted your answer. Since you seem to have read the part around Theorem 3.1, I would appreciate it if you could answer one more question. The theorem states "for any sequence". Does that mean the order of execution of the policies doesn't have to be from $\pi_1$ to $\pi_T$, but can be arbitrary, like $\pi_3$, $\pi_7$, and then $\pi_1$? – Hunnam Jul 30 '19 at 01:11
  • Yes. In fact, this is the main part of the proof of Theorem 3.1 - changing the order of the chosen policies. Note that this is possible because the expectation is linear and the policies don't depend on the data. – tmrlvi Jul 30 '19 at 08:00
  • Hello! I would really appreciate it if you could answer my last question..! In the Theorem 3.1 we talked about, there is an assumption about identical draws over $T$ rounds. What does that mean - draws of the policy or of the context? My main question is: can it be interpreted as saying that the order in which the policies are picked is (or must be) independent of the given contexts, so that the execution of the policies can be regarded as one stochastic policy? Thank you! – Hunnam Aug 04 '19 at 23:29