
I am currently reading the paper "Exploration Scavenging", whose link is http://delivery.acm.org/10.1145/1400000/1390223/p528-langford.pdf?ip=128.135.98.49&id=1390223&acc=ACTIVE%20SERVICE&key=06A6A3A8AFB87403%2E37E789C11FBE2C91%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&acm=1564955884_539f3309dc879c3f2553a877c608e859

It is known that when we have a stochastic exploration policy in a contextual bandit setting, we can apply inverse propensity scoring to evaluate a new policy.
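To make this concrete, here is a minimal sketch of the standard IPS estimator (the function names and the assumption of a deterministic target policy are my own simplifications, not from the paper):

```python
def ips_value_estimate(logged_data, new_policy):
    """Inverse propensity score (IPS) estimate of a new policy's value.

    logged_data: list of (context, action, reward, propensity) tuples, where
        `propensity` is the probability the stochastic exploration policy
        assigned to `action` given `context` at logging time.
    new_policy: function mapping a context to an action (a deterministic
        target policy, for simplicity of this sketch).
    """
    total = 0.0
    for context, action, reward, propensity in logged_data:
        # Only logged events whose action agrees with the new policy's choice
        # contribute, and they are up-weighted by 1 / propensity.
        if new_policy(context) == action:
            total += reward / propensity
    return total / len(logged_data)
```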

The paper tries to shed light on how to evaluate a new policy using logged data from a possibly deterministic policy. It claims that if the exploration policy chooses an action independently of the given context at every step, we can evaluate a new policy well. It also says that if we have a set of several policies whose actions depend on the given context, but our choice of which policy to pick from the set is independent of the context, we can still effectively evaluate a new policy.
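If I understand the paper's idea correctly, the propensity then no longer needs to be known: because the exploration policy's action choice does not depend on the context, the empirical frequency of each action in the log can stand in for it. Here is a hedged sketch of that reading (please check the paper for the exact estimator and its conditions):

```python
from collections import Counter

def scavenging_value_estimate(logged_data, new_policy):
    """Sketch of off-policy evaluation when the exploration policy chose
    actions independently of the context (propensities need not be logged).

    logged_data: list of (context, action, reward) tuples.
    new_policy: function mapping a context to an action.
    """
    n = len(logged_data)
    # The empirical frequency of each action stands in for an unknown propensity.
    action_counts = Counter(action for _, action, _ in logged_data)
    total = 0.0
    for context, action, reward in logged_data:
        if new_policy(context) == action:
            total += reward / (action_counts[action] / n)
    return total / n
```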

My question is: is the time step (the i^th step) at which a policy executes an action also part of the context? The two questions below are examples of the main question.

If one policy repeats actions 1, 2, 3 from the beginning, so that the sequence of its actions is 123123123123..., are its actions independent of the given contexts? I thought so at first, but since the context at the 1st step always receives action 1, the context at the 2nd step action 2, and so on, it now seems like the actions are dependent.

Similarly, suppose we have two policies A and B that are context-dependent. Also, let's say we pick A up to the k^th step and pick B for the rest of the steps. Is the choice of policies here context-dependent?

Thank you!

Hunnam

1 Answer


My question is, is the time step (the i^th step) at which a policy executes an action also a context?

Contextual bandit algorithms should not use the time step as part of the context; they require the state to be independent of the time step. Making the state include the time step relates the features at $t$ and $t+1$, so the state evolves predictably, which moves the problem definition from bandit algorithms to full reinforcement learning.

As a result, your example:

If one policy repeats actions 1, 2, 3 from the beginning so that the sequence of its actions is 123123123123

is not a policy for a bandit algorithm, because in order to create it you would need the policy to be a function of $t$.
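A minimal illustration of the distinction (the feature name is hypothetical):

```python
def repeating_policy(t):
    # Depends only on the step index t, producing 1, 2, 3, 1, 2, 3, ...
    # This is a function of time, so it is not a contextual bandit policy
    # in the sense used above.
    return (t % 3) + 1

def contextual_policy(context):
    # A bandit policy maps the context alone to an action.
    return 1 if context["feature"] > 0 else 2
```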

suppose we have two policies A and B that are context-dependent. Also, let's say we pick A up to the k^th step and pick B for the rest of the steps. Is the choice of policies here context-dependent?

No, because the time step is not part of the context. However, you could arrange the data after collecting it so that the sequence did contain context, e.g. by sorting the data by one of the features, in which case your selection between the two policies would break the requirement.
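A hedged sketch of that caveat (the toy data and feature name are made up): if the log is sorted by a feature and the first k events are then handled by policy A and the rest by policy B, the selection between A and B becomes correlated with that feature, i.e. with the context.

```python
logged_contexts = [{"feature": 0.9}, {"feature": -0.3}, {"feature": 0.1}]

# Sorting the log by a feature re-introduces context into the sequence order.
sorted_log = sorted(logged_contexts, key=lambda c: c["feature"])

k = len(sorted_log) // 2
policy_choice = ["A" if i < k else "B" for i in range(len(sorted_log))]
# Now A handles the low-feature contexts and B the high-feature contexts,
# so the choice of policy is no longer independent of the context.
```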

Neil Slater
  • Thank you so much. Your answer helped me understand perfectly. I just have two more quick questions and would appreciate it if you answered them as well. 1) As you pointed out, a policy for a bandit algorithm is a function of "contexts" (that's why the policy in my first example is not applicable to a bandit problem). Thus, when the paper assumes "the exploration policy chooses an action independent of a given context at any step" for the sake of evaluating a new policy, is it correct that no single deterministic policy can satisfy such a condition, but only a stochastic policy? – Hunnam Aug 06 '19 at 17:18
  • I cannot imagine a deterministic policy, which is a function of contexts, being independent of contexts. – Hunnam Aug 06 '19 at 17:18
  • 2) Let's view the whole process in the 2nd example (A and B being context-dependent, but the selection between them depending only on the step) as one exploration policy. Then I think the contexts here are not from an identical distribution, because even if they were, the policy divides the whole process into two phases, so they are no longer from an identical distribution. Is that correct? Thank you! – Hunnam Aug 06 '19 at 17:22
  • Or perhaps the exploration policy is just not applicable to the bandit problem, since it depends on the steps.. – Hunnam Aug 06 '19 at 17:22
  • @Hunnam: Yes it is referring to a stochastic policy. It also has to have a distribution independent from context. A simple even distribution would do. – Neil Slater Aug 06 '19 at 18:48