I am currently reading the paper "Exploration Scavenging" (Langford et al.); the link is: http://delivery.acm.org/10.1145/1400000/1390223/p528-langford.pdf?ip=128.135.98.49&id=1390223&acc=ACTIVE%20SERVICE&key=06A6A3A8AFB87403%2E37E789C11FBE2C91%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&acm=1564955884_539f3309dc879c3f2553a877c608e859
It is known that when the exploration (logging) policy in a contextual bandit setting is stochastic, we can apply inverse propensity scoring to evaluate a new policy offline.
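For concreteness, this is the estimator I have in mind; the function and variable names below are just my own illustration, not anything from the paper:

```python
import numpy as np

def ips_value(contexts, actions, rewards, propensities, new_policy):
    """Inverse propensity scoring estimate of new_policy's value from logged data.

    propensities[i] is the probability with which the logging policy chose
    actions[i] given contexts[i]; it must be known and strictly positive.
    """
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    # Keep a logged reward only when the new policy would have taken the same action,
    # and up-weight it by the inverse of the logging probability.
    matches = np.array([new_policy(x) == a for x, a in zip(contexts, actions)])
    return np.mean(matches * rewards / propensities)
```

For example, if the logging policy picked uniformly at random among K actions, every propensity would simply be 1/K.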
The paper tries to shed light on how to evaluate a new policy given logged data from a possibly deterministic policy. It claims that if the exploration policy chooses its action independently of the context at every step, we can still evaluate a new policy well. It also says that if we have a set of policies, each of which chooses actions depending on the context, but our choice of which policy from the set to run is independent of the context, we can likewise evaluate a new policy effectively.
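As I understand it (this is my own reading, not necessarily the paper's exact estimator), the context-independence of the action choice is what allows the empirical action frequencies in the log to stand in for the unknown propensities, roughly like this:

```python
import numpy as np
from collections import Counter

def scavenged_value(contexts, actions, rewards, new_policy):
    """Rough sketch: use empirical action frequencies in place of true propensities.
    This substitution only seems justified if the logging policy ignored the context."""
    n = len(actions)
    rewards = np.asarray(rewards, dtype=float)
    freq = Counter(actions)  # how often each action appears in the log
    p_hat = np.array([freq[a] / n for a in actions])
    matches = np.array([new_policy(x) == a for x, a in zip(contexts, actions)])
    return np.mean(matches * rewards / p_hat)
```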
My question is: is the time step (the i-th step) at which a policy executes an action itself part of the context? The two questions below are examples of this main question.
Suppose one policy repeats the actions 1, 2, 3 from the beginning, so that its action sequence is 123123123123.... Are its actions independent of the given contexts? I thought so at first, but since the context arriving at the 1st step always receives action 1, the context at the 2nd step always receives action 2, and so on, it now seems as though the actions do depend on the contexts.
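To make this concrete, the policy I have in mind only reads the time index and never inspects the context (the signature is just my own illustration):

```python
def cyclic_policy(t, context):
    """Plays 1, 2, 3, 1, 2, 3, ... -- depends only on the step index t,
    never on the context."""
    return (t % 3) + 1
```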
Similarly, suppose we have two policies A and B that are both context-dependent, and we run A up to the k-th step and B for the remaining steps. Is the choice between the policies here context-dependent?
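Again for concreteness, the switching rule I mean looks only at the step index; policy_A, policy_B, and k are placeholders of my own:

```python
def switching_policy(t, context, policy_A, policy_B, k):
    """Runs policy_A for the first k steps and policy_B afterwards.
    The switch depends only on t; each inner policy may use the context."""
    return policy_A(context) if t <= k else policy_B(context)
```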
Thank you!