I have two questions regarding upper confidence bound (UCB) exploration in reinforcement learning:
UCB exploration is derived from Hoeffding's inequality, which assumes that the rewards are bounded in the interval [0, 1]. If the rewards are not bounded in [0, 1], and we do not know the minimum and maximum reward needed to normalize them into [0, 1], how is UCB applied?
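For reference, my understanding of the derivation: for i.i.d. rewards $X_1, \dots, X_n \in [0, 1]$ with mean $\mu$, Hoeffding's inequality gives

$$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le e^{-2n\epsilon^2},$$

and choosing $\epsilon = \sqrt{2\ln t / n_j}$ (so the bound becomes $t^{-4}$) yields the UCB1 index of Auer et al. (2002),

$$a_t = \operatorname*{arg\,max}_j \left(\bar{x}_j + \sqrt{\frac{2\ln t}{n_j}}\right),$$

where $\bar{x}_j$ is the empirical mean reward of arm $j$ and $n_j$ is the number of times arm $j$ has been pulled. The constant in the exponent, and hence the bonus term, seems to depend on the rewards lying in $[0, 1]$.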
The original UCB paper by Auer et al. (2002) applies to the multi-armed bandit setting, where the rewards are i.i.d. random variables. How can we apply UCB in the MDP setting, where we need an upper confidence bound on Q(s, a)? A rough sketch of the kind of thing I have in mind is below.
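This is only a minimal sketch to make the question concrete: tabular Q-learning with a count-based UCB-style bonus on action selection. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`), the function names, and the constants are hypothetical placeholders, not taken from Auer et al. or any other paper:

```python
import numpy as np

def ucb_action(Q, N, s, t, c=1.0):
    """Pick argmax_a of Q[s, a] + c * sqrt(ln t / N[s, a]); untried actions get +inf."""
    bonus = np.where(N[s] > 0,
                     c * np.sqrt(np.log(max(t, 2)) / np.maximum(N[s], 1)),
                     np.inf)  # force each action in s to be tried at least once
    return int(np.argmax(Q[s] + bonus))

def q_learning_with_ucb(env, n_states, n_actions, episodes=500,
                        alpha=0.1, gamma=0.99, c=1.0):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))   # visit counts per (s, a)
    t = 0                                 # global time step
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            t += 1
            a = ucb_action(Q, N, s, t, c)
            s_next, r, done = env.step(a)
            N[s, a] += 1
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

My doubt is whether the Hoeffding-based bonus is still justified here, since the Q-learning targets are neither i.i.d. nor necessarily bounded in $[0, 1]$.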