I have two questions regarding upper confidence bound (UCB) exploration in reinforcement learning:
UCB exploration is derived from Hoeffding's inequality, which assumes that the rewards are bounded in the interval [0, 1]. If the rewards are not bounded in [0, 1], and we do not know the minimum and maximum reward needed to normalize them into [0, 1], how is UCB applied?
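For reference, my understanding of the derivation: for i.i.d. rewards $X_1, \dots, X_n \in [0, 1]$ with mean $\mu$, Hoeffding's inequality gives

$$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le e^{-2n\epsilon^2},$$

and choosing $\epsilon = \sqrt{2\ln t / n_j}$ (so the bound becomes $t^{-4}$) yields the UCB1 index of Auer et al. (2002),

$$a_t = \operatorname*{arg\,max}_j \left(\bar{x}_j + \sqrt{\frac{2\ln t}{n_j}}\right),$$

where $\bar{x}_j$ is the empirical mean reward of arm $j$ and $n_j$ is the number of times arm $j$ has been pulled. The constant in the exponent, and hence the bonus term, seems to depend on the rewards lying in $[0, 1]$.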
The original UCB paper by Auer et al. (2002) applies to the multi-armed bandit setting, where the rewards are i.i.d. random variables. How can we apply UCB in the MDP setting, where we need an upper confidence bound on Q(s, a)? A rough sketch of the kind of thing I have in mind is below.
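This is only a minimal sketch to make the question concrete: tabular Q-learning with a count-based UCB-style bonus on action selection. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`), the function names, and the constants are hypothetical placeholders, not taken from Auer et al. or any other paper:

```python
import numpy as np

def ucb_action(Q, N, s, t, c=1.0):
    """Pick argmax_a of Q[s, a] + c * sqrt(ln t / N[s, a]); untried actions get +inf."""
    bonus = np.where(N[s] > 0,
                     c * np.sqrt(np.log(max(t, 2)) / np.maximum(N[s], 1)),
                     np.inf)  # force each action in s to be tried at least once
    return int(np.argmax(Q[s] + bonus))

def q_learning_with_ucb(env, n_states, n_actions, episodes=500,
                        alpha=0.1, gamma=0.99, c=1.0):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))   # visit counts per (s, a)
    t = 0                                 # global time step
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            t += 1
            a = ucb_action(Q, N, s, t, c)
            s_next, r, done = env.step(a)
            N[s, a] += 1
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

My doubt is whether the Hoeffding-based bonus is still justified here, since the Q-learning targets are neither i.i.d. nor necessarily bounded in $[0, 1]$.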