
I just started learning value iteration in reinforcement learning, and I am confused about the theorem stating that the number of iterations required to reach an error of at most $\epsilon$ grows with $\lambda$ as:

$$N = \frac{\log \left(\frac{2R_{max}}{\epsilon(1-\lambda)}\right)}{\log \left(\frac{1}{\lambda}\right)}$$

where the reward satisfies $|r|\le R_{max}$. How is this number of iterations $N$ derived?
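For concreteness, here is a small script I wrote to evaluate the bound numerically (the helper name `iterations_for_error` and the example numbers are just illustrative):

```python
import math

def iterations_for_error(r_max, epsilon, lam):
    """Number of value-iteration sweeps the bound guarantees
    suffice for a max-norm error of at most epsilon."""
    return math.ceil(math.log(2 * r_max / (epsilon * (1 - lam)))
                     / math.log(1 / lam))

# Example: rewards bounded by 1, discount 0.9, target accuracy 1e-3
print(iterations_for_error(r_max=1.0, epsilon=1e-3, lam=0.9))  # -> 94
```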


1 Answer


This bound can be derived as follows. Assume first that the value-function estimate $\hat{V}^0$ is initialized to zero for all states, where the superscript indicates the iteration at which the estimate is considered. Since $|r|\le R_{\rm{max}}$, the initial error satisfies $$\max_{s\in\mathcal{S}}|V(s)-\hat{V}^0(s)|\le\sum\limits_{t=0}^{\infty}\lambda^tR_{\rm{max}}=\frac{R_{\rm{max}}}{1-\lambda}.$$

Now assume a deterministic policy and deterministic transition dynamics, and write $V'(s)$ for the value of the successor of state $s$, with $\hat{V}'^k(s)$ defined the same way. Each subsequent iteration then contracts the error: $$e^{k+1}\triangleq\max_{s\in\mathcal{S}}|V(s)-\hat{V}^{k+1}(s)|=\max_{s\in\mathcal{S}}|r(s)+\lambda V'(s)-r(s)-\lambda\hat{V}'^k(s)|\\=\lambda\max_{s\in\mathcal{S}}|V'(s)-\hat{V}'^k(s)|\le\lambda e^k,$$ where the last step uses the fact that the maximum over successor states is at most the maximum over all states.

Unrolling the recursion gives $e^N\le\lambda^N\frac{R_{\rm{max}}}{1-\lambda}$. If we now want $e^N\le\varepsilon$, it suffices that $\lambda^N\frac{R_{\rm{max}}}{1-\lambda}\le\varepsilon$, i.e. $\lambda^N\le\frac{\varepsilon(1-\lambda)}{R_{\rm{max}}}$. Taking logarithms, $$N\log{\lambda}\le\log{\frac{\varepsilon(1-\lambda)}{R_{\rm{max}}}}=-\log{\frac{R_{\rm{max}}}{\varepsilon(1-\lambda)}},$$ and since $\log{\lambda}=-\log{\frac{1}{\lambda}}$, dividing by the negative quantity $\log{\lambda}$ flips the inequality, yielding $$N\ge\frac{\log{\frac{R_{\rm{max}}}{\varepsilon(1-\lambda)}}}{\log{\frac{1}{\lambda}}}.$$
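To sanity-check the contraction argument, here is a minimal sketch on a made-up 3-state deterministic MDP with a fixed policy. The states form a cycle, so every state is some state's successor and the error shrinks by exactly a factor $\lambda$ per sweep:

```python
import numpy as np

# A hypothetical deterministic MDP under a fixed policy: 3 states in a cycle.
lam = 0.9
next_state = np.array([1, 2, 0])    # deterministic transitions s -> next_state[s]
r = np.array([1.0, -0.5, 0.3])      # per-state rewards, |r| <= R_max = 1

# Exact value function: solve (I - lam * P) V = r
P = np.zeros((3, 3))
P[np.arange(3), next_state] = 1.0
V = np.linalg.solve(np.eye(3) - lam * P, r)

# Value iteration from V_hat^0 = 0; the error contracts by lam each sweep
V_hat = np.zeros(3)
e_prev = np.abs(V - V_hat).max()
for k in range(10):
    V_hat = r + lam * V_hat[next_state]   # Bellman backup for the fixed policy
    e = np.abs(V - V_hat).max()
    print(f"iter {k + 1}: error = {e:.6f}, ratio = {e / e_prev:.3f}")  # ratio == lam
    e_prev = e
```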

Not sure where the factor 2 comes from; one plausible source is that if $\hat{V}^0$ is not zero but merely bounded by $\frac{R_{\rm{max}}}{1-\lambda}$ in magnitude, then the initial error can only be bounded by $|V(s)-\hat{V}^0(s)|\le|V(s)|+|\hat{V}^0(s)|\le\frac{2R_{\rm{max}}}{1-\lambda}$. Also, you probably want to consider the more general case with stochastic transition dynamics and a stochastic policy. The derivation above should serve as a good starting point for that, though.