
We need to define just a few things first:

  1. $(a_t, s_t) \in \mathcal{A} \times \mathcal{S}$ are the action and state at time $t$;
  2. $r(s_t, a_t)$ is the reward for taking action $a_t$ in state $s_t$;
  3. Regarding timing, at every step $t = 1, ..., T$ our agent observes $s_t$, then chooses $a_t$. He then receives the reward $r(s_t, a_t)$ and a new state is drawn as $s_{t+1} \sim p(s_{t+1} | s_t, a_t)$. By convention, $p(s_1 | s_0, a_0) = p(s_1)$ and $p(s_{T+1} = s^*| s_T, a_T) = p(s_{T+1} = s^*) = 1$ (i.e., the initial distribution can't depend on a past that doesn't exist, and the final transition is toward a terminal state $s^*$ with probability 1 because the game always ends at time $T$);
  4. We define a policy as a (possibly degenerate) conditional density $\pi(a_t \mid s_t)$ from which actions are drawn;
  5. Our agent cares about the total expected reward $E \left[ \sum_{t=1}^T r(s_t, a_t) \right]$ (a toy numerical sketch of this setup follows the list).
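
To fix ideas, here is a minimal numerical sketch of this setup in Python, on a made-up two-state, two-action problem. Everything in it (the transition array `P`, the reward table `R`, the uniform policy, the horizon `T = 10`) is an arbitrary illustration of the notation above, not anything taken from a specific source.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 2, 2, 10
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = p(. | s_t = s, a_t = a)
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = r(s, a)
p1 = np.array([0.5, 0.5])                                         # p(s_1): no dependence on a nonexistent past

def pi(s):
    """A (non-degenerate) policy pi(a_t | s_t); here simply uniform over actions."""
    return rng.integers(n_actions)

def episode_return():
    """One draw of sum_{t=1}^T r(s_t, a_t): observe s_t, choose a_t, collect the reward, transition."""
    s = rng.choice(n_states, p=p1)
    total = 0.0
    for t in range(1, T + 1):
        a = pi(s)
        total += R[s, a]
        s = rng.choice(n_states, p=P[s, a])  # the draw at t = T is unused: the episode ends (terminal state s*)
    return total

# Monte Carlo estimate of the objective E[ sum_{t=1}^T r(s_t, a_t) ]
print(np.mean([episode_return() for _ in range(5_000)]))
```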

The problem I am having here is that the state vector should contain a time index unless the horizon is infinite. For example, in economics, we'd go from
$$ V(k_t) := \underset{\{ c_t, k_{t+1} \}_{t=0}^\infty}{\max} E_0 \sum_{t=0}^\infty \beta^t \ln c_t \; \; s.t. \; \; c_t + k_{t+1} = z_t k_t^\alpha + (1-\delta) k_t$$
to
$$ V(k_t) := \underset{ c_t, k_{t+1} }{\max} \left\{ \ln c_t + \beta E_t \left[ V(k_{t+1}) \right] \right\} \; \; s.t. \; \; c_t + k_{t+1} = z_t k_t^\alpha + (1-\delta) k_t$$
where, say, $z_t \sim N(0,\sigma^2)$, precisely because the infinite horizon makes the problem look the same at every date, so you can move from one period to the next with the same value function. If the horizon were finite, the continuation value would depend on how many periods remain, and I'd get
$$ V(k_t, t) := \underset{ c_t, k_{t+1} }{\max} \left\{ \ln c_t + \beta E_t \left[ V(k_{t+1}, t+1) \right] \right\} \; \; s.t. \; \; c_t + k_{t+1} = z_t k_t^\alpha + (1-\delta) k_t$$

and you'd solve that by backward induction: $V(k_{T+1}, T+1) = 0$ is obvious (the game is done); then, since the objective is monotonic, the solution at $T$ is to consume everything, $c_T = z_T k_T^\alpha + (1-\delta)k_T$, and save nothing, $k_{T+1} = 0$, which gives you $V(k_T, T)$; and so on backwards to $t = 1$.
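
For what it's worth, here is a rough sketch of that backward induction in Python, on a discretized capital grid. To keep it short I switch the shock off ($z_t = 1$) and use arbitrary parameter values, so it only illustrates the time-indexed value function $V(k_t, t)$ rather than solving the stochastic model above.

```python
import numpy as np

alpha, beta, delta, T = 0.3, 0.96, 0.1, 20        # arbitrary parameters for illustration
k_grid = np.linspace(0.05, 10.0, 200)             # candidate values for both k_t and k_{t+1}

# Resources z_t k_t^alpha + (1 - delta) k_t with z_t = 1, and implied consumption
resources = k_grid**alpha + (1 - delta) * k_grid
c = resources[:, None] - k_grid[None, :]          # c[i, j]: consumption if k_t = k_grid[i], k_{t+1} = k_grid[j]
u = np.where(c > 0, np.log(np.maximum(c, 1e-12)), -np.inf)

# The value function carries the time index: V[t] is V(., t), and V[T+1] = 0 (the game is over).
V = np.zeros((T + 2, len(k_grid)))
policy = np.zeros((T + 1, len(k_grid)), dtype=int)

for t in range(T, 0, -1):                         # backward induction: t = T, T-1, ..., 1
    values = u + beta * V[t + 1][None, :]         # ln c_t + beta * V(k_{t+1}, t+1)
    policy[t] = np.argmax(values, axis=1)
    V[t] = values[np.arange(len(k_grid)), policy[t]]

# At t = T the optimum is to save (almost) nothing, i.e. the lowest grid point, and consume the rest.
print(k_grid[policy[T]][:5])
```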

What I don't get is why none of the above seems to be discussed in reinforcement learning. The treatments I've seen just operate as if you can always write a Bellman equation for a state-value function $V^\pi$ or an action-value function $Q^\pi$ without worrying about a finite horizon. Sutton and Barto (2020) mention that some episodes do end (sometimes at a random time), but that's about as far as it goes. When they present the policy gradient theorem, it's for the infinite-horizon case.

Does anyone see where I'm missing something?

Stéphane
  • Can't you just define a duplicate of your states, called $\mathcal{S}'$, corresponding to end-of-horizon states? In other words, once you reach them, any further action perpetually keeps you in those states. Yes, you lose uniqueness of an optimal policy since you now have a disconnected state space, but your incremental reward stays at 0 once you're in these absorbing states. Alternatively, the discount factor $\beta$ essentially approximates a finite horizon, especially as you make it smaller. – Alex R. Feb 05 '21 at 00:42
  • I just ran into a paper that does exactly what I described above: you **must** take the time step into account when defining the value functions ($V^\pi, Q^\pi$) and even the policy $\pi$: https://sudeepraja.github.io/papers/PEPSRL.pdf. I think the material I'd previously read simply focused on different problems. (A small sketch of this time-in-the-state idea follows the comments.) – Stéphane Feb 05 '21 at 04:30
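
To make the fix discussed in these comments concrete, here is a tiny, hypothetical sketch of folding the time step into the state, so that $V^\pi$, $Q^\pi$ and $\pi$ can depend on $t$ through an ordinary stationary state. The `env` object with `reset()`/`step(a)` methods is an assumption of the sketch, not any particular library's interface.

```python
class TimeAugmentedEnv:
    """Wrap a hypothetical environment so the agent observes (s_t, t) instead of s_t alone."""

    def __init__(self, env, T):
        self.env, self.T, self.t = env, T, 0

    def reset(self):
        self.t = 1
        s = self.env.reset()            # assumed: returns the initial state s_1
        return (s, self.t)

    def step(self, a):
        s_next, r = self.env.step(a)    # assumed: returns (next state, reward)
        self.t += 1
        done = self.t > self.T          # deterministic termination at the horizon T
        return (s_next, self.t), r, done
```

With the time index inside the state, the usual stationary Bellman machinery applies unchanged, which is the same information the linked paper keeps by indexing $V^\pi$, $Q^\pi$ and $\pi$ by $t$.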
