
What would a value iteration algorithm look like if I specify a maximum time step?

For example, from a given state the environment does not reach a terminating state but should instead terminate because it has exceeded the maximum number of steps (and thus the value for that state would be the future return up until the number of steps surpasses n). Essentially, I am trying to learn a truncated policy, but I am having difficulty figuring out how to do that using dynamic programming. I suppose another way to phrase this would be depth-limited value iteration.


1 Answer


If you add a maximum time step to an environment, that does not change the value calculation algorithm at all; it changes the state representation. You should add either the current time step or the time remaining as a new property of the state.

If the current time step is not part of the state, then an agent could be in the same state with different amounts of time remaining, and it could not calculate a consistent value for expected future reward, because the expected future reward will usually be correlated with the amount of remaining time. Without the time in the state, the environment becomes non-Markov (or partially observable). Whilst there are some algorithms that may cope with this - an RNN variant of DQN springs to mind, because it could learn its own time step counter - value iteration is not one of them.
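
To make this concrete, here is a minimal sketch (in Python) of value iteration over the augmented state (state, steps remaining). The `states`, `actions`, `transitions` and `reward` objects are hypothetical placeholders for a small tabular MDP, not part of any particular library; this is one way to implement the idea above, not the only one:

```python
def finite_horizon_value_iteration(states, actions, transitions, reward, gamma, max_steps):
    # V[(s, k)] = best expected return from state s with k steps remaining.
    # transitions[(s, a)] is assumed to be a list of (probability, next_state) pairs,
    # and reward(s, a, s2) gives the immediate reward for that transition.
    V = {(s, 0): 0.0 for s in states}   # with 0 steps left the episode is over, no future reward
    for k in range(1, max_steps + 1):   # build values one extra step of horizon at a time
        for s in states:
            V[(s, k)] = max(
                sum(p * (reward(s, a, s2) + gamma * V[(s2, k - 1)])
                    for p, s2 in transitions[(s, a)])
                for a in actions
            )
    return V
```

The greedy policy derived from this table depends on both the underlying state and the number of steps remaining, which is exactly the truncated behaviour asked about in the question.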

There may be exceptions to this where the state already implies a time step. A good example is Tic Tac Toe (Noughts and Crosses), because the number of filled squares equals the current time step. An environment like this where you added a rule to stop after a certain number of turns (before the board was filled) would not need a separate turn counter.

  • Instead of doing this, can I think of each iteration of value iteration as looking ahead by another step? So if I wanted the maximum length of a trajectory to be 5 timesteps, would I just perform value iteration for 5 iterations? – Stephane Hatgiskessell Sep 29 '21 at 21:54
  • @StephaneHatgiskessell: That might work, provided you do **not** use in-place calculations (which might end up calculating for more than one time step in an iteration, depending on the sequence in which you sweep states). In general, the effective trajectory length of value iteration is not *simply* related to the number of iterations, and the chance of a mistake when doing this is high. You may be better off doing a full look-ahead tree search instead if you want to keep the MDP in its simplest form and still limit the time horizon. – Neil Slater Sep 30 '21 at 07:15
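
For reference, a minimal sketch of the approach discussed in the comments above, using the same hypothetical MDP objects as in the earlier sketch. Starting from V = 0 and running exactly `max_steps` *synchronous* sweeps reproduces the truncated-horizon values, because each sweep only reads values produced by the previous sweep; an in-place (asynchronous) sweep could read values already updated in the current sweep and mix horizons, which is the pitfall noted in the comment:

```python
def truncated_value_iteration(states, actions, transitions, reward, gamma, max_steps):
    V = {s: 0.0 for s in states}        # horizon-0 values: no future reward
    for _ in range(max_steps):
        V_new = {}                      # write to a fresh table, not in place, to keep the sweep synchronous
        for s in states:
            V_new[s] = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for p, s2 in transitions[(s, a)])
                for a in actions
            )
        V = V_new
    return V
```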