
In step (f) of the Dyna-Q algorithm we plan by taking random samples from the experience/model for some number of steps.

Wouldn't it be more efficient if we constructed an MDP from experience, by estimating the state transition probabilities and reward distribution, and solved it by dynamic programming?

DYNA-Q: [pseudocode image]
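(For readers who cannot view the image: the figure is the tabular Dyna-Q pseudocode with steps (a) to (f). Below is a rough Python sketch of one time step, not the exact figure; the `env.actions(s)` and `env.step(s, a)` helpers are hypothetical placeholders, `Q` is assumed to be something like a `collections.defaultdict(float)`, and `model` a plain dict.)

```python
import random

def dyna_q_step(env, Q, model, state, n_planning=10,
                alpha=0.1, gamma=0.95, epsilon=0.1):
    """One real time step of tabular Dyna-Q (rough sketch only).
    Q can be a collections.defaultdict(float); model is a dict
    mapping (state, action) -> (reward, next_state)."""
    # (a)-(c): pick an action epsilon-greedily and take it in the real environment
    actions = env.actions(state)
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    reward, next_state = env.step(state, action)

    # (d): direct RL update (one-step Q-learning) from the real transition
    best_next = max(Q[(next_state, b)] for b in env.actions(next_state))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    # (e): model learning, storing a single (reward, next_state) per (state, action),
    #      which is where the deterministic-environment assumption lives
    model[(state, action)] = (reward, next_state)

    # (f): planning, n updates on randomly chosen previously observed (s, a) pairs
    for _ in range(n_planning):
        s, a = random.choice(list(model.keys()))
        r, s2 = model[(s, a)]
        best = max(Q[(s2, b)] for b in env.actions(s2))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    return next_state
```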

gnikol

1 Answer


Wouldn't it be more efficient if we constructed an MDP from experience, by estimating the state transition probabilities and reward distribution, and solved it by dynamic programming?

It depends what you mean by "efficient". The Dyna-Q code given is already strongly related to Value Iteration in any case. If in step (f) you replaced the "random previously observed" sampling with repeated full sweeps of all states and actions, then you could also set $\alpha = 1$, check for the maximum value change falling below some threshold, and the algorithm would be exactly Value Iteration (for a deterministic environment, as assumed in the pseudocode).
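To make that connection concrete, here is a minimal sketch (my own illustration, not from the posted figure) of what step (f) turns into with full sweeps and $\alpha = 1$: plain Value Iteration over the deterministic model that step (e) builds.

```python
def value_iteration_on_model(model, gamma=0.95, theta=1e-6):
    """Value Iteration over the learned deterministic model, i.e. the same
    dict of (state, action) -> (reward, next_state) that step (e) fills in.
    Sketch only: states absent from the model are treated as terminal (value 0)."""
    Q = {sa: 0.0 for sa in model}
    # group the known actions for each state so we can take the max over them
    actions = {}
    for (s, a) in model:
        actions.setdefault(s, []).append(a)
    while True:
        delta = 0.0
        for (s, a), (r, s_next) in model.items():      # full sweep, not random samples
            best_next = max((Q[(s_next, b)] for b in actions.get(s_next, [])),
                            default=0.0)
            new_q = r + gamma * best_next               # alpha = 1: overwrite, no averaging
            delta = max(delta, abs(new_q - Q[(s, a)]))
            Q[(s, a)] = new_q
        if delta < theta:                               # stop once the max change is small
            return Q
```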

However, a basic dynamic programming approach requires you to collect all the state transition rules accurately. If you have a non-deterministic environment* and/or a large state space, this may not be feasible in practice. The algorithm would not start learning until after you had collected the data, and you would have no guidance for how to explore the state and action space efficiently (because your learning algorithm has nothing yet to base a policy on).

Dyna-Q allows the agent to start learning and improving incrementally much sooner. It does so at the expense of working with rougher sample estimates of the transition dynamics, and therefore needs a recency-weighted value update. In practice the early rough convergence of Q values helps focus exploration and learning effort, and across a wide range of problems the algorithm will be more efficient than basic Dynamic Programming, both in the number of samples required and in the amount of computation.
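For reference, the two update rules being compared here are the standard tabular forms: the sampled, recency-weighted Dyna-Q/Q-learning update versus the expected DP backup.

$$Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \big] \qquad \text{(sampled update)}$$

$$Q(s, a) \leftarrow \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} Q(s', a') \big] \qquad \text{(expected DP backup)}$$

The first only needs individual samples and forgets stale estimates at rate $\alpha$; the second needs the full distribution $p(s', r \mid s, a)$, which is exactly what is expensive to estimate from experience.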

There will still be environments where DP methods do better, though. If you are concerned about it, run some experiments to determine what works for your problem.

The weakness of DP methods may also be an advantage: if you are required to learn an optimal policy for all states of an environment, then DP could well be a more efficient way to do this. Dyna-Q will tend to generate accurate values and an optimal policy only for states reachable from the start state whilst following a near-optimal policy; it may never learn about edge cases. However, this may be a necessary compromise for large state spaces, and if you control the start states seen by agents when they need to act optimally, then there is no need to learn policies that deal with unreachable states.

I suggest comparing the two approaches on a simple problem, such as a gridworld, to get a feel for the differences.


* If you have a non-deterministic environment, then an accurate transition model needs good estimates of the state transition probabilities. This requires sampling each transition many times, which can multiply the amount of experience you need before the model is reliable enough for DP, compared to a deterministic environment. If you have a deterministic environment with 1000 states and 10 actions in each state, then in theory 10,000 carefully chosen experiences are enough to provide an accurate model for DP. However, if state transitions are stochastic, then you may need 1,000 times that much experience, or even more, before your model of the probabilities is accurate enough to use with DP.
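As a rough sketch of what "collecting the probabilities" involves (the class layout and names below are mine, not from any standard library):

```python
from collections import Counter, defaultdict

class CountingModel:
    """Estimate p(s', r | s, a) from experience by counting outcomes.
    Sketch only: in a deterministic environment one visit per (s, a) pins the
    model down exactly, but with stochastic transitions each (s, a) needs many
    visits before the estimated probabilities are reliable enough for DP."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # (s, a) -> Counter of (r, s') outcomes

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][(r, s_next)] += 1

    def distribution(self, s, a):
        """Return a list of (probability, reward, next_state) for a DP backup."""
        outcomes = self.counts[(s, a)]
        total = sum(outcomes.values())
        return [(n / total, r, s_next) for (r, s_next), n in outcomes.items()]
```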

All model-based approaches are sensitive to the accuracy of the model, and the more emphasis you put on forward planning based on the model, the more sensitive they are to any errors. You can view Dynamic Programming approaches as highly sensitive to such errors, because they are designed to iterate until convergence, so the effects of errors can be magnified.

Neil Slater
  • Thank you Neil for your answer. What I want to say is: if we have enough experience, why do we just take n random samples from experience to update Q-values, and not solve the MDP constructed from the experience so far, in order to propagate the updated information to all states? – gnikol Sep 25 '19 at 09:22
  • @gnikol: This answer already covers that. "If you have a non-deterministic environment and/or a large state space, this may not be feasible in practice." I.e. your assumption "if we have enough experience" is the problem: unless the environment is simple enough to do this, or the time spent collecting experience is very large, it will not happen. – Neil Slater Sep 25 '19 at 09:26
  • Could you please explain why it is not feasible in the case of a non-deterministic environment? Furthermore, I cannot understand why DYNA-Q assumes a deterministic environment? – gnikol Sep 25 '19 at 09:32
  • @gnikol: I will extend the explanation for the first part in the answer. Dyna-Q does not assume a deterministic environment, but the pseudocode that you posted does. It says so in the code, where it updates the model; the assumption is in the model, because it stores a single (R, S') result for each (S, A) and does not attempt to track probabilities – Neil Slater Sep 25 '19 at 09:40
  • Thank you Neil. I got it. I asked that question because in my case the state space is small (120), the action space is small (5), and I have a large amount of historical data/experience which was generated from different policies without the application of reinforcement learning. So I want to apply reinforcement learning and get the best I can from the historical data. – gnikol Sep 25 '19 at 10:01
  • @gnikol: In your case I expect that offline Q learning (no need for Dyna-Q) or DP based on a model you constructed from the historical data should give roughly the same results. The main reason you might use Dyna-Q is in production if the environment was non-stationary, so your agent needs to continue to explore in case behaviour has changed and there is a new optimal. There is a variant called Dyna-Q+ which adds exploration terms specifically to cover that scenario. Otherwise, you should be able to optimise an agent offline and with such a small space an online planning algorithm is overkill – Neil Slater Sep 25 '19 at 10:09
  • If I apply offline Q-learning, how would I use the data I already have? By experience replay? – gnikol Sep 25 '19 at 10:19
  • @gnikol: Yes, if you want. "Experience replay" is more of an engineering implementation than a theory thing. Any routine that repeatedly uses your historical data and runs single-step Q-learning updates will be fine (there is a sketch of one such loop after this comment thread). You could also take the Dyna-Q code, fill the model and ignore steps (a) to (e), or take a simple Q-learning agent and repeatedly feed it the historical data in sequence (as if it were really happening) until the values start to converge. It doesn't really matter; what does matter is the resulting table of action values at the end – Neil Slater Sep 25 '19 at 10:28
  • Thank you @Neil for your help – gnikol Sep 25 '19 at 10:29
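For completeness, a minimal sketch of the offline Q-learning loop suggested in the comments above, assuming the historical data is available as a list of (state, action, reward, next_state, done) tuples; that data format and the `actions` argument are my assumptions, not something specified by the asker.

```python
from collections import defaultdict

def offline_q_learning(transitions, actions, n_epochs=100,
                       alpha=0.1, gamma=0.95):
    """Repeatedly replay historical (s, a, r, s', done) transitions with
    single-step Q-learning updates. Sketch only: `actions` is assumed to be
    the same small set of actions available in every state."""
    Q = defaultdict(float)
    for _ in range(n_epochs):
        for s, a, r, s_next, done in transitions:
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q   # greedy policy afterwards: pick argmax over Q[(s, a)] for each state
```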