I'm working on a reinforcement learning problem. The simulation environment is simple (a maze-style problem), so I can work out its optimal policy by hand. My idea is: since I already know the optimal policy, is it possible to use that policy directly to train the agent, without using Q-learning or policy gradients at all?
This now looks like a typical imitation learning problem to me, so instead of doing actual reinforcement learning, I could just do supervised learning. I'm not familiar with this field, so I have a few questions on the details:
- In that case, what should my agent try to learn from the optimal policy? Should it learn the value function, the advantage function, or some regret measure, with the state as input? Or should it directly learn the action for a given state?
- Can someone point me to some classical papers or tutorials on imitation learning?
- Does this idea also apply to a stochastic environment?
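For context on what I mean by "directly learn the action with given state": this is a minimal sketch of that option (often called behavioral cloning), under purely illustrative assumptions I made up — a 5x5 grid with the goal at (4,4), a hand-coded expert that moves right until the last column and then down, and a tiny softmax classifier trained with plain gradient descent. It is not a prescribed setup, just the shape of the idea.

```python
import numpy as np

def expert_action(x, y):
    """Hand-worked optimal policy (assumed): 0 = move right, 1 = move down."""
    return 0 if x < 4 else 1

# One expert demonstration: walk from (0, 0) to the goal at (4, 4),
# recording (state, action) pairs -- the supervised dataset.
trajectory = []
x, y = 0, 0
while (x, y) != (4, 4):
    a = expert_action(x, y)
    trajectory.append(((x, y), a))
    x, y = (x + 1, y) if a == 0 else (x, y + 1)

X = np.array([s for s, _ in trajectory], dtype=float) / 4.0  # normalize to [0, 1]
actions = np.array([a for _, a in trajectory])

W = np.zeros((2, 2))  # one weight row per action
b = np.zeros(2)

# Plain gradient descent on the softmax cross-entropy loss.
for _ in range(2000):
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs - np.eye(2)[actions]            # d(loss)/d(logits)
    W -= 0.5 * (grad.T @ X) / len(X)
    b -= 0.5 * grad.mean(axis=0)

pred = (X @ W.T + b).argmax(axis=1)
accuracy = (pred == actions).mean()              # sanity check on training data
```

Here the classifier is only checked on the states it was trained on, which dodges the real difficulty: at test time the learned policy can drift into states the expert never visited, which is presumably where the stochastic-environment question matters.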