I'm working on a reinforcement learning problem. The simulation environment is simple (a maze-style problem), so I can work out its optimal policy by hand. My idea is: since I already know the optimal policy, is it possible to use that policy directly to train the agent, without using Q-learning or policy gradients at all?
This now looks like a typical imitation learning problem to me, so instead of doing actual reinforcement learning, I could just do supervised learning. I'm not familiar with this field, so I have a few questions on the details:
- In that case, what should my agent try to learn from the optimal policy? Should it learn the value function, the advantage function, or some regret measure, with the state as input? Or should it directly learn the action for a given state?
- Can someone point me to some classical papers or tutorials on imitation learning?
- Does this idea also apply to a stochastic environment?
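For context on what I mean by "directly learn the action with given state": this is a minimal sketch of that option (often called behavioral cloning), under purely illustrative assumptions I made up — a 5x5 grid with the goal at (4,4), a hand-coded expert that moves right until the last column and then down, and a tiny softmax classifier trained with plain gradient descent. It is not a prescribed setup, just the shape of the idea.

```python
import numpy as np

def expert_action(x, y):
    """Hand-worked optimal policy (assumed): 0 = move right, 1 = move down."""
    return 0 if x < 4 else 1

# One expert demonstration: walk from (0, 0) to the goal at (4, 4),
# recording (state, action) pairs -- the supervised dataset.
trajectory = []
x, y = 0, 0
while (x, y) != (4, 4):
    a = expert_action(x, y)
    trajectory.append(((x, y), a))
    x, y = (x + 1, y) if a == 0 else (x, y + 1)

X = np.array([s for s, _ in trajectory], dtype=float) / 4.0  # normalize to [0, 1]
actions = np.array([a for _, a in trajectory])

W = np.zeros((2, 2))  # one weight row per action
b = np.zeros(2)

# Plain gradient descent on the softmax cross-entropy loss.
for _ in range(2000):
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs - np.eye(2)[actions]            # d(loss)/d(logits)
    W -= 0.5 * (grad.T @ X) / len(X)
    b -= 0.5 * grad.mean(axis=0)

pred = (X @ W.T + b).argmax(axis=1)
accuracy = (pred == actions).mean()              # sanity check on training data
```

Here the classifier is only checked on the states it was trained on, which dodges the real difficulty: at test time the learned policy can drift into states the expert never visited, which is presumably where the stochastic-environment question matters.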