Hello, I'm studying Q-learning and SARSA with ε-greedy and softmax action-selection strategies, and I have a question about my reading. It says that SARSA with ε-greedy causes value-function oscillations in the case of stochastic policies. But I thought that only the softmax strategy gives a stochastic policy, not the ε-greedy strategy. Maybe the reading means that, because SARSA is on-policy, Q(s', a') oscillates, but I don't understand how that implies SARSA can have a stochastic policy. Please explain what I'm missing.
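For concreteness, here is a minimal Python sketch of the pieces the question refers to, assuming the usual textbook definitions. All function names are hypothetical and chosen only for illustration. It computes the action probabilities under ε-greedy and softmax (Boltzmann) selection, and shows the SARSA target, which uses the next action a' actually sampled from the policy, next to the Q-learning target, which uses the max over next actions.

```python
import math
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under ε-greedy (ties go to the first maximal action)."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # every action gets ε/n exploration mass
    probs[best] += 1.0 - epsilon       # the greedy action gets the remaining mass
    return probs


def softmax_probs(q_values, temperature):
    """Boltzmann (softmax) action probabilities with a temperature parameter."""
    m = max(q_values)                  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy SARSA: the target uses a_next, the action sampled in s_next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy Q-learning: the target uses the max over actions in s_next."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


if __name__ == "__main__":
    q = [1.0, 0.5, 0.2]
    print("ε-greedy:", epsilon_greedy_probs(q, epsilon=0.1))
    print("softmax: ", softmax_probs(q, temperature=1.0))

    # Same transition, two different sampled next actions: the SARSA target differs,
    # while the Q-learning target does not depend on which a' was sampled.
    for a_next in (0, 1):
        Q = {0: [0.0, 0.0], 1: [1.0, 0.0]}
        print("SARSA with a' =", a_next, "->", sarsa_update(Q, 0, 0, 0.0, 1, a_next, 0.5, 0.9))
    Q = {0: [0.0, 0.0], 1: [1.0, 0.0]}
    print("Q-learning ->", q_learning_update(Q, 0, 0, 0.0, 1, 0.5, 0.9))
```

With ε = 0.1 and three actions, every action keeps a nonzero probability under ε-greedy, just as under softmax, and the SARSA update for the same transition changes depending on which a' was drawn.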