
I found two different versions of the $\epsilon$-greedy policy, one for Monte Carlo and one for Q-learning:

For Monte Carlo: $\pi(a|s) = \epsilon/m + 1 - \epsilon$ for the best action and $\pi(a|s) = \epsilon/m$ for every other action.

For Q-learning: $\pi(a|s) = 1 - \epsilon$ for the best action, and with probability $\epsilon$ a uniformly random action is chosen from the possible actions.
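To make the comparison concrete, here is a minimal sketch of both rules as I understand them (Python; the function names are mine, and I am assuming the uniform draw in the second rule ranges over all actions, greedy included):

```python
import random

def epsilon_soft_probs(q_values, epsilon):
    """First rule (Monte Carlo slides): every action gets epsilon/m,
    and the greedy action additionally gets 1 - epsilon."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m
    probs[greedy] += 1 - epsilon
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Second rule (Q-learning): with probability 1 - epsilon take the
    greedy action, otherwise draw a uniformly random action (here
    assumed to range over all actions, including the greedy one)."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    if rng.random() < 1 - epsilon:
        return greedy
    return rng.randrange(m)
```

For example, with $m = 4$ and $\epsilon = 0.2$, the first rule gives the best action probability $0.85$ and every other action $0.05$.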

Both are stated as the epsilon-greedy policy. Are they different? (I think they are.) Am I missing something here, or do they really share the same name?

P/s: I am pretty sure they are different now; I am just a little confused about their names and what they mean in the two different methods (Monte Carlo and Q-learning).

abcd
  • What is $m$? Could you provide references where you found both definitions? – Tim May 17 '21 at 06:57
  • Here is slide 15 for the first one: http://web.eecs.utk.edu/~ielhanan/courses/ECE-517/notes/lecture9.pdf — $m$ is $|A(s)|$, the number of possible actions, as I understand it. – abcd May 17 '21 at 07:05
  • The second one is more popular and can be found at many reinforcement learning websites, for example: https://www.google.co.kr/amp/s/www.geeksforgeeks.org/epsilon-greedy-algorithm-in-reinforcement-learning/amp/ – abcd May 17 '21 at 07:07

1 Answer


The $\epsilon$-greedy algorithm takes the currently best action with probability $1-\epsilon$ and another action with probability $\epsilon$. The other rule you are describing is the $\epsilon$-soft algorithm (the linked slides mention it under this name); it is a different algorithm, hence it uses a different rule.
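For what it is worth, a quick simulation of this rule (a sketch; I assume the "other" draw is uniform over all actions, greedy included, which is one common reading — the helper name is mine):

```python
import random

def sample_action(greedy, m, epsilon, rng):
    """With probability 1 - epsilon take the greedy action,
    otherwise draw uniformly over all m actions."""
    if rng.random() < 1 - epsilon:
        return greedy
    return rng.randrange(m)

rng = random.Random(0)
m, epsilon, n = 4, 0.2, 200_000
counts = [0] * m
for _ in range(n):
    counts[sample_action(0, m, epsilon, rng)] += 1

# Empirically, the greedy action's frequency comes out close to
# 1 - epsilon + epsilon/m = 0.85 under this reading.
freq_greedy = counts[0] / n
```

Under this reading the greedy action ends up with total probability $1-\epsilon+\epsilon/m$ and every other action with $\epsilon/m$, which is part of why the naming gets confusing; if the random draw instead excludes the greedy action, the two distributions genuinely differ.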

Tim
  • Here the first one (in the slide) is also specified as the epsilon-greedy policy: https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a – abcd May 17 '21 at 07:22
  • As I found here, the term "epsilon-soft policy" only means that the least probability of choosing any action is $\epsilon/|A(s)|$: https://stats.stackexchange.com/questions/342379/what-are-soft-policies-in-reinforcement-learning – abcd May 17 '21 at 07:23
  • @abcd the linked medium post mentions "epsilon greedy policie**s**" and calls the policy "soft" (bolded in post). The $\epsilon$-greedy algorithm is just what I described, though as you learned from multiple sources, there are multiple modifications of this algorithm. The point of $\epsilon$-greedy algorithm is that there is a constant probability for choosing between exploration vs exploitation. – Tim May 17 '21 at 07:31
  • Yeah, it actually is a game of names :) And I agree that the second one "seems" better (it is widely used in Q-learning), but I don't know why the first one (i.e., soft) is still used in Monte Carlo (as in the medium link). – abcd May 17 '21 at 07:34
  • @abcd each of those variants was designed to solve a particular problem. If you are considering a particular algorithm for your problem, you need to go through the literature & probably benchmark it against some simpler "default" solution first. – Tim May 17 '21 at 07:46