
I found two different versions of the $\epsilon$-greedy policy, one for Monte Carlo and one for Q-learning:

For Monte Carlo: $\pi(a|s) = \epsilon/m + 1 - \epsilon$ for the best action and $\pi(a|s) = \epsilon/m$ for every other action.

For Q-learning: $\pi(a|s) = 1 - \epsilon$ for the best action, and with probability $\epsilon$ a uniformly random action is chosen from the possible actions.
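To make the comparison concrete, here is a minimal sketch of both rules as I understand them (Python; the function names are mine, and I am assuming the uniform draw in the second rule ranges over all actions, greedy included):

```python
import random

def epsilon_soft_probs(q_values, epsilon):
    """First rule (Monte Carlo slides): every action gets epsilon/m,
    and the greedy action additionally gets 1 - epsilon."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m
    probs[greedy] += 1 - epsilon
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Second rule (Q-learning): with probability 1 - epsilon take the
    greedy action, otherwise draw a uniformly random action (here
    assumed to range over all actions, including the greedy one)."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    if rng.random() < 1 - epsilon:
        return greedy
    return rng.randrange(m)
```

For example, with $m = 4$ and $\epsilon = 0.2$, the first rule gives the best action probability $0.85$ and every other action $0.05$.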

Both are stated as the epsilon-greedy policy. Are they different? (I think they are.) Am I missing something here, or do they really share the same name?

P/s: I am pretty sure they are different now; I am just a little confused about their names and what they mean in the two different methods (Monte Carlo and Q-learning).

abcd
  • What is $m$? Could you provide references where you found both definitions? – Tim May 17 '21 at 06:57
  • Here is slide 15 for the first one: http://web.eecs.utk.edu/~ielhanan/courses/ECE-517/notes/lecture9.pdf — $m$ is $|A(s)|$, the number of possible actions, as I understand it. – abcd May 17 '21 at 07:05
  • The second one is more popular and can be found at many reinforcement learning websites, for example: https://www.google.co.kr/amp/s/www.geeksforgeeks.org/epsilon-greedy-algorithm-in-reinforcement-learning/amp/ – abcd May 17 '21 at 07:07

1 Answer


The $\epsilon$-greedy algorithm takes the currently best action with probability $1-\epsilon$ and another action with probability $\epsilon$. The other rule you are describing is the $\epsilon$-soft algorithm (the linked slides mention it under this name); it is a different algorithm, hence it uses a different rule.
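For what it is worth, a quick simulation of this rule (a sketch; I assume the "other" draw is uniform over all actions, greedy included, which is one common reading — the helper name is mine):

```python
import random

def sample_action(greedy, m, epsilon, rng):
    """With probability 1 - epsilon take the greedy action,
    otherwise draw uniformly over all m actions."""
    if rng.random() < 1 - epsilon:
        return greedy
    return rng.randrange(m)

rng = random.Random(0)
m, epsilon, n = 4, 0.2, 200_000
counts = [0] * m
for _ in range(n):
    counts[sample_action(0, m, epsilon, rng)] += 1

# Empirically, the greedy action's frequency comes out close to
# 1 - epsilon + epsilon/m = 0.85 under this reading.
freq_greedy = counts[0] / n
```

Under this reading the greedy action ends up with total probability $1-\epsilon+\epsilon/m$ and every other action with $\epsilon/m$, which is part of why the naming gets confusing; if the random draw instead excludes the greedy action, the two distributions genuinely differ.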

Tim
  • Here the first one (in the slide) is also specified as the epsilon-greedy policy: https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a – abcd May 17 '21 at 07:22
  • As I found here, the term "epsilon-soft policy" only means that the least probability of choosing any action is $\epsilon/|A(s)|$: https://stats.stackexchange.com/questions/342379/what-are-soft-policies-in-reinforcement-learning – abcd May 17 '21 at 07:23
  • @abcd the linked medium post mentions "epsilon greedy policie**s**" and calls the policy "soft" (bolded in post). The $\epsilon$-greedy algorithm is just what I described, though as you learned from multiple sources, there are multiple modifications of this algorithm. The point of $\epsilon$-greedy algorithm is that there is a constant probability for choosing between exploration vs exploitation. – Tim May 17 '21 at 07:31
  • Yeah, it actually is a game of names :) And I agree that the second one "seems" better (it is widely used in Q-learning), but I don't know why the first one (i.e., soft) is still used in Monte Carlo (as in the medium link). – abcd May 17 '21 at 07:34
  • @abcd each of those variants was designed to solve a particular problem. If you are considering a particular algorithm for your problem, you need to go through the literature & probably benchmark it against some simpler "default" solution first. – Tim May 17 '21 at 07:46