I have always seen the greedy policy defined as the one that maximizes the action-value function
$q_{\pi}(s,a)$ over the actions $a$.
How is this equivalent to the following definition, which I found in my professor's lecture notes?
The greedy policy is equal to 1 if $a = \arg\max_{a'} q_{\pi}(s,a')$ holds, and zero otherwise.

How are they different?! – Arya McCarthy Jun 12 '21 at 14:10
2 Answers
Your professor's notes are a more general and formal way of expressing exactly the same idea as your first sentence.
One possible difference is that you may be thinking in terms of a deterministic policy:
$$\pi(s): \mathcal{S} \rightarrow \mathcal{A}$$
Whilst your professor is expressing the function assuming a more general stochastic form of the policy function:
$$\pi(a|s): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}, \qquad \pi(a|s) = \mathbf{Pr}\{A_t=a|S_t=s \}$$
To match your definition, you can declare the greedy policy function like this:
$$\pi(s) = \text{argmax}_a q_{\pi}(s,a)$$
Your professor's version is identical, except it is expressed in terms of probabilities:
$$\pi(a| s) = \begin{cases} 1,& \text{if } a = \text{argmax}_{a'} q_{\pi}(s,a')\\ 0, & \text{otherwise} \end{cases} $$
The first case exactly matches your definition, and it is guaranteed to happen, because no other action is assigned any probability. It is a way of expressing a deterministic policy whilst fitting the function signature of a stochastic one.
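As a rough illustration of the equivalence above, here is a minimal Python sketch of both forms for a single state. The list `q_values` and the function names `greedy_action` and `greedy_probs` are hypothetical, chosen only for this example, and the sketch assumes the maximizing action is unique.

```python
# Hypothetical action values q(s, a) for one state with three actions.
q_values = [1.0, 3.5, 2.0]

def greedy_action(q):
    """Deterministic form: pi(s) = argmax_a q(s, a)."""
    return max(range(len(q)), key=lambda a: q[a])

def greedy_probs(q):
    """Stochastic form: pi(a|s) is 1 for the greedy action and 0 otherwise."""
    best = greedy_action(q)
    return [1.0 if a == best else 0.0 for a in range(len(q))]

action = greedy_action(q_values)
probs = greedy_probs(q_values)
print(action)  # 1
print(probs)   # [0.0, 1.0, 0.0]
```

Sampling an action from `probs` always returns `action`, which is why the two definitions describe the same policy.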

It's the same because $$q_\pi(s, \text{argmax}_{a'} q_\pi(s, a')) = \max_a q_\pi(s, a)$$ That is, the argmax is the action that achieves the maximum value.
But there is one thing to consider: in general, more than one action can maximize the q-value. Because of that, the argmax is defined as a set: $$a^* \in \text{argmax}_{a} v(a) \Leftrightarrow v(a^*)=\max_{a} v(a)$$
This makes your definition of the greedy policy problematic when several actions tie: assigning probability 1 to every maximizing action would violate the requirement that the probabilities of all actions in a state sum to one. $$\sum_{a} \pi(a|s) = 1, \ \ \pi(a|s) \in [0,1]$$
One possible solution is to define the greedy policy as follows: $$\pi(a|s) = \begin{cases} \frac{1}{|\text{argmax}_{a'} q_\pi(s,a')|}, & \text{if } a \in \text{argmax}_{a'} q_\pi(s,a') \\ 0, & \text{otherwise} \end{cases}$$
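The tie-aware definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `greedy_policy_with_ties`, the example values in `tied_q`, and the floating-point tolerance `tol` are all hypothetical and do not come from the formula itself.

```python
def greedy_policy_with_ties(q, tol=1e-12):
    """Spread probability uniformly over all actions that maximize q(s, a)."""
    best = max(q)
    # The argmax treated as a set: every action whose value equals the maximum
    # (up to the assumed tolerance `tol`).
    maximizers = [a for a, v in enumerate(q) if abs(v - best) <= tol]
    p = 1.0 / len(maximizers)
    return [p if a in maximizers else 0.0 for a in range(len(q))]

# Hypothetical action values where actions 1 and 2 tie for the maximum.
tied_q = [2.0, 5.0, 5.0, 1.0]
tied_probs = greedy_policy_with_ties(tied_q)
print(tied_probs)  # [0.0, 0.5, 0.5, 0.0]
```

When the maximizer is unique, the set has one element and the result reduces to the one-hot form with probability 1 on the greedy action.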
