Greedy policy definition

Question

I've always seen the greedy policy defined as the policy that picks the action maximizing the action-value function $q_{\pi}(s,a)$ over the actions $a$.
How is this equivalent to the following definition, which I found in my professor's lecture notes?
The greedy policy $\pi(a|s)$ is equal to 1 if $a = \arg\max_{a'} q_{\pi}(s,a')$ holds, and zero otherwise.

How are they different?! – Arya McCarthy, Jun 12 '21 at 14:10

Neil Slater · Answer 1 · 2021-06-15T10:41:47.103

Your professor's notes are a more general and formal way of expressing exactly the same idea as your first sentence.

One possible difference is that you may be thinking in terms of a deterministic policy:

$$\pi(s): \mathcal{S} \rightarrow \mathcal{A}$$

Whilst your professor is expressing the function assuming a more general stochastic form of the policy function:

$$\pi(a|s): \mathcal{S} \times \mathcal{A} \rightarrow [0,1], \qquad \pi(a|s) = \mathbf{Pr}\{A_t=a|S_t=s \}$$

To match your definition, you can declare the greedy policy function like this:

$$\pi(s) = \text{argmax}_a q_{\pi}(s,a)$$

Your professor's version is identical, except it is expressed in terms of probabilities:

$$\pi(a| s) = \begin{cases} 1,& \text{if } a = \text{argmax}_{a'} q_{\pi}(s,a')\\ 0, & \text{otherwise} \end{cases} $$

The first case exactly matches your definition, and it is guaranteed to happen, because no other action is assigned any probability. It is a way of expressing a deterministic policy whilst fitting the function signature of a stochastic one.
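To make the equivalence concrete, here is a minimal Python sketch of both forms for a single state. The function names and the example action values in `q_s` are hypothetical, chosen only for illustration, and the sketch assumes a small discrete action set with a unique maximizing action.

```python
def greedy_action(q_values):
    """Deterministic form pi(s): return the index of the maximizing action."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def greedy_probabilities(q_values):
    """Stochastic form pi(a|s): probability 1 on the greedy action, 0 elsewhere."""
    best = greedy_action(q_values)
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


# Hypothetical action values q(s, a) for one state with three actions.
q_s = [0.2, 1.5, -0.3]

print(greedy_action(q_s))         # 1
print(greedy_probabilities(q_s))  # [0.0, 1.0, 0.0]
```

Sampling an action from the second form always returns the same action as the first form, which is the sense in which the two definitions agree.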

score 0 · Answer 2 · answered Jun 26 '21 at 08:18

It's the same because $$q_\pi(s, \text{argmax}_{a'} q_\pi(s, a')) = \max_{a'} q_\pi(s, a')$$ That is, the argmax is the action that achieves the maximum value.

But there is one thing to consider: in general, there could be more than one action maximizing the q-value. Because of that, the argmax is defined as a set: $$a^* \in \text{argmax}_{a} v(a) \Leftrightarrow v(a^*)=\max_{a} v(a)$$

This makes your definition of the greedy policy problematic when there are ties: assigning probability 1 to every maximizing action would break the requirement that the probabilities of all actions in a state sum to one. $$\sum_{a} \pi(a|s) = 1, \ \ \pi(a|s) \in [0,1]$$

One possible solution is to split the probability uniformly among the maximizing actions: $$\pi(a|s)=\begin{cases} \frac{1}{|\text{argmax}_{a'} q_\pi(s,a')|}, & \text{if } a \in \text{argmax}_{a'} q_\pi(s,a')\\ 0, & \text{otherwise} \end{cases}$$
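As a rough illustration of this tie-splitting definition, the following Python sketch treats the argmax as a set and shares the probability equally among its members. The function name, the example values in `q_s`, and the tolerance `tol` used to compare floating-point q-values are assumptions made for this example, not part of the definition above.

```python
def greedy_policy_with_ties(q_values, tol=1e-12):
    """Return pi(a|s) with probability split uniformly over all maximizing actions."""
    best_value = max(q_values)
    # The argmax as a set: every action whose value equals the maximum
    # (up to the assumed tolerance `tol`).
    maximizers = [a for a, q in enumerate(q_values) if abs(q - best_value) <= tol]
    share = 1.0 / len(maximizers)
    return [share if a in maximizers else 0.0 for a in range(len(q_values))]


# Hypothetical action values with a tie between actions 0 and 2.
q_s = [1.0, 0.5, 1.0, -2.0]
probs = greedy_policy_with_ties(q_s)

print(probs)       # [0.5, 0.0, 0.5, 0.0]
print(sum(probs))  # 1.0
```

When the maximizing action is unique, the set has one element and this reduces to the 1-or-0 definition from the lecture notes.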
