I'm trying to learn more about reinforcement learning, and I've devised a very simple game as a thought experiment. The game consists of a single turn where the agent plays one of three possible cards. The first card, $c_0$, has a payoff of 1; the second, $c_1$, has a payoff of 1/2; and the last, $c_2$, has a payoff of 0. Of course, the agent doesn't know this ahead of time, so its job is to play the game repeatedly in order to optimize its policy. The policy can be represented with two parameters, $\theta_0$ and $\theta_1$, which are the probabilities that the agent plays $c_0$ and $c_1$, respectively. The probability of playing $c_2$ is just $1 - \theta_0 - \theta_1$.
The expected value of a given policy is $$ E[\pi_\theta] = \sum_{i=0}^{2} P(c_i|\theta_i)(1-i/2) = \sum_{i=0}^{2} \theta_i(1-i/2) = \theta_0 + \theta_1 / 2 $$
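As a sanity check, the uniform policy $\theta_0 = \theta_1 = \theta_2 = 1/3$ has an expected value of $1/3 + 1/6 = 1/2$, while the best possible policy has an expected value of 1.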
It's clear that the optimal policy is to always play $c_0$, i.e. $\theta_0 = 1$ and $\theta_1 = \theta_2 = 0$. However, the agent doesn't know the payoffs ahead of time, nor does it know that they're constants. It has to learn through trial and error.
In order to optimize the agent's performance, I thought the next step was to take the gradient of the expected value with respect to each $\theta_i$ and iteratively update the parameters.
$$ \frac{\partial E[\pi_\theta]}{\partial \theta_0} = 1, \quad \frac{\partial E[\pi_\theta]}{\partial \theta_1} = 1/2, \quad \frac{\partial E[\pi_\theta]}{\partial \theta_2} = 0 $$
I initialize the agent with all $\theta_i = 1/3$, let the agent make a move, update the weight of whichever card it chose by adding the corresponding partial derivative times a small learning rate, and finally renormalize the weights so that they sum to 1. After many iterations, I find that the weights converge to $$ \theta_0 = 2/3, \theta_1 = 1/3, \theta_2 = 0 $$ rather than the optimal policy $$ \theta_0 = 1, \theta_1 = 0, \theta_2 = 0. $$
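In code, the update loop looks roughly like the following Python sketch (the learning rate, number of iterations, and random seed are arbitrary placeholders, not the exact values I used):

```python
import numpy as np

rng = np.random.default_rng(0)

grad = np.array([1.0, 0.5, 0.0])   # analytic partial derivatives from above
theta = np.full(3, 1.0 / 3.0)      # start from the uniform policy
lr = 0.01                          # small learning rate (placeholder)

for _ in range(100_000):
    # The agent samples a card according to its current policy.
    card = rng.choice(3, p=theta)
    # Update only the chosen card's weight by its partial derivative
    # times the learning rate...
    theta[card] += lr * grad[card]
    # ...then renormalize so the weights sum to 1 again.
    theta /= theta.sum()

print(theta)  # final policy weights
```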
Is there something wrong with this approach? I can't tell whether the mistake is in the theory or in how I'm iteratively updating the thetas.