I'm trying to learn more about reinforcement learning, and I've devised a very simple game as a thought experiment. The game consists of a single turn where the agent plays one of three possible cards. The first card, $c_0$, has a payoff of 1; the second, $c_1$, has a payoff of 1/2; and the last, $c_2$, has a payoff of 0. Of course, the agent doesn't know this ahead of time, so its job is to play the game repeatedly in order to optimize its policy. The policy can be represented with two parameters, $\theta_0$ and $\theta_1$, which are the probabilities that the agent plays $c_0$ and $c_1$, respectively. The probability of playing $c_2$ is just $1 - \theta_0 - \theta_1$.
The expected value of a given policy is $$ E[\pi_\theta] = \sum_{i=0}^{2} P(c_i|\theta_i)(1-i/2) = \sum_{i=0}^{2} \theta_i(1-i/2) = \theta_0 + \theta_1 / 2 $$
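As a sanity check, the uniform policy $\theta_0 = \theta_1 = \theta_2 = 1/3$ has an expected value of $1/3 + 1/6 = 1/2$, while the best possible policy has an expected value of 1.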
It's clear that the optimal policy is to always play $c_0$, i.e. $\theta_0 = 1$ and $\theta_1 = \theta_2 = 0$. However, the agent doesn't know the payoffs ahead of time, nor does it know that they're constants. It has to learn through trial and error.
In order to optimize the agent's performance, I thought the next step was to take the gradient of the expected value with respect to each $\theta_i$ and iteratively update the parameters.
$$ \frac{\partial E[\pi_\theta]}{\partial \theta_0} = 1, \quad \frac{\partial E[\pi_\theta]}{\partial \theta_1} = 1/2, \quad \frac{\partial E[\pi_\theta]}{\partial \theta_2} = 0 $$
I initialize the agent with all $\theta_i = 1/3$, let the agent make a move, update the weight of whichever card it chose by adding the corresponding partial derivative times a small learning rate, and finally renormalize the weights so that they sum to 1. After many iterations, I find that the weights converge to $$ \theta_0 = 2/3, \theta_1 = 1/3, \theta_2 = 0 $$ rather than the optimal policy $$ \theta_0 = 1, \theta_1 = 0, \theta_2 = 0. $$
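In code, the update loop looks roughly like the following Python sketch (the learning rate, number of iterations, and random seed are arbitrary placeholders, not the exact values I used):

```python
import numpy as np

rng = np.random.default_rng(0)

grad = np.array([1.0, 0.5, 0.0])   # analytic partial derivatives from above
theta = np.full(3, 1.0 / 3.0)      # start from the uniform policy
lr = 0.01                          # small learning rate (placeholder)

for _ in range(100_000):
    # The agent samples a card according to its current policy.
    card = rng.choice(3, p=theta)
    # Update only the chosen card's weight by its partial derivative
    # times the learning rate...
    theta[card] += lr * grad[card]
    # ...then renormalize so the weights sum to 1 again.
    theta /= theta.sum()

print(theta)  # final policy weights
```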
Is there something wrong with this approach? I can't tell whether the mistake is in the theory or in how I'm iteratively updating the thetas.