I am trying out deep SARSA reinforcement learning on the OpenAI Gym CartPole-v0 problem. The state has 4 continuous features and the action is binary, either 0 or 1. The state-action vector is fed to a neural network that outputs the state-action value, and the action with the highest value is then selected, following SARSA.
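Roughly, the setup looks like this (a simplified sketch, not my exact code; the class and function names are just for illustration):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q(s, a) network: 4 state features + 1 action bit -> scalar value."""
    def __init__(self, hidden=(128, 256, 128)):  # the layer widths I vary
        super().__init__()
        layers, in_dim = [], 5  # 4 state features + 1 action
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))  # scalar state-action value
        self.net = nn.Sequential(*layers)

    def forward(self, state, action):
        # Concatenate the state with the action to form the input vector.
        x = torch.cat([state, action.float().unsqueeze(-1)], dim=-1)
        return self.net(x)

def greedy_action(qnet, state):
    # Evaluate Q(s, 0) and Q(s, 1) and pick the larger one.
    qs = torch.stack([qnet(state, torch.tensor(a)) for a in (0, 1)])
    return int(qs.argmax())
```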
When the network has shape 128-256-128, I could achieve up to 100 points, although the score was quite volatile and mostly hovered around 30. However, if I choose 128-256-256-128, the network does not learn at all and always chooses the same action, even after I have trained it for 300 episodes.
So my question is: is this expected behavior in reinforcement learning? Is it really that sensitive to the network architecture, or have I made some mistake in my implementation?