I am trying to implement a deep Q-learning solution for the OpenAI Gym CartPole problem. My solution doesn't work: the network takes forever to learn anything useful. I compared my solution with others online and noticed a difference in how the network is set up. I want to understand why my approach is wrong.
Approach A (not working; sketched in code below):
1. Input layer: 5 inputs (the 4 state variables plus the 1 action variable)
2. Hidden layer: 20 neurons
3. Output layer: 1 output (the value of q(s,a))
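Here is a minimal Keras sketch of what I mean by Approach A (the ReLU activation, Adam optimizer, and MSE loss are placeholder choices of mine, not something the comparison depends on):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Approach A: the network estimates q(s, a) directly, so the input is
# the 4-dimensional state concatenated with the (scalar) action.
model_a = keras.Sequential([
    layers.Dense(20, activation="relu", input_shape=(5,)),  # hidden layer
    layers.Dense(1, activation="linear"),                   # single Q-value
])
model_a.compile(optimizer="adam", loss="mse")

def greedy_action_a(state, n_actions=2):
    # Picking the greedy action needs one forward pass per candidate action.
    q_values = [model_a.predict(np.append(state, a)[None, :], verbose=0)[0, 0]
                for a in range(n_actions)]
    return int(np.argmax(q_values))
```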
Approach B (working; sketched in code below):
1. Input layer: 4 inputs (the 4 state variables)
2. Hidden layer: 20 neurons
3. Output layer: 2 outputs (one per action)
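And the corresponding sketch of Approach B, under the same placeholder hyperparameters:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Approach B: the network maps a state to one Q-value per action.
model_b = keras.Sequential([
    layers.Dense(20, activation="relu", input_shape=(4,)),  # hidden layer
    layers.Dense(2, activation="linear"),                   # Q(s, left), Q(s, right)
])
model_b.compile(optimizer="adam", loss="mse")

def greedy_action_b(state):
    # The greedy action falls out of a single forward pass.
    q_values = model_b.predict(state[None, :], verbose=0)[0]  # shape (2,)
    return int(np.argmax(q_values))
```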
I thought the NN should be an estimator for q(s,a), in which case only one output is required. Yet most online solutions use two outputs, one per action, and I don't understand why. To make the difference concrete, I sketch the TD target computation for both approaches below.
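This is how I understand the TD target r + gamma * max_a' q(s', a') would be computed in each case, reusing model_a and model_b from the sketches above (gamma and the dummy transition are placeholders, and terminal-state handling is omitted):

```python
import numpy as np

gamma = 0.95                              # discount factor (placeholder value)
r = 1.0                                   # CartPole's per-step reward
s_next = np.zeros(4, dtype=np.float32)    # dummy next state

# Approach A: one forward pass per candidate action of s_next
q_next_a = [model_a.predict(np.append(s_next, a)[None, :], verbose=0)[0, 0]
            for a in (0, 1)]
target_a = r + gamma * max(q_next_a)

# Approach B: a single forward pass returns both Q-values at once
target_b = r + gamma * model_b.predict(s_next[None, :], verbose=0)[0].max()
```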
EDIT:
Some further information can be found here:
Questions about Q-Learning using Neural Networks