I am trying to implement a deep Q-learning solution for the OpenAI Gym CartPole problem. My solution doesn't work: the network takes forever to learn anything useful. I compared my solution with others online and noticed a difference in how the network is set up. I want to understand why my approach is wrong.
Approach A (not working; sketched in code below):
1. Input layer: 5 inputs (the 4 state variables plus the 1 action variable)
2. Hidden layer: 20 neurons
3. Output layer: 1 output (the value of q(s,a))
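Here is a minimal Keras sketch of what I mean by Approach A (the ReLU activation, Adam optimizer, and MSE loss are placeholder choices of mine, not something the comparison depends on):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Approach A: the network estimates q(s, a) directly, so the input is
# the 4-dimensional state concatenated with the (scalar) action.
model_a = keras.Sequential([
    layers.Dense(20, activation="relu", input_shape=(5,)),  # hidden layer
    layers.Dense(1, activation="linear"),                   # single Q-value
])
model_a.compile(optimizer="adam", loss="mse")

def greedy_action_a(state, n_actions=2):
    # Picking the greedy action needs one forward pass per candidate action.
    q_values = [model_a.predict(np.append(state, a)[None, :], verbose=0)[0, 0]
                for a in range(n_actions)]
    return int(np.argmax(q_values))
```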
Approach B (working; sketched in code below):
1. Input layer: 4 inputs (the 4 state variables)
2. Hidden layer: 20 neurons
3. Output layer: 2 outputs (one per action)
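And the corresponding sketch of Approach B, under the same placeholder hyperparameters:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Approach B: the network maps a state to one Q-value per action.
model_b = keras.Sequential([
    layers.Dense(20, activation="relu", input_shape=(4,)),  # hidden layer
    layers.Dense(2, activation="linear"),                   # Q(s, left), Q(s, right)
])
model_b.compile(optimizer="adam", loss="mse")

def greedy_action_b(state):
    # The greedy action falls out of a single forward pass.
    q_values = model_b.predict(state[None, :], verbose=0)[0]  # shape (2,)
    return int(np.argmax(q_values))
```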
I thought the NN should be an estimator for q(s,a), in which case only one output is required. Yet most online solutions use two outputs, one per action, and I don't understand why. To make the difference concrete, I sketch the TD target computation for both approaches below.
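This is how I understand the TD target r + gamma * max_a' q(s', a') would be computed in each case, reusing model_a and model_b from the sketches above (gamma and the dummy transition are placeholders, and terminal-state handling is omitted):

```python
import numpy as np

gamma = 0.95                              # discount factor (placeholder value)
r = 1.0                                   # CartPole's per-step reward
s_next = np.zeros(4, dtype=np.float32)    # dummy next state

# Approach A: one forward pass per candidate action of s_next
q_next_a = [model_a.predict(np.append(s_next, a)[None, :], verbose=0)[0, 0]
            for a in (0, 1)]
target_a = r + gamma * max(q_next_a)

# Approach B: a single forward pass returns both Q-values at once
target_b = r + gamma * model_b.predict(s_next[None, :], verbose=0)[0].max()
```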
EDIT:
Some further information can be found here:
Questions about Q-Learning using Neural Networks