
I am trying to train a Deep Q Network (https://deepmind.com/research/dqn/) for a simple control task. The agent starts in the middle of a 1-dimensional line, at state 0.5. On each step, the agent can move left, move right, or stay put. The agent then receives a reward proportional to the absolute distance from its starting state. If the agent reaches state 0.35 or state 0.65, then the episode ends. So, effectively the agent is encouraged to move all the way to the left, or all the way to the right.
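For concreteness, the environment behaves roughly like this (a simplified sketch rather than my exact code; the step size here is just illustrative):

```python
class LineEnv:
    """Sketch of the 1-D line task described above."""

    def __init__(self, step_size=0.01):  # step size is illustrative
        self.step_size = step_size
        self.state = 0.5

    def reset(self):
        self.state = 0.5
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = stay put, 2 = move right
        self.state += (action - 1) * self.step_size
        reward = abs(self.state - 0.5)                    # proportional to distance from the start
        done = self.state <= 0.35 or self.state >= 0.65   # episode ends at either boundary
        return self.state, reward, done
```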

For now, I am just trying to train a deep neural network in Keras / TensorFlow to predict the q-values of the three actions in any given state. At each time step, actions are chosen randomly, so I am only checking whether the network can predict the correct q-values; I am not yet trying to get the agent to actually solve the task. However, my network is struggling to learn the correct function.

In this first figure below, the dots represent the training data available to the network on each weight update. These are stored in the experience replay buffer in the standard q-learning algorithm. As can be seen, states further from the centre have higher q-values, as would be expected. Also, when the state is greater than 0.5, higher q-values are assigned to the "move right" action, as would be expected, and vice versa. So, it seems that the training data is valid.

[Figure 1: training data from the replay buffer; x-axis is the state, y-axis is the Bellman-target q-value for each action]
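For reference, each dot's y-value is a one-step Bellman target. Roughly, they are computed like this (a simplified sketch, not my actual code; `target_model` is the separate target network mentioned further down, `gamma` is an illustrative discount factor, and the variable names are illustrative):

```python
import numpy as np

gamma = 0.99  # illustrative discount factor

def compute_targets(model, target_model, batch):
    """batch: list of (state, action, reward, next_state, done) transitions."""
    states = np.array([[s] for s, a, r, s2, d in batch])
    next_states = np.array([[s2] for s, a, r, s2, d in batch])

    targets = model.predict(states)              # start from the current predictions
    next_q = target_model.predict(next_states)   # next-state q-values from the target network

    for i, (s, a, r, s2, d) in enumerate(batch):
        # r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal states
        targets[i, a] = r if d else r + gamma * np.max(next_q[i])
    return states, targets
```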

In the next figure below, the predicted q-values from my network are shown. To get these values, I sampled states uniformly from the range 0.35-0.65 as inputs, and plotted the predicted q-value for each action (the three coloured lines). The loss is computed from the difference between these predictions and the training data in the first figure, so these lines should match up with the points in that figure. However, clearly they do not: they predict roughly the right magnitude, but the shapes of the lines are not consistent with the training data.

[Figure 2: predicted q-values for each action (three coloured lines) over states sampled from 0.35-0.65]

In the third figure below, I now plot the training loss for this network. So, it appears that training has converged, even though there is still significant error between the training data and the predictions. I am using mean squared error loss.

[Figure 3: training loss (mean squared error) over training steps]

Finally, here are some details on my network itself. I use one hidden layer of 16 nodes, each with a ReLU activation, and a final layer of 3 nodes, one for each action. Although this is a simple network, from my prior experiments it has enough capacity to learn the simple function in the first figure. I use the AdamOptimizer for optimisation with a learning rate of 0.01; I have also tried a learning rate of 0.001, which also does not give the desired results. My batch size is 32. As in the standard Deep Q Network approach, I use a separate target network to predict the q-values of the next states used in the Bellman equation, distinct from the network being trained.
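In current Keras, the architecture described above would look roughly like this (a sketch, not my exact training code; I show the 0.01 learning-rate setting):

```python
from tensorflow import keras

def build_q_network(learning_rate=0.01):
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(1,)),  # 1-D state input
        keras.layers.Dense(3),                                        # one linear q-value per action
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model

model = build_q_network()          # network being trained
target_model = build_q_network()   # separate target network
target_model.set_weights(model.get_weights())  # synced periodically
```

Each update then fits `model` on the Bellman targets from the sketch above, e.g. `model.fit(states, targets, batch_size=32, epochs=1, verbose=0)`.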

So, I am puzzled as to what is going on. My network should be able to fit the training data, but it is not able to. I am not sure whether this is a problem with my optimisation, my network architecture, or my q-learning algorithm. My intuition is that the learning rate is too low, because the predicted q-values show a shape significantly different from the training q-values, and so the network needs to be bumped significantly in the right direction. However, I am already using a learning rate of 0.01, which is already quite high.

Another thought I had is that the training error is already quite small ($3 \times 10^{-4}$), and so it may not be possible to reduce it much further. If so, I would need to increase the scale of the predicted q-values (e.g. by increasing the scale of the reward), so that the absolute error is much larger and the optimiser can compute effective gradients. Does this make sense?

Any suggestions?

Karnivaurus
  • How much training data is that? i.e. if $D = \{s_{i},a_{i},r_{i},s'_{i}\}_{i=1}^{n}$, what is n? – rnoodle May 18 '17 at 14:22
  • Also, are the transitions in that training data stochastic or deterministic, i.e. does $s'_{t}=p(s_{t},a_{t})$, or does $s'_{t}\sim p(\cdot|s_{t},a_{t})$? Not so important, but is the initial state drawn from a small state distribution around 0.5? I would also experiment with a slightly bigger architecture and do grid searches on the hyperparameters (learning rate etc.) – rnoodle May 18 '17 at 14:28
  • For the training data, my experience replay buffer has 400 entries (which is what each minibatch is sampled from). But the buffer itself is updated regularly, and over the course of this training it has received over 100k transitions (n = 100k). – Karnivaurus May 18 '17 at 14:35
  • For the transition model, this is deterministic. Every state and action pair will result in the same next state. – Karnivaurus May 18 '17 at 14:36
  • DQN targets should be samples of the optimal Bellman equation, $r(s,a)+\gamma \max_{a'\in\mathcal{A}}Q(s',a')$. Are you replacing $\max_{a'\in\mathcal{A}}Q(s',a')$ with the training Q's in your graph? Why do the states look discretised in some way? – rnoodle May 18 '17 at 14:57
  • In the first graph, each dot represents a transition sample. The x-axis is the current state, and the y-axis is the q-value as determined by the Bellman equation. The discretisation is because every time the agent "moves right", it moves by a discrete amount to the right, so the agent never actually visits states between these discrete steps. – Karnivaurus May 18 '17 at 15:05
  • So why are they grouped like that? Is it because there is a small initial state distribution? It's a bit weird. Your replay buffer is far too small for such a problem. The network will completely forget about the last 400 entries. Put all of the data into experience replay and sample uniformly. You are in batch mode remember and _not_ online with millions of samples. – rnoodle May 18 '17 at 15:43
  • I think your reward function is a bit weird, i.e. am I right in saying that it rewards an agent to oscillate far away from s=0.5? Instead you should put a 'blob' over each end state, e.g. $r(s,a)=e^{-(s-0.65)^2} + e^{-(s-0.35)^2}$ or something like that. You would possibly need to scale the exponentials to fit the task. – rnoodle May 18 '17 at 15:55

0 Answers