
I have a question about how exactly the loss function of a Deep Q-Learning network is computed during training. I am using a 2-layer feedforward network with ReLU hidden layers and a linear output layer.

  1. Let's suppose I have 4 possible actions. Thus, the output of my network for the current state $s_t$ is $Q(s_t) \in \mathbb{R}^4$. To make it more concrete, let's assume $Q(s_t) = [1.3, 0.4, 4.3, 1.5]$.
  2. Now I take the action $a_t = 2$, i.e. the 3rd action, corresponding to the value $4.3$, and reach a new state $s_{t+1}$.
  3. Next, I compute the forward pass with state $s_{t+1}$ and let's say I obtain the following values at the output layer: $Q(s_{t+1}) = [9.1, 2.4, 0.1, 0.3]$. Also let's say the reward is $r_t = 2$ and $\gamma = 1.0$ (the resulting target value is worked out right after this list).
  4. Is the loss given by:

    $\mathcal{L} = (11.1- 4.3)^2$

    OR

    $\mathcal{L} = \frac{1}{4}\sum_{i=0}^3 ([11.1, 11.1, 11.1, 11.1] - [1.3, 0.4, 4.3, 1.5])^2$

    OR

    $\mathcal{L} = \frac{1}{4}\sum_{i=0}^3 ([11.1, 4.4, 2.1, 2.3] - [1.3, 0.4, 4.3, 1.5])^2$
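
For reference, the value $11.1$ appearing in the options above is just the bootstrapped target computed from the numbers in steps 1–3:

$$r_t + \gamma \max_{a'} Q(s_{t+1}, a') = 2 + 1.0 \cdot 9.1 = 11.1$$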

Thank you, and sorry I had to write this out in such a basic way... I am confused by all the notation. (I think the correct answer is the second one...)

A.D

2 Answers


After reviewing the equations a few more times, I think the correct loss is the following:

$$\mathcal{L} = (11.1 - 4.3)^2$$

My reasoning is that the Q-learning update rule in the general case only updates the Q-value for one specific $(state, action)$ pair:

$$Q(s,a) \leftarrow r + \gamma \max_{a'}Q(s',a')$$

This equation means that the update happens only for one specific $(state, action)$ pair, and for the neural Q-network that means the loss is calculated only for the one output unit that corresponds to the chosen $action$.

In the example provided, $Q(s,a) = 4.3$ and the target is $r + \gamma \max_{a'}Q(s',a') = 11.1$.
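
To make this concrete, here is a minimal NumPy sketch of that single-term loss. The array names (`q_s`, `q_s_next`) are just placeholders holding the numbers from the question, not from any particular DQN implementation:

    import numpy as np

    # Numbers from the example in the question
    q_s      = np.array([1.3, 0.4, 4.3, 1.5])   # Q(s_t): network output for the current state
    q_s_next = np.array([9.1, 2.4, 0.1, 0.3])   # Q(s_{t+1}): network output for the next state
    action   = 2                                 # index of the action taken (a_t)
    reward   = 2.0                               # r_t
    gamma    = 1.0                               # discount factor

    # Bootstrapped target: r + gamma * max_a' Q(s', a')
    target = reward + gamma * np.max(q_s_next)   # = 11.1

    # The loss involves only the output unit of the chosen action
    loss = (target - q_s[action]) ** 2           # = (11.1 - 4.3)^2 = 46.24
    print(loss)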

A.D

TLDR:

Probably won't matter unless you have a large action space.

If your loss function is MSE over the full output vector, then the calculated loss is half of the action-specific loss (for an action space of size 2). This may matter if your action space is large, and it may slow down training, since the gradient of the loss is scaled down by a factor equal to the size of your action space. The targets are constructed as follows:

        # Bootstrap targets from the next observations; zero out terminal transitions
        next_q = self.model.predict(next_obss)
        next_q[np.where(dones)] = np.zeros([self.action_shape])

        # Start from the network's own predictions so the error is zero
        # for every action except the one that was actually taken
        qs = self.model.predict(obss)
        qs[range(len(qs)), actions] = rewards + GAMMA * np.max(next_q, axis=1)

        # Fit on the full target vectors (MSE over all output units)
        h = self.model.fit(obss, qs, verbose=0)

As you mentioned, only the Q-value corresponding to the action actually performed is replaced in the target, so every other error term is zero and the numerator of the loss stays the same no matter how large the action space is.

Assuming an action space of 2 (possible values: {0, 1}):

$$\mathcal{L} = \tfrac{1}{2}\,\lVert \mathbf{Q} - \mathbf{Q}_{\text{old}} \rVert^2 \qquad \text{(bold denotes the vector of Q-values)}$$
$$\mathcal{L} = \tfrac{1}{2}\left[(q_0 - q_{\text{old},0})^2 + (q_1 - q_{\text{old},1})^2\right]$$

If the selected action was 1, then the 0th target value is left unchanged, so its error term cancels out (and vice versa). Thus, all terms cancel except the one for the action currently performed. The denominator, however, keeps growing with the size of the action space.

So for an action space of size $n$ (here $n = 2$):

$$\text{MSE}(Q(s)) = \frac{1}{n}\,\big(\text{squared error for } Q(s,a)\big)$$
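
As a quick numerical check (a sketch reusing the example numbers from the question, so $n = 4$ actions), comparing the single-term squared error with the MSE over the full target vector:

    import numpy as np

    # Example numbers from the question (n = 4 actions)
    q_old  = np.array([1.3, 0.4, 4.3, 1.5])     # network output Q(s_t)
    target = q_old.copy()
    target[2] = 11.1                             # only the taken action's entry is replaced

    single_term = (11.1 - 4.3) ** 2              # 46.24
    mse = np.mean((target - q_old) ** 2)         # 46.24 / 4 = 11.56

    print(single_term, mse)                      # the MSE is smaller by a factor of n = 4
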
EFreak