Addressing your questions in order. I think your overview (1st paragraph) is essentially correct; the "checkpointing" intuition is how I think of this as well.
If I'm reading your question correctly, a point of confusion seems to be "why have a target net at all," so I'll explain that as well.
Do we ever optimize the target net, or does it only ever receive parameter updates via copying the behavior net parameters?
I'd say that receiving occasional parameter updates from the behavior net is how we optimize the target net. This periodic target net update is a stability technique for deep Q-networks (DQNs) and other neural RL methods called "Fixed Q-Targets" (see slides 39-40). However, I wouldn't worry too much about optimizing the target net -- its purpose is to compute estimates of the $TD(0)$ target independently of the behavior net weights.
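To make the "copy, don't optimize" point concrete, here's a minimal sketch assuming a PyTorch-style API; the layer sizes, `COPY_EVERY`, and `maybe_sync` are placeholder choices of mine, not anything from the slides:

```python
# Minimal sketch (assumes PyTorch; layer sizes and COPY_EVERY are placeholders).
import copy
import torch.nn as nn

behavior_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(behavior_net)   # starts as an exact copy
for p in target_net.parameters():
    p.requires_grad_(False)                # we never backprop into it

COPY_EVERY = 1000  # k: steps between copy operations

def maybe_sync(step):
    # The only "update" the target net ever receives: a copy of the
    # behavior net's current weights, every k steps.
    if step % COPY_EVERY == 0:
        target_net.load_state_dict(behavior_net.state_dict())
```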
A small note on on- vs. off-policy learning.
As a point of clarification, I don't think the separation of target and behavior nets is what makes this an off-policy learning technique; rather, it's that the actions we actually take are selected $\varepsilon$-greedily (i.e. we're exploring), while the "next actions" used for calculating the target values come from the greedy $\max$, so we're learning about a different policy than the one generating our experience. The top answer here does a much better job of explaining this.
Isn't the target net in general going to be less optimal than the behavior net, because the behavior net is continuously improving via the policy improvement theorem?
Perhaps, but that isn't the point of having a target vs. behavior net. In any case, I'm assuming you're in the deep neural regime based on the "nets" in your figure, so I don't think you can still assume continuous improvement. Specifically, we're only approximating the Bellman backup instead of computing it exactly (as in tabular RL).
Okay, now let's discuss the separation of target and behavior nets.
Fixed Q-Targets: Why Have a Target Net At All?
Preliminaries
Recall that DQNs are a function approximation technique where our approximator is a neural network. Let's pretend that we have only one network; we'll use $\mathbf{w}$ to represent the neural network weights. Given an experience tuple $(s, a, r, s')$, the target is our standard $TD(0)$ target, i.e. $$r + \gamma \underset{a'}{\max} \hat{Q}(s', a'; \mathbf{w}),$$
which we're trying to match to $$\hat{Q}(s, a; \mathbf{w}).$$ As is standard for regression (we're regressing $Q$), we will use MSE loss. A little calculus gives us our update
$$\Delta\mathbf{w} = \alpha (r + \gamma \underset{a'}{\max} \hat{Q}(s', a'; \mathbf{w}) - \hat{Q}(s, a; \mathbf{w})) \nabla_{\mathbf{w}} \hat{Q}(s, a; \mathbf{w}).$$
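As a hypothetical illustration of this single-network update, here's a sketch assuming a PyTorch-style `q_net` that maps a batch of states to one row of Q-values per action; `td0_update`, `gamma`, and the argument names are made up for the example:

```python
# Hypothetical sketch of the single-network TD(0) update (assumes PyTorch;
# s, s_next: float tensors of states; a: long tensor of actions; r: rewards).
import torch
import torch.nn.functional as F

gamma = 0.99  # discount factor (placeholder value)

def td0_update(q_net, optimizer, s, a, r, s_next):
    # TD(0) target: r + gamma * max_a' Q(s', a'; w). Computed without grad
    # because the update above only differentiates Q(s, a; w) -- but the
    # target values themselves still move every time w changes.
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values
    # Current estimate Q(s, a; w) for the actions actually taken.
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(pred, target)  # regressing Q with MSE, as above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```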
Non-fixed Q-targets can cause training instability
However, both the target $r + \gamma \underset{a'}{\max} \hat{Q}(s', a'; \mathbf{w})$ and our estimate $\hat{Q}(s, a; \mathbf{w})$ depend on $\mathbf{w}$, so they're correlated: every time we update $\mathbf{w}$, both values shift. This is because we have to estimate the $TD(0)$ target with the very network we're training. I've also heard it worded this way [ex. 1, ex. 2, ex. 3 (ft. a fun cow graphic)]: the network is "chasing" its own targets.
To alleviate this problem, we can remove the dependency of the target on $\mathbf{w}$; thus, the target values are now constant with respect to $\mathbf{w}$, the parameters that we're optimizing. In practice, every $k$ steps, we'd copy $\mathbf{w}$ to an identical network, and take as our update (calling the copied parameters $\mathbf{w}^-$)
$$\Delta\mathbf{w} = \alpha (r + \gamma \underset{a'}{\max} \hat{Q}(s', a'; \color{red}{\mathbf{w}^-}) - \hat{Q}(s, a; \mathbf{w})) \nabla_{\mathbf{w}} \hat{Q}(s, a; \mathbf{w}).$$
I've highlighted the part that changed in red. Of course, to help improve the target values (which we also have to estimate using our network), we have to periodically update the target network.
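Putting the two pieces together, here's a sketch of the fixed-target version under the same PyTorch-style assumptions as above (`q_net` is the behavior net with weights $\mathbf{w}$, `target_net` holds $\mathbf{w}^-$; `fixed_target_update` and the hyperparameter values are placeholders):

```python
# Hypothetical sketch of the fixed Q-target update (assumes PyTorch).
import torch
import torch.nn.functional as F

gamma, COPY_EVERY = 0.99, 1000  # placeholder hyperparameters

def fixed_target_update(step, q_net, target_net, optimizer, s, a, r, s_next):
    with torch.no_grad():
        # The target is computed with w^- (the copied weights), so it is
        # constant with respect to the parameters w that we're optimizing.
        target = r + gamma * target_net(s_next).max(dim=1).values
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically refresh w^- so the target values also improve over time.
    if step % COPY_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```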
For the reasons outlined above, "checkpointing" the behavior net weights via periodic copy operations to the target net is a key contributor to DQN training stability. Without it, DQNs often fail to converge or can even underperform linear VFA (see slides 49-50).