
Is it possible to apply policy gradient methods if the policy is not differentiable with respect to its parameters? If not, is there another algorithm for optimizing this type of policy?

One example I'm thinking about is a hard boundary: if $W^T x > 0$ then take action $a_0$, and if $W^T x \leq 0$ then take action $a_1$. Here the parameter is the vector $W$ and the policy is not differentiable.
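For concreteness, a rough Python sketch of such a policy (the function name is just illustrative):

```python
import numpy as np

def act(W, x):
    # hard threshold on W^T x: the chosen action changes discontinuously in W
    return "a0" if float(np.dot(W, x)) > 0 else "a1"
```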

I believe this question is fairly general, since most deterministic policies are non-differentiable with respect to their parameters.

DiveIntoML

1 Answer


You could try a straight-through estimator of the gradient, which simply treats $\frac{\partial \, \text{sign}(x)}{\partial x}$ as $1$ during backpropagation. You could also train a stochastic policy $\pi(a_0 \mid x) = \sigma(W^T x / \tau)$ and anneal $\tau$ from 1 to 0 over time ($\tau \to 0$ recovering the deterministic policy).
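To make these two ideas concrete, here is a rough sketch (assuming PyTorch; `STESign` and the return-weighted placeholder objective are my own illustration, not a standard API):

```python
import torch

class STESign(torch.autograd.Function):
    """Hard sign in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pretend d sign(x)/dx = 1 and pass the gradient on.
        return grad_output

W = torch.randn(4, requires_grad=True)
x = torch.randn(4)

a = STESign.apply(W @ x)     # hard decision: +1 -> take a_0, -1 -> take a_1
surrogate = -(a * 1.0)       # placeholder for a return-weighted objective
surrogate.backward()
print(W.grad)                # nonzero even though the policy is a hard threshold

# Second idea: a stochastic policy pi(a_0 | x) = sigmoid(W^T x / tau),
# differentiable in W for tau > 0, annealed toward the hard rule as tau -> 0.
tau = 0.5
p_a0 = torch.sigmoid((W @ x) / tau)
```

In an actual policy-gradient setup you would replace the placeholder objective with whatever return-weighted loss you are already using.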

Finally, you might try one of the related tricks for backpropagating through non-differentiable operations, such as VIMCO, REBAR, and RELAX.

It's fairly rare that you would actually want to force your policy to be deterministic; off the top of my head I can't think of a reason. If you just want consistent test-time behavior, you can simply fix the random seed of any stochastic policy.

shimao
  • Could you elaborate on the first point? What does x here stand for? Isn't the derivative of sign(x) equal to 0 everywhere except at x = 0? – DiveIntoML Feb 17 '20 at 19:09
  • I would like to force the policy to be deterministic in order to have a fair comparison with my current manually designed rules. Since I cannot generate an infinite amount of data, the stochastic component of the policy just makes the comparison more difficult, even with a fixed seed. – DiveIntoML Feb 17 '20 at 19:11
  • Although the policy from policy gradient is always stochastic, aren't all policies from Q-learning deterministic? For a given state s, you get a deterministic Q-function, which gives you a fixed action via argmax. – DiveIntoML Feb 17 '20 at 19:12
  • 1. Yes, the derivative of the sign function is $0$ everywhere except at $x = 0$, where it is undefined ($x$ here is the pre-activation $W^T x$), but it is possible to train successfully with *biased estimators* of the gradient (the straight-through estimator being one of them). 2. Not sure why a stochastic policy with a fixed seed makes the comparison more difficult. 3. Q-learning trains deterministic policies, yes, but of course it is no longer a policy gradient method. – shimao Feb 17 '20 at 20:22