You might consider reinforcement learning through the method of policy gradients.
The core trick is to use a stochastic policy $\pi_{\mathbf{\theta}}$ which samples the next action $\mathbf{u}_k$ conditioned on the current state $\mathbf{x}_k$. The sequence of states and actions forms a trajectory denoted by $\mathbf{\tau} = [\mathbf{x}_{0:H},\mathbf{u}_{0:H}]$, where $H$ denotes the horizon, which can be infinite. At each instant of time, the learning system receives a reward $r_{k} = r\left( \mathbf{x}_{k},\mathbf{u}_{k}\right) \in\mathbb{R}$.
As you probably know, the general goal of policy optimization in reinforcement learning is to find policy parameters $\mathbf{\theta}\in\mathbb{R}^{K}$ that maximize the expected return
$$J\left( \mathbf{\theta}\right) = E\left\{ \sum\nolimits_{k=0}^{H} r_{k}\right\}.$$
This can be done by estimating the gradient $\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)$ and taking a gradient ascent step to obtain a better policy. So what is this gradient with respect to $\mathbf{\theta}$?
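For completeness, the resulting update rule is the standard gradient ascent step (the iteration index $i$ and step size $\alpha_{i}$ are notation I am adding here):
$$\mathbf{\theta}_{i+1} = \mathbf{\theta}_{i} + \alpha_{i}\,\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)\Big|_{\mathbf{\theta}=\mathbf{\theta}_{i}}.$$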
Assume that trajectories $\mathbf{\tau}$ are generated from the system by roll-outs, i.e. $\mathbf{\tau}\sim p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = p\left( \left. \mathbf{\tau}\right\vert \mathbf{\theta}\right)$, with return $r(\mathbf{\tau})=\sum\nolimits_{k=0}^{H}r_{k}$. This leads to
$$J\left( \mathbf{\theta}\right) = E\left\{ r(\mathbf{\tau}) \right\} = \int_{\mathbb{T}} p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\,d\mathbf{\tau}\ .$$
Now the policy gradient can be estimated using the likelihood-ratio trick, better known as the REINFORCE trick:
$$\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right).$$
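This identity is nothing more than the chain rule applied to the logarithm:
$$\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = \frac{\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)}{p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)}.$$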
In our case this leads to the following equation:
$$\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)
= \int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\,d\mathbf{\tau}
= \int_{\mathbb{T}} p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\,d\mathbf{\tau}
= E\left\{ \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\right\}.$$
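Since the gradient is again an expectation over trajectories, it can be approximated by averaging over $M$ sampled roll-outs $\mathbf{\tau}^{1},\dots,\mathbf{\tau}^{M}$ (the roll-out index $m$ is notation I am adding here):
$$\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right) \approx \frac{1}{M}\sum\nolimits_{m=1}^{M}\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}^{m}\right) r(\mathbf{\tau}^{m}).$$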
Therefore, once we have $\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)$, we can use exactly such a Monte Carlo average to estimate $\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)$. Importantly, this derivative can be computed without any knowledge of the system dynamics that generate the trajectories, because the trajectory distribution factorizes as
$$p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = p(\mathbf{x}_{0})\prod\nolimits_{k=0}^{H}p\left( \mathbf{x}_{k+1}\left\vert \mathbf{x}_{k},\mathbf{u}_{k}\right. \right) \pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right. \right),$$
which implies
$$\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = \sum\nolimits_{k=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log \pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right. \right),$$
since only the policy $\pi_{\mathbf{\theta}}$ depends on $\mathbf{\theta}$; the initial-state distribution $p(\mathbf{x}_{0})$ and the transition model $p\left( \mathbf{x}_{k+1}\left\vert \mathbf{x}_{k},\mathbf{u}_{k}\right. \right)$ drop out when the gradient of the log is taken.
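Plugging this into the Monte Carlo average above gives the practical estimator, using the same $M$ roll-outs $\mathbf{\tau}^{m}=[\mathbf{x}_{0:H}^{m},\mathbf{u}_{0:H}^{m}]$:
$$\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right) \approx \frac{1}{M}\sum\nolimits_{m=1}^{M}\left( \sum\nolimits_{k=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log \pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}^{m}\left\vert \mathbf{x}_{k}^{m}\right. \right) \right) r(\mathbf{\tau}^{m}).$$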
By choosing a multivariate normal distribution for your policy, you can operate in arbitrarily complex continuous environments in a completely model-free way. As long as the policy is stochastic and $\log\pi_{\mathbf{\theta}}$ is differentiable with respect to $\mathbf{\theta}$, this approach works with arbitrarily complex policy representations (e.g. big convolutional neural networks).
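To make this concrete, here is a minimal, self-contained sketch of the resulting algorithm (plain REINFORCE without a baseline) with a Gaussian policy whose mean is linear in the state. The toy one-dimensional system, the linear parameterization and all hyper-parameters are assumptions of mine purely for illustration:

```python
# Minimal REINFORCE sketch: Gaussian policy u ~ N(theta * x, sigma^2) on a toy
# 1-D system x_{k+1} = x_k + u_k with reward r_k = -x_{k+1}^2.
# The system, parameterization and hyper-parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

H = 10        # horizon
M = 50        # roll-outs per gradient estimate
alpha = 5e-3  # step size of the gradient ascent
sigma = 0.5   # fixed exploration noise of the Gaussian policy

def step(x, u):
    """Toy dynamics and reward."""
    x_next = x + u
    return x_next, -(x_next ** 2)

def grad_log_pi(theta, x, u):
    """Score function: d/dtheta log N(u | theta*x, sigma^2) = (u - theta*x) * x / sigma^2."""
    return (u - theta * x) * x / sigma ** 2

theta = 0.0
for _ in range(300):
    grad_estimate = 0.0
    for _ in range(M):                      # Monte Carlo average over roll-outs
        x = rng.standard_normal()           # x_0 ~ p(x_0)
        score_sum, ret = 0.0, 0.0
        for _ in range(H):
            u = theta * x + sigma * rng.standard_normal()  # sample action from the policy
            score_sum += grad_log_pi(theta, x, u)
            x, r = step(x, u)
            ret += r
        grad_estimate += score_sum * ret    # grad log p_theta(tau) * r(tau)
    grad_estimate /= M                      # no baseline is used, so this estimate is noisy
    theta += alpha * grad_estimate          # gradient ascent on J(theta)

print("learned feedback gain theta =", theta)  # should end up near -1 for this toy system
```

The same structure carries over to high-dimensional policies: only `grad_log_pi` changes (e.g. it is computed by backpropagation through a neural network), while the roll-out loop and the gradient estimator stay identical.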