You might consider reinforcement learning through the method of policy gradients.
The core trick is to use a stochastic policy $\pi_{\mathbf{\theta}}$ which samples the next action $\mathbf{u}_k$ conditioned on the current state $\mathbf{x}_k$. The sequence of states and actions forms a trajectory denoted by $\mathbf{\tau} = [\mathbf{x}_{0:H},\mathbf{u}_{0:H}]$, where $H$ denotes the horizon, which can be infinite. At each instant of time, the learning system receives a reward $r_{k} = r\left( \mathbf{x}_{k},\mathbf{u}_{k}\right) \in\mathbb{R}$.
As you probably know, the general goal of policy optimization in reinforcement learning is to find policy parameters $\mathbf{\theta}\in\mathbb{R}^{K}$ that maximize the expected return
$$J\left( \mathbf{\theta}\right) = E\left\{ \sum\nolimits_{k=0}^{H} r_{k}\right\}.$$
This can be done by estimating the gradient $\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)$ and taking a gradient ascent step to obtain a better policy. So what is this gradient with respect to $\mathbf{\theta}$?
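For completeness, the resulting update rule is the standard gradient ascent step (the iteration index $i$ and step size $\alpha_{i}$ are notation I am adding here):
$$\mathbf{\theta}_{i+1} = \mathbf{\theta}_{i} + \alpha_{i}\,\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)\Big|_{\mathbf{\theta}=\mathbf{\theta}_{i}}.$$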
Assume that trajectories $\mathbf{\tau}$ are generated from the system by roll-outs, i.e. $\mathbf{\tau}\sim p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = p\left( \left. \mathbf{\tau}\right\vert \mathbf{\theta}\right)$, with return $r(\mathbf{\tau})=\sum\nolimits_{k=0}^{H}r_{k}$. This leads to
$$J\left( \mathbf{\theta}\right) = E\left\{ r(\mathbf{\tau}) \right\} = \int_{\mathbb{T}} p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\,d\mathbf{\tau}\ .$$
Now the policy gradient can be estimated using the likelihood-ratio trick, better known as the REINFORCE trick:
$$\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right).$$
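This identity is nothing more than the chain rule applied to the logarithm:
$$\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = \frac{\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)}{p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)}.$$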
In our case this leads to the following equation:
$$\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)
= \int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\,d\mathbf{\tau}
= \int_{\mathbb{T}} p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\,d\mathbf{\tau}
= E\left\{ \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) r(\mathbf{\tau})\right\}.$$
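Since the gradient is again an expectation over trajectories, it can be approximated by averaging over $M$ sampled roll-outs $\mathbf{\tau}^{1},\dots,\mathbf{\tau}^{M}$ (the roll-out index $m$ is notation I am adding here):
$$\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right) \approx \frac{1}{M}\sum\nolimits_{m=1}^{M}\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}^{m}\right) r(\mathbf{\tau}^{m}).$$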
Therefore, once we have $\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)$, we can use exactly such a Monte Carlo average to estimate $\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right)$. Importantly, this derivative can be computed without any knowledge of the system dynamics that generate the trajectories, because the trajectory distribution factorizes as
$$p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = p(\mathbf{x}_{0})\prod\nolimits_{k=0}^{H}p\left( \mathbf{x}_{k+1}\left\vert \mathbf{x}_{k},\mathbf{u}_{k}\right. \right) \pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right. \right),$$
which implies
$$\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) = \sum\nolimits_{k=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log \pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right. \right),$$
since only the policy $\pi_{\mathbf{\theta}}$ depends on $\mathbf{\theta}$; the initial-state distribution $p(\mathbf{x}_{0})$ and the transition model $p\left( \mathbf{x}_{k+1}\left\vert \mathbf{x}_{k},\mathbf{u}_{k}\right. \right)$ drop out when the gradient of the log is taken.
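Plugging this into the Monte Carlo average above gives the practical estimator, using the same $M$ roll-outs $\mathbf{\tau}^{m}=[\mathbf{x}_{0:H}^{m},\mathbf{u}_{0:H}^{m}]$:
$$\mathbf{\nabla}_{\mathbf{\theta}}J\left( \mathbf{\theta}\right) \approx \frac{1}{M}\sum\nolimits_{m=1}^{M}\left( \sum\nolimits_{k=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log \pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}^{m}\left\vert \mathbf{x}_{k}^{m}\right. \right) \right) r(\mathbf{\tau}^{m}).$$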
By choosing a multivariate normal distribution for your policy, you can operate in arbitrarily complex continuous environments in a completely model-free way. As long as the policy is stochastic and $\log\pi_{\mathbf{\theta}}$ is differentiable with respect to $\mathbf{\theta}$, this approach works with arbitrarily complex policy representations (e.g. big convolutional neural networks).
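To make this concrete, here is a minimal, self-contained sketch of the resulting algorithm (plain REINFORCE without a baseline) with a Gaussian policy whose mean is linear in the state. The toy one-dimensional system, the linear parameterization and all hyper-parameters are assumptions of mine purely for illustration:

```python
# Minimal REINFORCE sketch: Gaussian policy u ~ N(theta * x, sigma^2) on a toy
# 1-D system x_{k+1} = x_k + u_k with reward r_k = -x_{k+1}^2.
# The system, parameterization and hyper-parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

H = 10        # horizon
M = 50        # roll-outs per gradient estimate
alpha = 5e-3  # step size of the gradient ascent
sigma = 0.5   # fixed exploration noise of the Gaussian policy

def step(x, u):
    """Toy dynamics and reward."""
    x_next = x + u
    return x_next, -(x_next ** 2)

def grad_log_pi(theta, x, u):
    """Score function: d/dtheta log N(u | theta*x, sigma^2) = (u - theta*x) * x / sigma^2."""
    return (u - theta * x) * x / sigma ** 2

theta = 0.0
for _ in range(300):
    grad_estimate = 0.0
    for _ in range(M):                      # Monte Carlo average over roll-outs
        x = rng.standard_normal()           # x_0 ~ p(x_0)
        score_sum, ret = 0.0, 0.0
        for _ in range(H):
            u = theta * x + sigma * rng.standard_normal()  # sample action from the policy
            score_sum += grad_log_pi(theta, x, u)
            x, r = step(x, u)
            ret += r
        grad_estimate += score_sum * ret    # grad log p_theta(tau) * r(tau)
    grad_estimate /= M                      # no baseline is used, so this estimate is noisy
    theta += alpha * grad_estimate          # gradient ascent on J(theta)

print("learned feedback gain theta =", theta)  # should end up near -1 for this toy system
```

The same structure carries over to high-dimensional policies: only `grad_log_pi` changes (e.g. it is computed by backpropagation through a neural network), while the roll-out loop and the gradient estimator stay identical.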