Following David Silver's course, I came across the actor-critic family of policy improvement algorithms.
For one-step Markov decision processes, it holds that
$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s,a)\,r]$$
where $J$ is the objective being maximized (the MDP's expected reward), $\pi_\theta$ is the policy parameterized by $\theta$, and $r$ is the reward sampled after taking action $a$.
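To convince myself of this, I wrote out the likelihood-ratio step (this is my own reconstruction, assuming a state distribution $d(s)$ and writing $r_{s,a}$ for the expected reward of taking $a$ in $s$):
$$\nabla_{\theta}J(\theta)
= \nabla_{\theta}\sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)\,r_{s,a}
= \sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)\,r_{s,a}
= \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)\,r]$$
using $\nabla_{\theta}\pi_{\theta} = \pi_{\theta}\,\nabla_{\theta}\log\pi_{\theta}$.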
It also holds (the policy gradient theorem) that, for several choices of the objective $J$, the policy gradient is
$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)\,Q^{\pi_{\theta}}(s,a)]$$
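If I understand correctly, the one-step formula above is just the special case where the return after $(s,a)$ is the immediate reward, so that
$$Q^{\pi_{\theta}}(s,a) = \mathbb{E}[r \mid s,a]
\quad\Rightarrow\quad
\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)\,Q^{\pi_{\theta}}(s,a)]
= \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)\,r].$$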
David says there (at 1:06:35+), "And the actor moves in the direction suggested by the critic".
I am pretty sure that by this he means "the actor's weights are then updated in direct relation to the critic's criticism":
$$\theta \leftarrow \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s,a)\,Q_{w}(s,a)$$
where $\alpha$ is the learning rate, $\pi_{\theta}$ is the actor's policy, parameterized by $\theta$, and $Q_w$ is the critic's evaluation function, parameterized by $w$.
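To make sure I am reading this update correctly, here is a minimal sketch of how I currently picture one actor-critic step: a softmax actor with parameters $\theta$ and a separate linear critic $Q_w$ trained by TD(0). The feature sizes, learning rates, the TD(0) choice for the critic, and the toy usage loop are my own illustrative assumptions, not from the lecture.

```python
import numpy as np

class QActorCritic:
    """Sketch of one actor-critic step: softmax actor pi_theta and a
    SEPARATE linear critic Q_w(s, a) = w[a] . s (states are feature vectors)."""

    def __init__(self, n_features, n_actions,
                 alpha_actor=0.01, alpha_critic=0.05, gamma=0.99):
        self.theta = np.zeros((n_actions, n_features))  # actor parameters theta
        self.w = np.zeros((n_actions, n_features))      # critic parameters w
        self.alpha_actor = alpha_actor
        self.alpha_critic = alpha_critic
        self.gamma = gamma

    def policy(self, s):
        """pi_theta(s, .): softmax over action preferences theta[a] . s."""
        prefs = self.theta @ s
        prefs -= prefs.max()            # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def q(self, s, a):
        """Critic's evaluation Q_w(s, a)."""
        return self.w[a] @ s

    def act(self, s):
        p = self.policy(s)
        return np.random.choice(len(p), p=p)

    def update(self, s, a, r, s_next, a_next, done):
        """Critic learns Q_w by TD(0); actor takes the step
        theta <- theta + alpha * grad_theta log pi_theta(s, a) * Q_w(s, a)."""
        # --- critic: TD(0) update of w towards the one-step target ---
        target = r if done else r + self.gamma * self.q(s_next, a_next)
        td_error = target - self.q(s, a)
        self.w[a] += self.alpha_critic * td_error * s

        # --- actor: grad log pi for a softmax policy, scaled by the critic's Q_w ---
        p = self.policy(s)
        grad_log_pi = -np.outer(p, s)   # derivative of the log-normalizer, all rows
        grad_log_pi[a] += s             # plus the feature of the action actually taken
        self.theta += self.alpha_actor * self.q(s, a) * grad_log_pi


# Toy usage on a made-up 4-feature, 3-action problem with random transitions:
agent = QActorCritic(n_features=4, n_actions=3)
s = np.random.rand(4)
a = agent.act(s)
for _ in range(100):
    r = np.random.rand()                # stand-in for an environment reward
    s_next = np.random.rand(4)
    a_next = agent.act(s_next)
    agent.update(s, a, r, s_next, a_next, done=False)
    s, a = s_next, a_next
```

(The point I am stuck on is exactly the two separate parameter arrays `theta` and `w` in this sketch.)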
So far so good.
What I am not getting (basically many aspects of the same question):
Why do we need a critic at all?
I just can't see where the critic suddenly came from and what it solves.
What is the gradient of the policy $\pi$ itself, if not "the direction of improvement"? Why add the critic?
Why not use the same parameters for the actor and the critic? It seems to me they are actually approximating the same thing: "how good is choosing action $a$ from state $s$?"
Why did we replace $Q^{\pi_{\theta}}(s,a)$ with an approximation $Q_w(s,a)$ that has its own parameters $w$? What benefit does this separation introduce?