The comments give good explanations. I also want to point out that what we really want to know is: given some training data $\mathcal{T}$, what is the expected loss $L(f(X),Y)$ for our model $f$ trained on this particular data, i.e., the conditional expectation $\mathbb{E}_{XY}\big[L(f(X),Y) \mid \mathcal{T}\big]$?
However, if we only use cross-validation errors, we don't end up estimating this; instead we estimate the marginal expectation over all possible training sets, i.e., $\mathbb{E}_{\mathcal{T}}\big[\mathbb{E}_{XY}[L(f(X),Y) \mid \mathcal{T}]\big]$.
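To make the distinction concrete, in the notation of Hastie et al. (*The Elements of Statistical Learning*, §7.12) the two targets are

$$\mathrm{Err}_{\mathcal{T}} = \mathbb{E}_{XY}\big[L(f(X),Y)\mid \mathcal{T}\big] \qquad\text{vs.}\qquad \mathrm{Err} = \mathbb{E}_{\mathcal{T}}\big[\mathrm{Err}_{\mathcal{T}}\big],$$

and cross-validation estimates $\mathrm{Err}$, not $\mathrm{Err}_{\mathcal{T}}$.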
Why? Each CV training fold can be thought of as roughly analogous to a bootstrap sample from your available data, which in turn approximates a draw from the underlying data population (this is the key idea behind the bootstrap). So you are training your model on different samples that are approximately drawn from the overall data distribution. Since CV thereby averages over many possible training sets (only approximately, just as in bootstrapping), the average CV loss does not reflect the expected loss of the model trained on your actual training set alone; it mixes in losses from training sets you did not get, but could have.
So the only way to really get at the conditional expected loss is to train the model on your full data and then evaluate it on a large amount of fresh data; a third, held-out set of new data lets you approximate this.
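Here is a minimal simulation sketch of both points, assuming a toy linear-Gaussian data-generating process (all names and settings below are invented for illustration; scikit-learn is used for the model and CV). A large fresh sample stands in for the population, so we can approximate $\mathrm{Err}_{\mathcal{T}}$ for each simulated training set and compare it to that set's 10-fold CV estimate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 1.0          # training-set size, features, noise level
beta = rng.normal(size=p)         # fixed "true" coefficients

def draw(n_samples):
    """Draw a sample from the assumed population."""
    X = rng.normal(size=(n_samples, p))
    y = X @ beta + sigma * rng.normal(size=n_samples)
    return X, y

# Large fresh sample used to approximate the true conditional risk Err_T
X_pop, y_pop = draw(100_000)

cv_errs, cond_errs = [], []
for _ in range(200):              # 200 hypothetical training sets T
    X, y = draw(n)
    model = LinearRegression().fit(X, y)
    # Err_T: expected loss of the model trained on this particular T,
    # approximated by testing against lots of new data
    cond_errs.append(np.mean((y_pop - model.predict(X_pop)) ** 2))
    # 10-fold CV estimate computed from T alone
    mse = -cross_val_score(LinearRegression(), X, y,
                           scoring="neg_mean_squared_error", cv=10)
    cv_errs.append(mse.mean())

print(f"mean CV error:          {np.mean(cv_errs):.3f}")   # ~ marginal Err
print(f"mean conditional error: {np.mean(cond_errs):.3f}")  # close on average
print(f"corr(CV, conditional):  {np.corrcoef(cv_errs, cond_errs)[0, 1]:.3f}")
```

In runs like this the two means agree closely (CV is slightly pessimistic because each fold trains on $n(k-1)/k$ points), but the correlation between a given training set's CV error and its conditional error is typically small: CV tracks the marginal expectation $\mathrm{Err}$, not $\mathrm{Err}_{\mathcal{T}}$.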