This slide states that extrinsic evaluation is time-consuming, usually taking days or weeks. I have been trying to understand why.
Firstly, I learned from this slide that the best way to evaluate an n-gram model is extrinsic evaluation, which means: "Embed in an application and measure the total performance of the application".
Secondly, I learned from this answer that intrinsic evaluation means "test your model by a set of testing samples, and monitor how the model is working internally", whereas extrinsic evaluation means "a model's performance can be determined by testing it using some set of testing samples (that we know their true solutions) and see how the model solves them." The latter test samples must never have been seen by the model before.
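To make this concrete, here is a minimal sketch of how I currently picture the two kinds of evaluation. It rests on my own assumptions: a toy add-one-smoothed bigram model, a made-up corpus, and a made-up word-choice "application". None of the names or data come from the slide or the answer. `perplexity` scores the model directly on held-out text, while `task_accuracy` embeds the model in a tiny task and measures how well the task is solved.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one (Laplace) smoothed P(word | prev).

    Simplification: the vocabulary size counts the boundary markers too.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentences, unigrams, bigrams):
    """Score the model directly on held-out text."""
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_prob(prev, word, unigrams, bigrams))
            n += 1
    return math.exp(-log_prob / n)

def task_accuracy(cases, unigrams, bigrams):
    """Toy 'application': pick the likelier next word among the candidates."""
    correct = 0
    for prev, candidates, gold in cases:
        guess = max(candidates, key=lambda w: bigram_prob(prev, w, unigrams, bigrams))
        correct += guess == gold
    return correct / len(cases)

# Hypothetical toy data, for illustration only.
train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
held_out = [["the", "dog", "ran"]]
cases = [("the", ["cat", "sat"], "cat"), ("cat", ["the", "sat"], "sat")]

unigrams, bigrams = train_bigram(train)
ppl = perplexity(held_out, unigrams, bigrams)
acc = task_accuracy(cases, unigrams, bigrams)
print(f"held-out perplexity: {ppl:.2f}")
print(f"toy task accuracy: {acc:.2f}")
```

In this sketch both measurements finish instantly, which is exactly why the "days or weeks" claim puzzles me.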
So my question is: if extrinsic evaluation only needs an unseen sample set, as in an end-to-end evaluation, how should I interpret the statement that extrinsic evaluation is so time-consuming that we cannot use it, even though it is the best choice? The extrinsic test dataset used by this language model is only a few megabits in size, and testing takes far less than days, let alone weeks.
Have I misunderstood what extrinsic evaluation means?