The following is a question from an exam paper on evaluating the performance of search engines. I have looked through my textbook and close to 50 web pages, and I can't find one convincing argument for any of the cases. Can anyone help shed some light on this?
You have developed a new retrieval algorithm and want to evaluate its performance. To this end, you have crawled one billion webpages. The experiments take too long with your current infrastructure, so you randomly sample 10% of the data, run 100 queries on the sample and ask human subjects to assess the relevance of the top 100 results. After averaging, you observe the following mean recall and precision at different ranks:
Rank: 1 2 3 4 5 ... 10 20 ... 50 ... 100
Recall: 0.09 0.15 0.20 0.25 0.30 ... 0.50 0.70 ... 0.90 ... 1.00
Precision: 0.90 0.75 0.67 0.63 0.60 ... 0.50 0.35 ... 0.18 ... 0.10
Consider re-running the same experiment without sampling the data. Do you expect the following numbers to increase, decrease or stay the same:
i. Recall at rank 10.
ii. Precision at rank 10.
iii. Precision at 50% recall.
iv. Mean average precision.
v. Area under the ROC curve.
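For what it's worth, here is a small consistency check I ran on the table. Assuming the usual definitions of precision and recall at rank k, the two rows agree if each query has on average about R = 10 relevant documents in the sample; that value of R is my own inference from rank 100 (recall 1.00, precision 0.10), not something stated in the exam.

```python
# Sanity check on the table, assuming the standard definitions
#   precision@k = (# relevant in top k) / k
#   recall@k    = (# relevant in top k) / R,   R = total relevant docs per query
# which imply precision@k = recall@k * R / k.  Reading off rank 100
# (recall 1.00, precision 0.10) suggests R = 0.10 * 100 = 10 relevant
# documents per query on average -- my inference, not stated in the exam.

ranks     = [1,    2,    3,    4,    5,    10,   20,   50,   100]
recall    = [0.09, 0.15, 0.20, 0.25, 0.30, 0.50, 0.70, 0.90, 1.00]
precision = [0.90, 0.75, 0.67, 0.63, 0.60, 0.50, 0.35, 0.18, 0.10]

R = 10  # assumed average number of relevant documents per query
for k, rec, prec in zip(ranks, recall, precision):
    implied = rec * R / k
    print(f"k={k:3d}  given P@k={prec:.2f}  implied P@k={implied:.2f}")
```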
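And in case the problem is just with my understanding of the metrics, here is how I read the five quantities for a single query, as a minimal sketch with made-up relevance judgments; the toy data and helper names below are mine, not from the exam, and MAP would be the mean of the per-query average precision over all 100 queries.

```python
# Rough sketch of quantities (i)-(v) for one query, using a made-up binary
# relevance vector for the top 10 results (hypothetical data).

def precision_at(rels, k):
    return sum(rels[:k]) / k

def recall_at(rels, k, R):
    return sum(rels[:k]) / R

def precision_at_recall(rels, R, target=0.5):
    # precision at the first rank where recall reaches the target level
    for k in range(1, len(rels) + 1):
        if recall_at(rels, k, R) >= target:
            return precision_at(rels, k)
    return 0.0

def average_precision(rels, R):
    # average of precision@k over the ranks k where a relevant doc appears
    return sum(precision_at(rels, k) for k, r in enumerate(rels, 1) if r) / R

def roc_auc(rels):
    # probability that a random relevant doc is ranked above a random
    # non-relevant one (ignoring anything outside the judged list)
    pos = [i for i, r in enumerate(rels) if r]
    neg = [i for i, r in enumerate(rels) if not r]
    return sum(p < n for p in pos for n in neg) / (len(pos) * len(neg))

rels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # toy judgments, rank 1 first
R = sum(rels)                           # pretend every relevant doc is in the list
print("recall@10            =", recall_at(rels, 10, R))
print("precision@10         =", precision_at(rels, 10))
print("precision@50% recall =", precision_at_recall(rels, R, 0.5))
print("average precision    =", average_precision(rels, R))
print("ROC AUC              =", roc_auc(rels))
```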