The following is a question from an exam paper on evaluating the performance of search engines. I have looked through my textbook and close to 50 web pages, and I can't find one convincing argument for any of the cases. Can anyone help shed some light on this?
You have developed a new retrieval algorithm and want to evaluate its performance. To this end, you have crawled one billion webpages. The experiments take too long with your current infrastructure, so you randomly sample 10% of the data, run 100 queries on the sample and ask human subjects to assess the relevance of the top 100 results. After averaging, you observe the following mean recall and precision at different ranks:
Rank: 1 2 3 4 5 ... 10 20 ... 50 ... 100
Recall: 0.09 0.15 0.20 0.25 0.30 ... 0.50 0.70 ... 0.90 ... 1.00
Precision: 0.90 0.75 0.67 0.63 0.60 ... 0.50 0.35 ... 0.18 ... 0.10
Consider re-running the same experiment without sampling the data. Do you expect the following numbers to increase, decrease or stay the same:
i. Recall at rank 10.
ii. Precision at rank 10.
iii. Precision at 50% recall.
iv. Mean average precision.
v. Area under the ROC curve.
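For what it's worth, here is a small consistency check I ran on the table. Assuming the usual definitions of precision and recall at rank k, the two rows agree if each query has on average about R = 10 relevant documents in the sample; that value of R is my own inference from rank 100 (recall 1.00, precision 0.10), not something stated in the exam.

```python
# Sanity check on the table, assuming the standard definitions
#   precision@k = (# relevant in top k) / k
#   recall@k    = (# relevant in top k) / R,   R = total relevant docs per query
# which imply precision@k = recall@k * R / k.  Reading off rank 100
# (recall 1.00, precision 0.10) suggests R = 0.10 * 100 = 10 relevant
# documents per query on average -- my inference, not stated in the exam.

ranks     = [1,    2,    3,    4,    5,    10,   20,   50,   100]
recall    = [0.09, 0.15, 0.20, 0.25, 0.30, 0.50, 0.70, 0.90, 1.00]
precision = [0.90, 0.75, 0.67, 0.63, 0.60, 0.50, 0.35, 0.18, 0.10]

R = 10  # assumed average number of relevant documents per query
for k, rec, prec in zip(ranks, recall, precision):
    implied = rec * R / k
    print(f"k={k:3d}  given P@k={prec:.2f}  implied P@k={implied:.2f}")
```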
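And in case the problem is just with my understanding of the metrics, here is how I read the five quantities for a single query, as a minimal sketch with made-up relevance judgments; the toy data and helper names below are mine, not from the exam, and MAP would be the mean of the per-query average precision over all 100 queries.

```python
# Rough sketch of quantities (i)-(v) for one query, using a made-up binary
# relevance vector for the top 10 results (hypothetical data).

def precision_at(rels, k):
    return sum(rels[:k]) / k

def recall_at(rels, k, R):
    return sum(rels[:k]) / R

def precision_at_recall(rels, R, target=0.5):
    # precision at the first rank where recall reaches the target level
    for k in range(1, len(rels) + 1):
        if recall_at(rels, k, R) >= target:
            return precision_at(rels, k)
    return 0.0

def average_precision(rels, R):
    # average of precision@k over the ranks k where a relevant doc appears
    return sum(precision_at(rels, k) for k, r in enumerate(rels, 1) if r) / R

def roc_auc(rels):
    # probability that a random relevant doc is ranked above a random
    # non-relevant one (ignoring anything outside the judged list)
    pos = [i for i, r in enumerate(rels) if r]
    neg = [i for i, r in enumerate(rels) if not r]
    return sum(p < n for p in pos for n in neg) / (len(pos) * len(neg))

rels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # toy judgments, rank 1 first
R = sum(rels)                           # pretend every relevant doc is in the list
print("recall@10            =", recall_at(rels, 10, R))
print("precision@10         =", precision_at(rels, 10))
print("precision@50% recall =", precision_at_recall(rels, R, 0.5))
print("average precision    =", average_precision(rels, R))
print("ROC AUC              =", roc_auc(rels))
```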