Statistical test to justify looking at a subset of data or the entire data set

Question

For a school project, we are asked to determine the expected pass rate of a driving test based on yearly data provided from 2010 to 2019. I am wondering if there is any statistical justification to determine whether I should look at the data for a particular year or several years.

I imagine I could begin with a hypothesis test of whether the mean pass rates for different years are significantly different. If they are not, the years might share the same mean, and it would therefore be sensible to look at the years combined.

However, I feel there should be a better, more formal way to investigate this, but it has not been covered in my classes so far. I would be really grateful if anyone could shed some light on the issue.

To elaborate further: if I want to justify using the 2019 mean pass rate as the expected pass rate for someone who will take the test this year, instead of the mean rate from 2010 to 2019, how might one formally do that in the statistical sense? Or is it always better to include more data?

Data-set: https://www.gov.uk/government/statistical-data-sets/car-driving-test-data-by-test-centre

Can you please elaborate on "whether I should look at the data for a particular year or several years"? What do you mean by that? — Kolmogorov, Nov 15 '20 at 18:08

score 0 · Accepted Answer · answered Nov 15 '20 at 20:52

If I understand you correctly, you are asking what the formal justification could be for using a particular dataset, or a subset of it, for making predictions. You are asked to make a prediction about the future, and there is no way to formally compare the past data to the future data, because you have no access to the latter. The justification is always informal, or semi-formal, taking into account things like the stability of the data (e.g. human height does not change rapidly over time, but stock prices fluctuate a lot), whether there are any outliers (e.g. you know that some of the data is biased because some of the records were simply lost), etc. You are looking for the data that is as similar as possible to what you are predicting (e.g. close in time), but in most cases having more data also leads to better results, so there may be a trade-off.

One semi-formal approach is something like one-step-ahead cross-validation: take a subset of the past data, say 2011-2018, to predict 2019, then 2010-2017 to predict 2018, and so on. You could then compare this with using only 2016-2018 to predict 2019, 2015-2017 to predict 2018, etc. This should give you some impression of what happens when you use subsets of different sizes, and of whether going further back in time actually improves the results or not.
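As a rough illustration of the comparison described above, here is a minimal Python sketch. It assumes the simplest possible predictor (the mean of the selected past years) and scores each choice by its absolute error on the following year. The function name `rolling_origin_errors`, its parameters, and the yearly figures are all made up for illustration; the numbers are not taken from the gov.uk data set, so you would substitute the real yearly pass rates.

```python
from statistics import mean


def rolling_origin_errors(rates, window=None, min_train=3):
    """Absolute one-step-ahead errors for a 'mean of past years' predictor.

    rates:     yearly pass rates in chronological order
    window:    how many of the most recent years to average;
               None means use every year before the target year
    min_train: number of earlier years required before the first prediction,
               so that every window setting is scored on the same target years
    """
    errors = []
    for t in range(min_train, len(rates)):
        # Only years strictly before the target year t are used.
        history = rates[:t] if window is None else rates[max(0, t - window):t]
        prediction = mean(history)
        errors.append(abs(rates[t] - prediction))
    return errors


# Hypothetical pass rates (percent) for 2010-2019, for illustration only.
rates = [46.0, 46.9, 47.1, 47.1, 46.9, 47.0, 47.1, 46.3, 45.8, 45.9]

for window in (None, 3, 1):
    errs = rolling_origin_errors(rates, window=window)
    label = "all prior years" if window is None else f"last {window} year(s)"
    print(f"{label}: mean absolute error = {mean(errs):.2f}")
```

Whichever window gives the smallest average error on the years you held out is the one with the best track record so far. With only ten yearly values the differences can easily be noise, so treat the result as an impression rather than a formal test.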
