I am performing data imputation of multiple time series using various ML techniques (such as multiple imputation and iterative imputation). I have a matrix of ~100,000 observations (rows) of 34 stations (columns) in which data is missing in intervals of different lengths. The observations are at a 30-minute frequency, and about 90% of the missing data occurs in runs of 1-4 consecutive observations (i.e., the length of the missing interval is typically between 30 minutes and 2 hours). However, some missing intervals are much longer, even a month or a year.

I would like to evaluate the different models I am using in a way that accounts for the length of the missing interval. So far I have performed 10-fold cross-validation on the non-missing data only, of course: in each fold I set 1/10 of the non-missing data to NaN, perform the imputation on ALL the data, and then evaluate on the held-out part. However, this shuffles the non-missing data and only creates artificial gaps of the shortest possible length (a single 30-minute observation). I would like to evaluate how well a model would impute a longer missing interval, such as a missing month.
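For reference, here is roughly the scheme I am using now. This is a minimal sketch rather than my actual code: `IterativeImputer` from scikit-learn stands in for whichever imputation model is being evaluated, and `cv_mask_evaluate` is just a name made up for this post.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

def cv_mask_evaluate(X, n_folds=10):
    """10-fold CV over the observed entries: each fold hides 1/10 of the
    non-missing cells, imputes the FULL matrix, then scores on the hidden cells."""
    obs_rows, obs_cols = np.where(~np.isnan(X))   # indices of observed cells
    order = rng.permutation(len(obs_rows))        # shuffle the observed cells
    rmses = []
    for fold in np.array_split(order, n_folds):
        X_masked = X.copy()
        r, c = obs_rows[fold], obs_cols[fold]
        true_vals = X[r, c]
        X_masked[r, c] = np.nan                   # hide this fold
        X_imputed = IterativeImputer(max_iter=10).fit_transform(X_masked)
        rmses.append(np.sqrt(np.mean((X_imputed[r, c] - true_vals) ** 2)))
    return np.mean(rmses)
```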
I thought about repeatedly sampling chunks of a desired length from the non-missing data (with replacement) and evaluating based on those; a rough sketch of what I have in mind is given after the questions below. But several questions arose that I could not find an answer to:
- Let's say I randomly generate missing intervals of one month (that is, 1440 consecutive missing observations per station (column)). Assuming about 20% is missing on average in each column, I have about 80,000 non-missing observations per column that can be used for training. I don't think 1440 samples out of 80,000 is much for evaluation. But what is the right number?
- Unlike 1440 missing observations spread randomly, a sampled chunk of 1440 consecutive observations is not representative of the population. On the other hand, such contiguous gaps are exactly what my imputation has to deal with anyway.
- Assuming this is OK to do, what would be a reasonable number of times to resample each chunk size? My computational resources are quite limited.
- Is this considered repeated holdout? Bootstrapping? Do I need to use the 0.632+ rule?
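To make the chunk idea concrete, this is the kind of procedure I have in mind. Again a minimal sketch under my assumptions: `block_len=1440` corresponds to one month at 30-minute resolution, `n_repeats` is exactly the open question above, and the names are placeholders (same imports and `rng` as in the sketch above).

```python
def block_holdout_evaluate(X, block_len=1440, n_repeats=20):
    """Repeated block holdout: each repeat hides one contiguous block of
    `block_len` fully observed values in a randomly chosen column, imputes
    the full matrix, and scores only on that hidden block."""
    n_rows, n_cols = X.shape
    rmses = []
    for _ in range(n_repeats):
        # draw candidate blocks until one contains no real NaNs
        # (assumes such a block exists in at least one column)
        while True:
            col = rng.integers(n_cols)
            start = rng.integers(n_rows - block_len)
            block = slice(start, start + block_len)
            if not np.isnan(X[block, col]).any():
                break
        X_masked = X.copy()
        true_vals = X[block, col].copy()
        X_masked[block, col] = np.nan             # artificial long gap
        X_imputed = IterativeImputer(max_iter=10).fit_transform(X_masked)
        rmses.append(np.sqrt(np.mean((X_imputed[block, col] - true_vals) ** 2)))
    return np.mean(rmses), np.std(rmses)
```

The plan would be to run this for several values of `block_len` (e.g. 1, 4, 48, 1440 observations) so the error can be compared across gap lengths.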
Many thanks in advance.