I am performing data imputation of multiple time series using various ML techniques (such as multiple imputation and iterative imputation). I have a matrix of ~100,000 observations (rows) of 34 stations (columns) in which data is missing in intervals of different lengths. The observations are at a 30-minute frequency, and about 90% of the missing data occurs in runs of 1-4 consecutive observations (i.e., the length of the missing interval is typically between 30 minutes and 2 hours). However, some missing intervals are much longer, even a month or a year.

I would like to evaluate the different models I am using in a way that accounts for the length of the missing interval. So far I have performed 10-fold cross-validation on the non-missing data only, of course: in each fold I set 1/10 of the non-missing data to NaN, perform the imputation on ALL the data, and then evaluate on the held-out part. However, this shuffles the non-missing data and only creates artificial gaps of the shortest possible length (a single 30-minute observation). I would like to evaluate how well a model would impute a longer missing interval, such as a missing month.
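For reference, here is roughly the scheme I am using now. This is a minimal sketch rather than my actual code: `IterativeImputer` from scikit-learn stands in for whichever imputation model is being evaluated, and `cv_mask_evaluate` is just a name made up for this post.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

def cv_mask_evaluate(X, n_folds=10):
    """10-fold CV over the observed entries: each fold hides 1/10 of the
    non-missing cells, imputes the FULL matrix, then scores on the hidden cells."""
    obs_rows, obs_cols = np.where(~np.isnan(X))   # indices of observed cells
    order = rng.permutation(len(obs_rows))        # shuffle the observed cells
    rmses = []
    for fold in np.array_split(order, n_folds):
        X_masked = X.copy()
        r, c = obs_rows[fold], obs_cols[fold]
        true_vals = X[r, c]
        X_masked[r, c] = np.nan                   # hide this fold
        X_imputed = IterativeImputer(max_iter=10).fit_transform(X_masked)
        rmses.append(np.sqrt(np.mean((X_imputed[r, c] - true_vals) ** 2)))
    return np.mean(rmses)
```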
I thought about repeatedly sampling chunks of a desired length from the non-missing data (with replacement) and evaluating based on those; a rough sketch of what I have in mind is given after the questions below. But several questions arose that I could not find an answer to:
- Let's say I randomly generate missing intervals of one month (that is, 1440 consecutive missing observations per station (column)). Assuming about 20% is missing on average in each column, I have about 80,000 non-missing observations per column that can be used for training. I don't think 1440 samples out of 80,000 is much for evaluation. But what is the right number?
- Unlike 1440 missing observations spread randomly, a sampled chunk of 1440 consecutive observations is not representative of the population. On the other hand, such contiguous gaps are exactly what my imputation has to deal with anyway.
- Assuming this is OK to do, what would be a reasonable number of times to resample each chunk size? My computational resources are quite limited.
- Is this considered repeated holdout? Bootstrapping? Do I need to use the 0.632+ rule?
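To make the chunk idea concrete, this is the kind of procedure I have in mind. Again a minimal sketch under my assumptions: `block_len=1440` corresponds to one month at 30-minute resolution, `n_repeats` is exactly the open question above, and the names are placeholders (same imports and `rng` as in the sketch above).

```python
def block_holdout_evaluate(X, block_len=1440, n_repeats=20):
    """Repeated block holdout: each repeat hides one contiguous block of
    `block_len` fully observed values in a randomly chosen column, imputes
    the full matrix, and scores only on that hidden block."""
    n_rows, n_cols = X.shape
    rmses = []
    for _ in range(n_repeats):
        # draw candidate blocks until one contains no real NaNs
        # (assumes such a block exists in at least one column)
        while True:
            col = rng.integers(n_cols)
            start = rng.integers(n_rows - block_len)
            block = slice(start, start + block_len)
            if not np.isnan(X[block, col]).any():
                break
        X_masked = X.copy()
        true_vals = X[block, col].copy()
        X_masked[block, col] = np.nan             # artificial long gap
        X_imputed = IterativeImputer(max_iter=10).fit_transform(X_masked)
        rmses.append(np.sqrt(np.mean((X_imputed[block, col] - true_vals) ** 2)))
    return np.mean(rmses), np.std(rmses)
```

The plan would be to run this for several values of `block_len` (e.g. 1, 4, 48, 1440 observations) so the error can be compared across gap lengths.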
Many thanks in advance.