I have a binary classification problem (1 = broken, 0 = not broken) for machine engines under study. There are 25 continuous features that I use to predict 1 or 0 with a random forest (RF). These 25 features, plus the class label (26 columns in total), are time-phased: for any machine on any day, I know its feature values and its class. Some of the features (e.g. time in service) increase monotonically.
I have to simulate how we would deploy/operationalize this RF model in the lab. Our deployment method would be to train the RF model on all machines up to and including yesterday, and then use it to make predictions today on the subset of machines that have not yet broken down (the currently working machines).
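For concreteness, here is a minimal sketch of that daily retrain-and-predict loop, assuming the data sits in a pandas DataFrame with one row per machine per day; the column names `machine_id`, `date`, and `broken` are placeholders, not my actual schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def walk_forward_predict(df, feature_cols, today):
    # Train on every machine-day strictly before today.
    train = df[df["date"] < today]

    # Score only machines with no breakdown recorded before today
    # (the currently working machines).
    broken_before = df.loc[
        (df["date"] < today) & (df["broken"] == 1), "machine_id"
    ].unique()
    test = df[(df["date"] == today) & ~df["machine_id"].isin(broken_before)]

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(train[feature_cols], train["broken"])

    # Probability of class 1 (broken) for each working machine.
    return pd.Series(
        rf.predict_proba(test[feature_cols])[:, 1],
        index=test["machine_id"].to_numpy(),
    )
```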
Is there any data leakage concern inherent in this modeling/learning approach? Based on sensitivity and specificity I got good results (> 82%), and I am afraid that is due to inadvertent data leakage.
My concern is that, for a given machine, its feature values in the training set (yesterday) might be nearly identical to those in the validation set (today), because these features change slowly over time.
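One way to check this would be to measure how much each working machine's feature vector actually moves between its training row (yesterday) and its validation row (today); a rough sketch, using the same hypothetical column names as above:

```python
import pandas as pd

def consecutive_day_drift(df, feature_cols, today):
    # Pair each machine's row today with its own row from yesterday.
    yday = df[df["date"] == today - pd.Timedelta(days=1)].set_index("machine_id")
    curr = df[df["date"] == today].set_index("machine_id")
    common = yday.index.intersection(curr.index)

    # Mean standardized feature change per machine; values near 0 mean
    # the validation rows are near-duplicates of training rows.
    delta = (curr.loc[common, feature_cols] - yday.loc[common, feature_cols]).abs()
    scale = df[feature_cols].std()  # crude per-feature scale
    return (delta / scale).mean(axis=1)
```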
Now, what if I trained my RF model on all machines in the state they were in 1 year ago, and then made predictions on the working machines' state today? One year would certainly be enough time for the features of these machines to change. Would this "simulation" help establish whether my predictive model would really generalize into the future?
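A sketch of that backtest, again under the same hypothetical schema; here I read "the state they were in 1 year ago" as training on all snapshots up to a one-year-back cutoff, and I score with the same sensitivity/specificity metrics mentioned above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

def gap_backtest(df, feature_cols, today, gap=pd.Timedelta(days=365)):
    # Train only on machine states recorded up to the cutoff one year
    # back, then score today's working machines on their current state.
    train = df[df["date"] <= today - gap]

    broken_before = df.loc[
        (df["date"] < today) & (df["broken"] == 1), "machine_id"
    ].unique()
    test = df[(df["date"] == today) & ~df["machine_id"].isin(broken_before)]

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(train[feature_cols], train["broken"])
    preds = rf.predict(test[feature_cols])

    sensitivity = recall_score(test["broken"], preds, pos_label=1)
    specificity = recall_score(test["broken"], preds, pos_label=0)
    return sensitivity, specificity
```

If the one-day and one-year splits give very different sensitivity/specificity, that gap itself would indicate how much of my original 82% came from the near-duplicate rows rather than genuine generalization.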