I have a binary classification problem (1 = broken, 0 = not broken) for machine engines under study. There are 25 continuous features that I use to predict 1 or 0 with a random forest (RF). These 25 features, plus the class label (26 columns in total), are time-phased: for any machine on any day, I know its feature values and its class. Some of the features (e.g. time in service) increase monotonically.
I have to simulate how we would deploy/operationalize this RF model in the lab. Our deployment method would be to train the RF model on all machines up to and including yesterday, and then use it to make predictions today on the subset of machines that have not yet broken down (the currently working machines).
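For concreteness, here is a minimal sketch of that daily retrain-and-predict loop, assuming the data sits in a pandas DataFrame with one row per machine per day; the column names `machine_id`, `date`, and `broken` are placeholders, not my actual schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def walk_forward_predict(df, feature_cols, today):
    # Train on every machine-day strictly before today.
    train = df[df["date"] < today]

    # Score only machines with no breakdown recorded before today
    # (the currently working machines).
    broken_before = df.loc[
        (df["date"] < today) & (df["broken"] == 1), "machine_id"
    ].unique()
    test = df[(df["date"] == today) & ~df["machine_id"].isin(broken_before)]

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(train[feature_cols], train["broken"])

    # Probability of class 1 (broken) for each working machine.
    return pd.Series(
        rf.predict_proba(test[feature_cols])[:, 1],
        index=test["machine_id"].to_numpy(),
    )
```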
Is there any data leakage concern inherent in this modeling/learning approach? Based on sensitivity and specificity I got good results (> 82%), and I am afraid that is due to inadvertent data leakage.
My concern is that, for a given machine, its feature values in the training set (yesterday) might be nearly identical to those in the validation set (today), because these features change slowly over time.
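One way to check this would be to measure how much each working machine's feature vector actually moves between its training row (yesterday) and its validation row (today); a rough sketch, using the same hypothetical column names as above:

```python
import pandas as pd

def consecutive_day_drift(df, feature_cols, today):
    # Pair each machine's row today with its own row from yesterday.
    yday = df[df["date"] == today - pd.Timedelta(days=1)].set_index("machine_id")
    curr = df[df["date"] == today].set_index("machine_id")
    common = yday.index.intersection(curr.index)

    # Mean standardized feature change per machine; values near 0 mean
    # the validation rows are near-duplicates of training rows.
    delta = (curr.loc[common, feature_cols] - yday.loc[common, feature_cols]).abs()
    scale = df[feature_cols].std()  # crude per-feature scale
    return (delta / scale).mean(axis=1)
```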
Now, what if I trained my RF model on all machines in the state they were in 1 year ago, and then made predictions on the working machines' state today? One year would certainly be enough time for the features of these machines to change. Would this "simulation" help establish whether my predictive model would really generalize into the future?
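A sketch of that backtest, again under the same hypothetical schema; here I read "the state they were in 1 year ago" as training on all snapshots up to a one-year-back cutoff, and I score with the same sensitivity/specificity metrics mentioned above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

def gap_backtest(df, feature_cols, today, gap=pd.Timedelta(days=365)):
    # Train only on machine states recorded up to the cutoff one year
    # back, then score today's working machines on their current state.
    train = df[df["date"] <= today - gap]

    broken_before = df.loc[
        (df["date"] < today) & (df["broken"] == 1), "machine_id"
    ].unique()
    test = df[(df["date"] == today) & ~df["machine_id"].isin(broken_before)]

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(train[feature_cols], train["broken"])
    preds = rf.predict(test[feature_cols])

    sensitivity = recall_score(test["broken"], preds, pos_label=1)
    specificity = recall_score(test["broken"], preds, pos_label=0)
    return sensitivity, specificity
```

If the one-day and one-year splits give very different sensitivity/specificity, that gap itself would indicate how much of my original 82% came from the near-duplicate rows rather than genuine generalization.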