Suppose we have a machine learning model with good cross-validation and test scores.
How can we estimate whether a new input instance lies within the domain of data where the model's predictions are dependable?
To give an example: it is impossible to train a self-driving car on data from every situation it will ever face. How can we identify situations in which the model will not work well?
I can imagine simple methods for estimating whether new data is similar to the training data, e.g., based on nearest-neighbor distance (a rough sketch is below). However, such a solution might not account for feature importance and would therefore not be optimal.
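For concreteness, here is a minimal sketch of the nearest-neighbor idea I have in mind, using scikit-learn's `NearestNeighbors`. The synthetic data and the 95th-percentile cut-off are placeholders I made up for illustration, not a recommendation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in for the real training features the model was fitted on.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
nn = NearestNeighbors(n_neighbors=1).fit(X_train)

# Distance of each training point to its nearest *other* training point
# (query with k=2 and drop the zero self-distance in column 0).
d_train, _ = nn.kneighbors(X_train, n_neighbors=2)
threshold = np.quantile(d_train[:, 1], 0.95)  # arbitrary cut-off, for illustration only

# Flag new instances whose nearest-neighbor distance exceeds the threshold.
X_new = rng.normal(size=(10, 5)) + 3.0        # deliberately shifted points
d_new, _ = nn.kneighbors(X_new, n_neighbors=1)
out_of_domain = d_new[:, 0] > threshold
print(out_of_domain)
```

This treats all features equally, which is exactly the weakness I mentioned: a feature the model barely uses counts as much toward the distance as the most important one.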
I am looking for a more systematic discussion of solutions to this question.