
I'll be doing my thesis soon on model drift detection and possible remedies in a production environment. I'll probably build an (hopefully!) intuitive theoretical framework covering the various types of model drift, their root causes and solutions. Later on I'll apply the framework to a handful of real datasets at a real ML shop, integrating it into their architecture. Many drift scenarios are pretty easy to detect and solve (for example, the distribution of a feature changing significantly over time can simply be monitored on a set of dashboards).
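
To make that simple case concrete, here's a minimal sketch of what such a monitor could look like; the two-sample Kolmogorov-Smirnov test, the window sizes and the 0.01 threshold are my own illustrative assumptions, not part of any framework:

```python
# Flag a feature whose live distribution has shifted away from the
# training-time distribution, using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def feature_drift_report(reference: np.ndarray, current: np.ndarray,
                         alpha: float = 0.01) -> dict:
    """Compare a feature's training-time distribution to its live one."""
    statistic, p_value = ks_2samp(reference, current)
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        "drifted": p_value < alpha,  # reject "same distribution" at level alpha
    }

# Example: the production window has a shifted mean, so the test flags drift.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training window
current = rng.normal(loc=0.4, scale=1.0, size=5_000)    # production window
print(feature_drift_report(reference, current))
```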

The one specific scenario that I've been thinking about recently is concept drift: the statistical properties of the target variable, i.e. the relationship between the inputs and the labels, change over time.

The best example I have is the following: you build a classifier to determine whether an outfit is fashionable or not. Initially you manually label a set amount of pictures and use that to train and test the model. Imagine that the distribution of all input features remains stable over time. What does change, though, is the 'definition' of fashionable: the same input features (an outfit) could be great in 1980 but not so much in 2021. If the model remains static, there would be a substantial drop in accuracy over time (I don't think shoulder pads and colourful tracksuits are in anymore!). There's also no straightforward automated way to retrain the model without manually relabelling a substantial set of the data.
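
To illustrate why this hurts, here's a toy simulation of the fashion example under the assumption that only the labelling rule changes while the inputs stay fixed; the synthetic rules and numbers are made up purely for illustration:

```python
# The feature distribution is identical in both eras, but the rule that
# maps features to "fashionable" has drifted, so a model trained on old
# labels degrades badly when scored against the new concept.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))              # stable feature distribution

y_1980 = (X[:, 0] + X[:, 1] > 0).astype(int)  # "fashionable" in 1980
y_2021 = (X[:, 0] - X[:, 1] > 0).astype(int)  # the definition has drifted

model = LogisticRegression().fit(X, y_1980)   # trained on the old concept

print("accuracy on 1980 labels:", accuracy_score(y_1980, model.predict(X)))
print("accuracy on 2021 labels:", accuracy_score(y_2021, model.predict(X)))
```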

Now, part of the research I want to do is to determine whether active learning can be used to incrementally retrain the model and make it resistant to concept drift. The strategy I'm thinking about goes roughly like this (a code sketch follows below):

1. The model predicts the labels of the new instances.
2. One of the various query strategies from the active learning literature selects n data points (size to be determined) to be manually labelled.
3. The manually labelled points are added to the training set. At this point all previous data points in the training set can be viewed as unlabelled (this is one approach under consideration; another option is to only mark as unlabelled those training instances similar to ones where the model and the manually labelled data disagree).
4. The labels of the manually labelled instances are propagated to the rest of the training set using, say, a nearest-neighbour approach.

This is just a rough and simplified idea of what I'm thinking of; it's still subject to change, and there are still many issues I can think of, such as how robust this approach is to noisy examples.
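
Here's a rough sketch of that loop, assuming least-confidence sampling as the query strategy and a k-NN classifier for the label propagation step; the oracle stand-in, n and k are placeholders to be tuned or replaced:

```python
# (1) score the pool by model uncertainty, (2) send the n least-confident
# instances to a human oracle, (3) propagate the fresh labels to the rest
# of the pool with a nearest-neighbour model, (4) refit on the result.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def least_confidence_query(model, X_pool: np.ndarray, n: int) -> np.ndarray:
    """Indices of the n pool instances the model is least confident about."""
    confidence = model.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:n]        # lowest confidence first

def retrain_with_propagation(model, X_pool, oracle_labels, query_idx, k=5):
    """Fit a k-NN on the human labels, spread them over the pool, refit."""
    knn = KNeighborsClassifier(n_neighbors=min(k, len(query_idx)))
    knn.fit(X_pool[query_idx], oracle_labels)
    y_propagated = knn.predict(X_pool)        # nearest-neighbour label spread
    y_propagated[query_idx] = oracle_labels   # keep the human labels exact
    return model.fit(X_pool, y_propagated)

# Usage sketch on synthetic data; the "oracle" below is a stand-in rule
# playing the role of a human labeller under the drifted 2021 concept.
rng = np.random.default_rng(1)
X_pool = rng.normal(size=(2_000, 5))
model = LogisticRegression().fit(X_pool[:100],
                                 (X_pool[:100, 0] > 0).astype(int))
idx = least_confidence_query(model, X_pool, n=50)
oracle = (X_pool[idx, 0] - X_pool[idx, 1] > 0).astype(int)
model = retrain_with_propagation(model, X_pool, oracle, idx)
```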

Concretely, I've seen a handful of papers on this specific topic, which I will read over the coming weeks. Have any of you dealt with similar issues in your professional or academic careers? Are there other, easier solutions to this problem?

For me, the key "parameters" I'd want to keep in balance are: not having an extremely convoluted/custom solution (it should be explainable to the business), not having to manually label a ton of data, and generally achieving higher accuracy than doing nothing.

Zestar75
