
I'm hoping to use machine learning to predict chemical properties of various molecules. Many chemistry machine learning research papers that I come across talk about model generalizability issues related to new molecules that are not "in-distribution" with training data.

To give an "out of distribution" example in layman's terms from an entirely different field, it's like training a self-driving car on a highway with well-marked lanes, and expecting that model to know what to do when you deploy it on a golf cart in the middle of a putting green.

This particular problem in chemistry is common because we run expensive experiments on molecules that we've never seen before -- we don't bother running experiments on molecules we already know well. However, a successful chemistry predictive model would be useful because it might help avoid costly experiments whose outcomes might be obvious to the model but not to the chemist.

In chemistry, or for that matter in any domain where the representativeness of the data is much less obvious, what are some techniques to measure the degree of novelty of a new record (i.e., molecule) at inference time? For what it's worth, I'm not using simple models like linear or logistic regression, but random forests and deep learning techniques.

Ryan Zotti
  • "In distribution" is a bid misleading, since in terms of deep learning you have multidimensional space. Question then is more to detect the difference in multidim. space. Question there is not really about "in distribution", but more like how much it differs to other common observations. Each example can be imagined as a vector in multidim. space. The more these vectors differ, the higher the difference is. Evaluating this difference is a question of defining loss. In easy case this can be mean squared error. In case of multidim. this would lead to something like categorical cross-entropy. – Stat Tistician Jul 10 '20 at 21:19
  • You need to think about how to describe the new molecule. If you already have a deep learning network, you have set up the features beforehand, so the characteristics describing an observation are already known. Think of the simple case with just one feature, size: you would use MSE or MAE. A deep neural network, as in natural language processing, detects which observations are close to each other (like Paris => France, without specifying it beforehand), depending on certain features (in that case, which words appear together in a text and how close or far apart they are). – Stat Tistician Jul 10 '20 at 21:24
  • You might be interested in this paper: https://www.biorxiv.org/content/10.1101/2020.04.28.065052v1 It addresses a similar issue in genetics, where data from different experiments is confounded by e.g. the experimental setup. To remove that confounding, they train a deconfounding adversarial autoencoder to obtain a representation of the data that is not influenced by those confounders. – Sebastian Jul 10 '20 at 21:34
  • 2
    IMHO, the best answers lie in a deep understanding of the chemistry. That's why predicting the properties of novel molecules is so difficult: it's not a problem statistics can solve on its own. – whuber Jul 10 '20 at 21:52
  • 2
    @whuber I agree that a deep understanding of chemistry is probably the best approach on a per-unit basis, but human experts don't scale if you need to manually review hundreds of thousands of molecular structures in a high throughput screen (HTS). A model's lower quality results are still very useful if it can flag "obvious" reactions cheaply and quickly, so that experts can focus on the small percentage of difficult cases. Experts and models are complementary. QSAR modeling, virtual screening, and "in silco" machine learning approaches have been in use for several decades in chemistry – Ryan Zotti Jul 10 '20 at 22:30

1 Answer


The idea you are trying to describe is known as "novelty detection", and there exist many approaches to tackle it.

All you need to do is "featurize" your chemistry data (temperatures, bond lengths, volumes, energies, other molecular properties, etc.).
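
As a rough sketch of that featurization step, here is a minimal example assuming RDKit (which the question does not mention) and a handful of standard descriptors chosen purely for illustration; use whatever features your models already rely on:

```python
# Minimal featurization sketch, assuming RDKit is available; the descriptor
# choice and the toy SMILES strings below are placeholders, not recommendations.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def featurize(smiles_list):
    """Turn SMILES strings into a small matrix of physicochemical descriptors."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip structures RDKit cannot parse
            continue
        rows.append([
            Descriptors.MolWt(mol),              # molecular weight
            Descriptors.MolLogP(mol),            # estimated logP
            Descriptors.TPSA(mol),               # topological polar surface area
            Descriptors.NumRotatableBonds(mol),  # flexibility proxy
        ])
    return np.array(rows)

X_train = featurize(["CCO", "c1ccccc1", "CC(=O)O"])  # toy training molecules
```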

Then your options are many. Here are some:

1./ LOF algorithm: the (L)ocal (O)utlier (F)actor measures the local deviation of a given data point with respect to its neighbors. Computing the LOF score for a point tells you whether it is an outlier, i.e., whether it deviates from its neighbors.
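
A minimal sketch of LOF-based novelty scoring with scikit-learn (the random matrices stand in for featurized molecules, and n_neighbors=20 is only an illustrative choice):

```python
# LOF novelty detection sketch: fit on training features only, then score
# unseen points. Assumes X_train / X_new are descriptor matrices as above.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.random((500, 4))  # stand-in for featurized training molecules
X_new = rng.random((10, 4))     # stand-in for molecules seen at inference time

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)  # novelty=True enables predict() on new data
lof.fit(X_train)

labels = lof.predict(X_new)        # +1 = inlier, -1 = novel
scores = lof.score_samples(X_new)  # lower score = more novel
```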

2./ Isolation Forest modeling

An Isolation Forest isolates observations by randomly selecting a feature and a split value; anomalous points need fewer splits to isolate, so they receive higher anomaly scores.
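
A similar sketch with scikit-learn's Isolation Forest (the contamination="auto" setting and the random stand-in data are assumptions to keep the example self-contained):

```python
# Isolation Forest sketch: shorter average isolation paths => more anomalous.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.random((500, 4))  # featurized training molecules
X_new = rng.random((10, 4))     # molecules to score at inference time

iso = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
iso.fit(X_train)

labels = iso.predict(X_new)            # +1 = looks like training data, -1 = novel
scores = iso.decision_function(X_new)  # lower = more anomalous
```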

3./ t-tests

If a t-test shows a significant difference between two sets of measurements (a normal profile and a test profile), then the second set is considered to contain novel patterns.
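
For example, with SciPy (the feature values below are simulated just to make the snippet runnable; in practice they would be one descriptor measured on the training set versus on the new molecules):

```python
# Welch's t-test sketch: compare one feature's distribution in the "normal"
# (training) profile against the same feature in a test profile.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_profile = rng.normal(loc=0.0, scale=1.0, size=200)  # training feature values
test_profile = rng.normal(loc=0.8, scale=1.0, size=30)     # new molecules' values

t_stat, p_value = stats.ttest_ind(normal_profile, test_profile, equal_var=False)
if p_value < 0.05:
    print(f"Significant shift (p = {p_value:.3g}); the test profile may contain novel patterns")
```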

4./ Principal Component Analysis (PCA)

PCA can be used for detecting novel patterns in data: fit the principal components on the training data and flag new points that are reconstructed poorly from those components (large reconstruction error).
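
A minimal reconstruction-error sketch (the number of components and the 95th-percentile threshold are arbitrary choices for illustration, not tuned values):

```python
# PCA novelty sketch: fit PCA on training features, then flag new points whose
# reconstruction error exceeds what is typical for the training data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((500, 4))
X_new = rng.random((10, 4))

pca = PCA(n_components=2).fit(X_train)

def reconstruction_error(X):
    X_rec = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_rec) ** 2, axis=1)

threshold = np.percentile(reconstruction_error(X_train), 95)  # assumed cutoff
is_novel = reconstruction_error(X_new) > threshold
```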

and counting...

I'll try to add to this list as time progresses.

pebox11
  • Novelty detection is one approach, but I don't believe it directly addresses the issue of describing how well the model will generalize to the new data. – Ryan Volpi Jul 15 '20 at 16:20
  • Usually in regression we use `model capacity` to describe that. The higher the capacity, the more expressive the model (i.e., it can accommodate more variation). But that is one of the fundamental challenges of ML: does my model truly generalize? So capacity needs to be tuned with respect to the amount of data at hand. Novelty detection should not be identified 100% with the concept of generalization; that is why I didn't connect it to my answer. Thanks – pebox11 Jul 15 '20 at 16:59