Classification with partially "unknown" data

Question

Suppose I want to learn a classifier that takes a vector of numbers as input, and gives a class label as output. My training data consists of a large number of input-output pairs.

However, when I come to test on new data, that data is typically only partially complete. For example, if the input vector has length 100, only 30 of the elements might have values, and the rest are "unknown".

As an example of this, consider image recognition where it is known that part of the image is occluded. Or consider classification in a general sense where it is known that part of the data is corrupt. In all cases, I know exactly which elements in the data vector are the unknown parts.

I'm wondering how I can learn a classifier that would work for this kind of data? I could just set the "unknown" elements to a random number, but given that there are often more unknown elements than known ones, this does not sound like a good solution. Or, I could randomly change elements in the training data to "unknown", and train with these rather than the complete data, but this might require exhaustive sampling of all combinations of known and unknown elements.

In particular I am thinking about neural networks, but I am open to other classifiers.

https://en.m.wikipedia.org/wiki/Missing_data might be a place to start. — Hatshepsut, Sep 02 '15 at 23:58
I think that semi-supervised learning is more the case where the training data is not fully labeled. In my case, all my training data is labeled, but individual parts of the test data are "unknown". — Karnivaurus, Sep 03 '15 at 11:16
Semi-Supervised Learning with Ladder Networks: https://github.com/CuriousAI/ladder — itdxer, Nov 26 '16 at 07:56

score 3 · Answer 1 · answered Nov 26 '16 at 07:36

I think there's a reasonable way to make it work with Neural Networks.

Let your value for unknown be 0. During training, pick an input and randomly set each of its values to 0 with probability $p$, where $p$ is the fraction of inputs you expect to be missing at test time. Note that the same input will have 0s at different positions in different iterations.
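As a rough sketch of that masking step, assuming NumPy and a training batch stored as a 2-D array (the helper name `mask_inputs` is made up for illustration): a fresh mask is drawn on every call, so the same sample gets zeros in different positions on each pass.

```python
import numpy as np

def mask_inputs(batch, p, rng):
    """Set each element of `batch` to 0 independently with probability p.

    Returns the masked batch and the boolean mask of elements that were kept.
    """
    keep = rng.random(batch.shape) >= p  # True where the value stays known
    return batch * keep, keep

rng = np.random.default_rng(0)
x = np.ones((4, 10))                      # toy batch: 4 samples, 10 features
masked, keep = mask_inputs(x, p=0.7, rng=rng)
```

At test time no random mask is drawn; the elements that are actually unknown are simply set to 0.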

I haven't seen it done before, but this would be very similar to applying Dropout (a well-known regularization method in neural networks) to your input neurons instead of the hidden neurons. I don't think it's a good idea in general, but if you're forced to (as in your case), at least it's close enough theoretically to something that's known to work.

score 1 · Answer 2 · answered Sep 03 '15 at 06:44

I think there are some choices that work with any classifier:

- Impute the missing values with a single value, like the mean or median from the training set or some value predicted from the observed parts of the input, or just use a random number or a constant.
- Use several different values for the unknowns and aggregate the results, e.g. average them.
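Both choices can be sketched in a few lines of NumPy. This is only an illustration under some assumptions: `predict_proba` stands for any hypothetical callable that maps an array of shape (n, d) to class probabilities of shape (n, k), and the "several different values" are taken from randomly chosen training rows, which is just one possible way to generate fills.

```python
import numpy as np

def impute_mean(x, known, train_mean):
    """Replace the unknown entries of x with the training mean of that feature."""
    return np.where(known, x, train_mean)

def predict_multiple_imputation(predict_proba, x, known, X_train, n_draws, rng):
    """Average class probabilities over several fills of the unknown entries.

    Each fill copies the unknown entries from a randomly drawn training row
    (an assumption made for this sketch).
    """
    rows = rng.integers(0, len(X_train), size=n_draws)
    filled = np.where(known, x, X_train[rows])  # shape (n_draws, n_features)
    return predict_proba(filled).mean(axis=0)

def toy_proba(X):
    """Stand-in classifier for the demo: logistic function of the feature sum."""
    s = 1.0 / (1.0 + np.exp(-X.sum(axis=1)))
    return np.column_stack([1.0 - s, s])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
x = np.array([0.5, np.nan, -1.0])
known = np.array([True, False, True])

x_mean_filled = impute_mean(x, known, X_train.mean(axis=0))
probs = predict_multiple_imputation(toy_proba, x, known, X_train, n_draws=20, rng=rng)
```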

Apart from that, you could use tree-based classifiers (e.g. random forests): if a tree needs to evaluate a split on a missing feature, it could just pass the data down to both child nodes.
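To illustrate the "pass it down to both children" idea, here is a minimal sketch on a hand-built tree. The dict-based node layout is hypothetical, and averaging the two children with equal weight is an assumption; weighting by the fraction of training samples that went each way would be another reasonable choice.

```python
import numpy as np

def predict_tree(node, x, known):
    """Class probabilities for x; an unknown split feature follows both branches."""
    if not isinstance(node, dict):            # leaf: stored class probabilities
        return np.asarray(node, dtype=float)
    f = node["feature"]
    if not known[f]:
        # Split feature is unknown: descend both ways and average the results.
        left = predict_tree(node["left"], x, known)
        right = predict_tree(node["right"], x, known)
        return 0.5 * (left + right)
    branch = "left" if x[f] <= node["threshold"] else "right"
    return predict_tree(node[branch], x, known)

# Toy tree: split on feature 0, then on feature 1 in the right subtree.
tree = {
    "feature": 0, "threshold": 0.5,
    "left": [1.0, 0.0],
    "right": {"feature": 1, "threshold": 0.5,
              "left": [0.0, 1.0], "right": [0.5, 0.5]},
}

x = np.array([np.nan, 0.2])
known = np.array([False, True])
probs = predict_tree(tree, x, known)
```

For a forest, the same routine would be applied to every tree and the per-tree outputs averaged as usual.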

A third option is to use a generative classifier that models the full joint distribution $p(x,y)$, where $x$ are your inputs and $y$ is the classification label. With that, you would ideally marginalize over the unknown parts of $x$, i.e. you would try every value for the unknown parts of $x$ and average the outcomes, weighted by the probability of that imputation. This could be done either analytically in closed form for some classifiers, e.g. a Linear Discriminant Analysis model, or approximately by sampling the unknowns, e.g. for a Restricted Boltzmann Machine or the deep variants thereof (which are related to feedforward neural networks).
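For the LDA case the closed form is simple, because the marginal of a Gaussian over a subset of coordinates is again a Gaussian with the corresponding sub-vector of the mean and sub-matrix of the covariance. A minimal sketch, assuming the class means, shared covariance and priors have already been estimated from the complete training data (the function name is made up):

```python
import numpy as np

def lda_posterior_observed(x, known, means, cov, priors):
    """Posterior p(y | observed part of x) for a shared-covariance Gaussian model.

    The unknown coordinates are marginalized out by dropping them from the
    class means and the covariance matrix.
    """
    obs = np.flatnonzero(known)
    cov_oo = cov[np.ix_(obs, obs)]             # covariance of observed coords
    diff = x[obs] - means[:, obs]              # shape (n_classes, n_observed)
    sol = np.linalg.solve(cov_oo, diff.T).T    # cov_oo^{-1} applied to each row
    # The Gaussian normalizer is the same for every class, so it cancels.
    log_post = np.log(priors) - 0.5 * np.sum(diff * sol, axis=1)
    log_post -= log_post.max()                 # for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

means = np.array([[0.0, 0.0], [2.0, 2.0]])     # one row per class
cov = np.eye(2)
priors = np.array([0.5, 0.5])
x = np.array([0.0, np.nan])                    # second coordinate is unknown
known = np.array([True, False])
post = lda_posterior_observed(x, known, means, cov, priors)
```

Models without such a closed form would instead need the sampling approach, i.e. drawing the unknowns conditioned on the observed values and averaging the predictions.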

I don't think it would work. Take a typical example from computer vision: each pixel of an image may be associated with a different part of an object. For example, pixel (50,50) of image 1 is the eye of a cat, but the cat has moved a little in image 2, so (50,50) is just a background pixel. If the location of the NAs, i.e. the random occlusion, varies by observation, your imputation won't work. — horaceT, Jun 10 '16 at 23:58

score 0 · Answer 3 · answered Aug 24 '20 at 00:10

This solution is pretty similar to that of etal; the only change I suggest is adding extra features that indicate missing values.

For each of your features $X_i$, create two features $[X_i^1, X_i^2]$. $X_i^1$ is the real value, or the mean (or another value of your choice) if it is missing. $X_i^2$ is 1 if the value is not missing and 0 if it is missing. Your input vector will therefore be double the size.
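A small sketch of this encoding, assuming NumPy arrays and a boolean `known` mask of the same shape as the data (the function name and the interleaved column layout are choices made for this example):

```python
import numpy as np

def add_missing_indicators(X, known, fill):
    """Encode each feature as a pair (value or fill, 1 if known else 0).

    X:     array of shape (n, d); unknown entries may hold anything (e.g. NaN)
    known: boolean array of shape (n, d), True where the value is given
    fill:  array of shape (d,), e.g. the per-feature training mean
    """
    values = np.where(known, X, fill)
    n, d = X.shape
    out = np.empty((n, 2 * d))
    out[:, 0::2] = values   # X_i^1: the real value, or the fill if missing
    out[:, 1::2] = known    # X_i^2: 1 if not missing, 0 if missing
    return out

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
known = ~np.isnan(X)
fill = np.array([10.0, 20.0])   # stand-in for training-set means
encoded = add_missing_indicators(X, known, fill)
```

The indicator columns let the classifier tell a value that is really equal to the fill apart from one that was filled in.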

As in etal's answer, you need to make sure your training data is representative of the test set. While loading each training sample, apply a transformation that randomly marks values as missing with probability $p$.
