
Suppose I have a dataset $X$ that contains both numerical and categorical features. For concreteness let's assume that one of the categorical features is a sample's color, and that it has been properly preprocessed via one-hot encoding. A property of $X$ is that about half the samples are described by one color, and the other half by two.

For instance, the data could look like:

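| Color A | Color B | Height (m) | Weight (kg) | other features |
|---------|---------|------------|-------------|----------------|
| Red     | Blue    | 0.5        | 1           | ...            |
| Green   | NaN     | 0.2        | 1.2         | ...            |
| Purple  | Red     | 0.3        | 0.5         | ...            |
| Blue    | NaN     | 0.45       | 0.75        | ...            |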

I was wondering whether it would be possible to predict the most likely second color for monochromatic samples given the information contained in $X$, and if so what is the best way to frame this question as a statistical/machine learning model?

This problem seems related to clustering, although I have a few issues with taking this point of view. If I were to naively cluster the samples in $X$ that have two colors, then nothing guarantees that the clusters would be based on the color feature only. In fact clustering would most likely group samples with different colors together, invalidating my goal from the start.

Another point of view would be to treat the monochromatic samples as having missing data. I have heard that Expectation Maximization can be used to fill in missing data, but it still does so by clustering the data with a mixture model, which brings me back to the objection in the previous paragraph.

Any guidance on how to approach this problem, if possible, would be greatly appreciated.

kjetil b halvorsen
physguy
  • I'm confused by your description. If color is a single feature, such that you can dummy-code it, how can a single case have two colors? (Don't use the word "samples" to mean "cases" because in statistics, a "sample" means a set of cases, not a single case.) – Kodiologist Jul 01 '17 at 01:11
  • @Kodiologist You can think of it as being two separate features, one of them being `primary color` and the other `secondary color`, with the latter being NaN for monochromatic observations. It's then straightforward to dummy-code the combination of the two so that some observations effectively have two colors. I'm intrigued in knowing what the most likely secondary color would be based on the information contained in $X$. – physguy Jul 01 '17 at 01:22
  • Is there anything to distinguish a case with primary color A and secondary color B from a case with primary color B and secondary color A? Or is the distinction arbitrary? – Kodiologist Jul 01 '17 at 01:29
  • Add example data to make your question easier to understand. As far as I can tell, you want a classifier for the second color, given the first? Then just train a separate classifier for each value of X! Also I disagree that one-hot encoding is "proper"... that is a hack. – Has QUIT--Anony-Mousse Jul 03 '17 at 08:31
  • I am not clear why this is not simply a classification problem to predict the second color. – G5W Jul 04 '17 at 01:12
  • @Kodiologist The distinction between color A and color B is arbitrary; both should be treated on the same footing. – physguy Jul 04 '17 at 20:03
  • @G5W I technically could train a classifier on one color to predict the other, but since both colors are on equal footing I would have to duplicate my two-color cases to account for the complete correlations $\text{color}_A \rightarrow \text{color}_B$ and $\text{color}_B \rightarrow \text{color}_A$. – physguy Jul 04 '17 at 20:04
  • @Anony-Mousse I have added example data. Classifying a color based on the other treats them on separate footing, whereas I am wondering whether there is a way to treat them as equal. Unless the distinction is unnecessary. – physguy Jul 04 '17 at 20:11
  • I guess it confuses me how cases can have no more than two colors, but you don't have a way to distinguish primary from secondary colors. Try providing [the context of your real problem](http://arfer.net/w/statqgl), which, I suspect, doesn't involve color. – Kodiologist Jul 04 '17 at 21:33
  • Since you say certain values of color A imply no color B, there does appear to be a relationship? And by any means, I cannot see any relationship to clustering here. – Has QUIT--Anony-Mousse Jul 05 '17 at 06:12
  • If you know that a sample is monochromatic, doesn't this imply that there is no 2nd color? Trying to infer it would seem to be willfully throwing away knowledge about the problem (that a 2nd color doesn't exist). Alternatively, is this actually a missing data problem? I.e. all samples are known to have 2 colors, but the 2nd hasn't been observed in some cases. – user20160 Aug 03 '19 at 23:53

1 Answer


You could try multinomial logistic regression; see for example the question "Multinomial logistic regression vs one-vs-rest binary logistic regression", and search this site for related threads. Use the color variable as a (nominal) outcome variable and the other variables as predictors.

From the fitted model you will get fitted/predicted probabilities for each color, for each case. Then look at the second largest fitted probability.
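As a rough illustration of this idea, here is a minimal sketch in plain Python. It rests on several assumptions that are not part of the answer: the multinomial model is fitted by simple batch gradient descent rather than with a statistics package, each two-color case is entered as two rows (one per color) so that both colors are treated on the same footing, and only height and weight from the question's example table are used as predictors. The function names (`fit_softmax`, `predict_proba`, `second_most_likely`) are made up for this example.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_softmax(X, y, n_classes, lr=0.5, epochs=500):
    """Fit a multinomial logistic regression by batch gradient descent.

    Returns one weight vector per class; index 0 of each is the intercept.
    """
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)
            p = softmax([sum(w * v for w, v in zip(wk, row)) for wk in W])
            for k in range(n_classes):
                err = p[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(row):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def predict_proba(W, xi):
    # Fitted probability of each color for one case.
    row = [1.0] + list(xi)
    return softmax([sum(w * v for w, v in zip(wk, row)) for wk in W])

def second_most_likely(W, xi, colors):
    # Rank colors by fitted probability and return the runner-up.
    p = predict_proba(W, xi)
    ranked = sorted(range(len(colors)), key=lambda k: p[k], reverse=True)
    return colors[ranked[1]], p[ranked[1]]

colors = ["Red", "Green", "Blue", "Purple"]

# Rows from the question's example table as (height, weight) -> color.
# Assumption: each two-color case contributes one row per color.
X = [(0.5, 1.0), (0.5, 1.0), (0.2, 1.2), (0.3, 0.5), (0.3, 0.5), (0.45, 0.75)]
y = [colors.index(c) for c in ["Red", "Blue", "Green", "Purple", "Red", "Blue"]]

W = fit_softmax(X, y, n_classes=len(colors))

# Candidate second color for the two monochromatic cases in the table.
for features in [(0.2, 1.2), (0.45, 0.75)]:
    color, prob = second_most_likely(W, features, colors)
    print(features, color, round(prob, 3))
```

With only four cases this is purely illustrative and the fitted probabilities mean very little. One possible variant, if the observed color does not come out with the largest fitted probability, is to drop the observed color from the ranking and take the most probable of the remaining ones.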

As for the problems with missing data, look into https://stats.stackexchange.com/questions/tagged/data-imputation and https://stats.stackexchange.com/questions/tagged/multiple-imputation.

kjetil b halvorsen