
I am attempting a classification task in which not every feature applies to every class. For instance, class A does not use feature 2, class B does not use feature 4, and class C does not use features 1 and 2:

f1 | f2 | f3 | f4 | class
10 | NA | 23 | 30 |   A
1  | 11 | 33 | NA |   B
11 | NA | 20 | 32 |   A
NA | NA | 55 | 50 |   C
6  | 9  | 18 | NA |   B
NA | NA | 49 | 45 |   C

One way I have approached this problem is converting all NAs to 0s. However, I believe this would be incorrect. For instance, class A and class C both do not use feature 2, and if we set all values in that feature space to 0, we would be incorrectly attempting to distinguish between class A and class C when there is nothing to distinguish in the first place. Imputing values for those features is not an option here (i.e., the data isn't missing per se; it's just not possible for class A to have feature 2).
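For concreteness, the zero-filling I tried amounts to something like the following sketch on the toy table above (the name `d0` is just illustrative):

```r
# Toy data frame mirroring the table above
d <- data.frame(
  f1 = c(10, 1, 11, NA, 6, NA),
  f2 = c(NA, 11, NA, NA, 9, NA),
  f3 = c(23, 33, 20, 55, 18, 49),
  f4 = c(30, NA, 32, 50, NA, 45),
  class = c("A", "B", "A", "C", "B", "C")
)

d0 <- d
d0[is.na(d0)] <- 0  # every structurally absent value becomes 0

# Classes A and C now share the constant f2 = 0, so a classifier
# "sees" a feature value where the feature simply does not exist
# for those classes.
```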

If I input the dataset as is, with the NA values still inside, I end up with the following error when using a Naive Bayes classifier from the naivebayes package in R:

Error in density.default(x, na.rm = TRUE, ...): need at least 2 points to select a bandwidth automatically

How should I go about dealing with this problem correctly?
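For reference, a minimal sketch that reproduces the setup (assuming the naivebayes package is installed and the model was fit with `usekernel = TRUE`, since the exact call is not shown above):

```r
library(naivebayes)

# Toy data frame mirroring the table above
d <- data.frame(
  f1 = c(10, 1, 11, NA, 6, NA),
  f2 = c(NA, 11, NA, NA, 9, NA),
  f3 = c(23, 33, 20, 55, 18, 49),
  f4 = c(30, NA, 32, 50, NA, 45),
  class = factor(c("A", "B", "A", "C", "B", "C"))
)

# With usekernel = TRUE, naive_bayes() estimates a kernel density per
# class and feature. A class whose feature is entirely NA contributes
# zero points after na.rm, so density() cannot select a bandwidth.
res <- tryCatch(
  naive_bayes(class ~ ., data = d, usekernel = TRUE),
  error = conditionMessage
)
```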

  • What prompted collection of these data? What is your null hypothesis? Because you know some classes have unused categories, you know probabilities of categories can't be the same across classes. – BruceET Jan 14 '19 at 08:33
  • What are you trying to achieve? If you want to classify, then the NAs are very useful: If feature 2 is NA, then the class is either A or C, but definitely not B. If you don't have too many combinations, you can cut your dataset into chunks: collect everything where feature 2 is NA and classify this into A or C using the remaining (non-NA) features. – Stephan Kolassa Jan 14 '19 at 08:56
  • Hi, just to clarify: the example with classes A, B, and C is a hypothetical one. Unfortunately, I will eventually be dealing with possibly several hundred classes for a product I am developing, so yes, there will be many combinations @StephanKolassa. As such, while developing my algorithm, I am working on smaller representative datasets such as the example given above. – Bharat Desai Jan 14 '19 at 09:08
  • OK. It mainly sounds like you have a classification task with "much" missing data, is that correct? Do [previous questions with these tags](https://stats.stackexchange.com/questions/tagged/missing-data+classification?sort=votes&pageSize=50) help? Specifically [Binary classification when many binary features are missing](https://stats.stackexchange.com/q/7982/1352)? – Stephan Kolassa Jan 14 '19 at 09:13
  • Yes, you are absolutely correct @BruceET. Would you be able to refer me to some literature or an algorithm that deals with classification with uneven class probabilities/ unused categories? – Bharat Desai Jan 14 '19 at 09:13
  • Sorry. Your objective is not clear. – BruceET Jan 14 '19 at 09:19
  • @StephanKolassa, unfortunately not helpful. Typically, most questions about missing data concern situations where some of the data in the feature space is unavailable. In most such situations, we can approximate the distribution of the available data for that feature and impute the missing values. Here, entire features might not be relevant at all to certain classes. As such, we see a dataset in which the NAs are consistent within those classes. These classes do not even have a distribution for these features, so no imputation will work. – Bharat Desai Jan 14 '19 at 09:19
  • My other option would be to completely discard features that have NAs (that is, only use features that are relevant to all classes). However, that would be a waste of information; as you correctly pointed out earlier, a feature could still be useful for classifying the classes for which it is relevant. – Bharat Desai Jan 14 '19 at 09:20
  • @BruceET My objective is a multi-class classification problem between, say, classes A, B, and C. – Bharat Desai Jan 14 '19 at 09:21
  • It looks like you are comparing apples to oranges, there is too much diversity in your data. You may need to break your data into more meaningful sets, otherwise it just makes no sense to compare these. If object A has features 1, 2 and 3, while object B has features D, E and F, what then is there to compare? – user2974951 Jan 14 '19 at 09:47
  • Agree with @user2974951. By objective, I meant: What do you really want to know--say, for practical or commercial purposes? – BruceET Jan 14 '19 at 20:56
  • Hm, @user2974951, you seem to be alluding to an extreme case in which the feature space of each class is completely unique. That is not the case here. In the more general scenario of the example I have listed above, I will be attempting to classify several tens or even hundreds of classes; while some features are shared among all the classes, some are not. Ensemble learning seems to be the best approach, whereby I train a classifier for each set of classes that shares a complete set of features. However, this is tedious, and I was looking for something neater. – Bharat Desai Jan 15 '19 at 02:01
  • @BruceET I need to identify each class, given the complete set of features. Specifically, I am using network traffic data for IoT device identification. For instance, https://dl.acm.org/citation.cfm?id=3019878 – Bharat Desai Jan 15 '19 at 02:16
  • Then it's cataloging, not testing. – BruceET Jan 15 '19 at 09:12
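The chunking idea from the comments could be sketched roughly as follows: rebuild the toy table, group rows by their pattern of observed features, and keep only the fully observed features in each chunk. In real data each chunk would contain several classes, and one classifier would then be trained per chunk (a sketch, not the asker's actual pipeline):

```r
# Toy data frame mirroring the question's table
d <- data.frame(
  f1 = c(10, 1, 11, NA, 6, NA),
  f2 = c(NA, 11, NA, NA, 9, NA),
  f3 = c(23, 33, 20, 55, 18, 49),
  f4 = c(30, NA, 32, 50, NA, 45),
  class = c("A", "B", "A", "C", "B", "C")
)

feat_cols <- setdiff(names(d), "class")

# One label per row naming its observed features,
# e.g. "f1+f3+f4" for rows where only f2 is NA
pattern <- apply(!is.na(d[feat_cols]), 1,
                 function(obs) paste(feat_cols[obs], collapse = "+"))

# Split into chunks and drop each chunk's all-NA features, so any
# ordinary classifier can be trained per chunk without NA handling.
chunks <- lapply(split(d, pattern), function(ch) {
  keep <- feat_cols[colSums(is.na(ch[feat_cols])) == 0]
  ch[, c(keep, "class")]
})
```

At prediction time, a new observation's NA pattern would select the matching sub-model, which is essentially the ensemble-of-classifiers approach mentioned above.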

0 Answers