
I want to distinguish between two classes $C_1$ and $C_2$, where $C_2$ consists of three subclasses $C_{2,1}$, $C_{2,2}$ and $C_{2,3}$. It is easy to generate samples of types $C_{2,1}$, $C_{2,2}$ and $C_{2,3}$, but hard to generate them for $C_1$.

The basic classification problem is to decide whether a given sensor signal stems from an event in $C_1$ or in $C_2$. I use 6 features of the signal (such as mean, standard deviation, integral, ...) as inputs to the classification algorithm.

I would be interested in advice about how to deal with this situation.

To me there are two natural approaches for distinguishing between the two classes $C_1$ and $C_2$:

1) Train a single classifier on $n$ samples of $C_1$ and $n/3$ samples each of $C_{2,1}$, $C_{2,2}$ and $C_{2,3}$.

2) Train 3 classifiers, distinguishing between $C_1$ and $C_{2,1}$, $C_1$ and $C_{2,2}$, and $C_1$ and $C_{2,3}$, where each classifier is trained on $n$ samples, and then report $C_1$ as the outcome if all three (or maybe 2 of them) report $C_1$.

How would you approach such a situation?

ttnphns
user695652
  • Could you provide more information on the input variables of the model? – spdrnl May 12 '15 at 14:14
  • @spdrnl Thanks for the remark, I've edited the post accordingly. – user695652 May 12 '15 at 14:20
  • If I understand correctly then you have multiple sensors. Each of these sensors gives a signal. Based on the signal, you would like to classify the signal as stemming from one of the classes. Do the classes change over time? – spdrnl May 12 '15 at 14:50
  • This question may be partially relevant, perhaps http://stats.stackexchange.com/q/17017/3277. – ttnphns May 12 '15 at 14:59
  • @ttnphns indeed. For an online setting one might consider a particle filter/HMM using the previously observed values of the classes – spdrnl May 12 '15 at 15:09
  • @spdrnl "Based on the signal, you would like to classify the signal as stemming from one of the classes", yes that's exactly it. The classes are static. – user695652 May 12 '15 at 15:33

2 Answers


Based on the input you could try a classification model based on prior knowledge about the distributions. This linear discriminant analysis (LDA) picture, based on a multivariate example, shows the gist.

[Image: LDA plot]

(The image is taken from https://stackoverflow.com/questions/17001375/plot-linear-discriminant-analysis-in-r)

In the univariate case this reduces to choosing the class that has the highest conditional probability.
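As a minimal sketch of that univariate rule, assume each class has a Gaussian class-conditional density for a single feature. The means, standard deviations and priors below are made-up illustration values, not estimates from the sensor data, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x, class_params, priors):
    """Posterior P(class | x) via Bayes' rule, for every class."""
    joint = {
        label: priors[label] * gaussian_pdf(x, mean, std)
        for label, (mean, std) in class_params.items()
    }
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


def classify(x, class_params, priors):
    """Pick the class with the highest posterior probability."""
    post = posterior(x, class_params, priors)
    return max(post, key=post.get)


# Assumed (mean, std) per class and assumed priors, for illustration only.
class_params = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}
priors = {"C1": 0.5, "C2": 0.5}

print(classify(0.5, class_params, priors))
print(classify(2.5, class_params, priors))
```

With equal priors and equal standard deviations the decision boundary sits halfway between the two means; unequal priors shift it toward the less likely class.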

Based on these conditional probabilities it is also possible to create a particle filter. This is a solution for an online setting, which would resemble your running sensor data.

A particle filter is a software-friendly implementation of a hidden Markov model using resampling. A nice explanation is given here: https://www.youtube.com/watch?v=aUkBa1zMKv4
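For only two discrete classes the filtering recursion can be computed exactly, so the sketch below uses the plain HMM forward filter rather than a particle filter (no resampling is needed). It assumes Gaussian observation densities with made-up parameters, and a hypothetical `stay_prob` parameter for the chance that the class persists between observations; `stay_prob=1.0` corresponds to static classes, where the recursion reduces to a sequential Bayes update.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def forward_filter(observations, class_params, prior, stay_prob=1.0):
    """Return the belief P(class | observations so far) after each observation."""
    labels = list(class_params)
    # Probability of switching to any one specific other class.
    switch_prob = (1.0 - stay_prob) / (len(labels) - 1)
    belief = dict(prior)
    history = []
    for x in observations:
        # Predict step: push the belief through the transition model.
        predicted = {
            c: sum(
                belief[p] * (stay_prob if p == c else switch_prob)
                for p in labels
            )
            for c in labels
        }
        # Update step: weight by the observation likelihood, then normalise.
        unnormalised = {
            c: predicted[c] * gaussian_pdf(x, *class_params[c]) for c in labels
        }
        total = sum(unnormalised.values())
        belief = {c: v / total for c, v in unnormalised.items()}
        history.append(belief)
    return history


# Assumed (mean, std) per class and an assumed uniform prior.
class_params = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}
prior = {"C1": 0.5, "C2": 0.5}

beliefs = forward_filter([2.8, 3.1, 3.3], class_params, prior)
for step, belief in enumerate(beliefs, start=1):
    print(step, belief)
```

A particle filter becomes worthwhile when the hidden state is continuous or too large to enumerate; with a handful of static classes the exact recursion above is cheaper.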

HTH

spdrnl

As I understand, you already know how to create features and classifiers, and the question is rather about the peculiarity of $C_2$ consisting of three subclasses from which you can sample freely.

Of the two options you provided, I would prefer the first one, provided that your chosen classifier can represent a decision surface more complicated than a straight line (a hyperplane in your case). In any case, the second option with only 2 votes required for $C_1$ seems inferior, as it is easy to come up with examples where it would end up classifying all points in $C_2$ as being from $C_1$: a sample from $C_{2,1}$ may look nothing like the $C_2$ data that the $C_{2,2}$ and $C_{2,3}$ classifiers were trained on, so both of them can vote $C_1$.
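One such example can be sketched with a hypothetical 2-D layout: $C_1$ sits at the origin and the three subclasses are spread around it at 120-degree intervals. Each pairwise classifier is assumed to be a simple nearest-centroid rule; the centres and function names are illustrative assumptions, not part of the original problem.

```python
import math

# Hypothetical feature-space layout: C1 at the origin, the three C2
# subclasses at distance 4 from it, 120 degrees apart.
C1_CENTER = (0.0, 0.0)
SUBCLASS_CENTERS = {
    "C2_1": (4.0, 0.0),
    "C2_2": (-2.0, 2.0 * math.sqrt(3.0)),
    "C2_3": (-2.0, -2.0 * math.sqrt(3.0)),
}


def pairwise_vote(point, subclass_center):
    """Nearest-centroid classifier for C1 versus one C2 subclass."""
    dist_to_c1 = math.dist(point, C1_CENTER)
    dist_to_subclass = math.dist(point, subclass_center)
    return "C1" if dist_to_c1 < dist_to_subclass else "C2"


def ensemble_predict(point, votes_needed_for_c1):
    """Report C1 only if enough pairwise classifiers vote C1."""
    votes = [pairwise_vote(point, center) for center in SUBCLASS_CENTERS.values()]
    return "C1" if votes.count("C1") >= votes_needed_for_c1 else "C2"


# Evaluate a typical point of each subclass under both voting rules.
for name, center in SUBCLASS_CENTERS.items():
    print(name, ensemble_predict(center, 2), ensemble_predict(center, 3))
```

In this layout a typical point of each subclass is closer to $C_1$ than to the other two subclasses, so two of the three classifiers vote $C_1$. The 2-vote rule then labels every subclass centre as $C_1$, while the unanimous rule labels them $C_2$.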

Possible improvements you may want to consider, off the top of my head: sample more points from $C_2$ and compensate for that using class weights (if your classifier allows it); and sample in different proportions (or use unequal weights) if you know a priori the expected ratios of observations from $C_1, C_{2,1}, C_{2,2}, C_{2,3}$, or if different misclassification errors have different costs.
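One way to compute such compensating weights, as a rough sketch: give every sample of a class the weight $\pi_c N / n_c$, where $n_c$ is the number of samples drawn for that class, $N$ the total sample count and $\pi_c$ the expected share of that class in real observations. The sample counts and expected ratios below are invented for illustration, and the function name is hypothetical.

```python
def compensating_weights(sample_counts, target_priors):
    """Per-sample weight for each class: target_prior * N / n_class.

    After weighting, the total weight of each class is proportional to its
    target prior, regardless of how many samples were actually drawn.
    """
    total = sum(sample_counts.values())
    return {
        label: target_priors[label] * total / count
        for label, count in sample_counts.items()
    }


# Assumed training set: few C1 samples, many cheap C2 samples.
sample_counts = {"C1": 100, "C2_1": 300, "C2_2": 300, "C2_3": 300}
# Assumed expected ratios in deployment (must sum to 1).
target_priors = {"C1": 0.5, "C2_1": 1 / 6, "C2_2": 1 / 6, "C2_3": 1 / 6}

weights = compensating_weights(sample_counts, target_priors)
for label, weight in weights.items():
    print(label, round(weight, 4), round(weight * sample_counts[label], 2))
```

The resulting weights can be attached to individual samples, or collapsed into one weight for $C_1$ and one for $C_2$, in any classifier that accepts sample or class weights. Unequal misclassification costs can be folded in by multiplying each class weight by its cost.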

Also, an obvious remark, but you can always try several things and see which gives the best accuracy.

psarka