There are classes A, B, C, D, and E. Variable x has a different mean in each class, but the ranges of x overlap between classes. Item counts also differ between classes, e.g., there are many more A's than B's. Given an item with a known value of x but unknown class, how do I calculate the probability of it falling into class A vs. class B vs. class C, etc.?

E.g., if x is close to the mean value for class C, there may be a 60% probability of this item falling into class C, 20% for class B, 10% for class D, 7% for class A, and 3% for class E.

F.G.
  • Do you know anything more about the distribution of x in each class, other than just the mean? If you can represent x's distribution in each class by a normal distribution with known mean and variance, for example, the problem becomes much simpler. – Nuclear Hoagie Jul 31 '18 at 18:49
  • @Nuclear Wang. Yes, x's distribution within each class is known. However, it is not a normal distribution. This phenomenon tends to be very right-skewed. It becomes more normal when log-transformed. – F.G. Jul 31 '18 at 18:53

1 Answer

Your problem is one of probabilistic multiclass classification. A classical statistical approach is multinomial logistic regression. There are also many machine learning approaches, such as classification trees (CART) or random forests.
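For reference, with a single predictor the multinomial logistic model takes the following form. The symbols $\alpha_k$ and $\beta_k$ are notation introduced here for illustration (one intercept and one slope per class), not something from the question. Since x is described as right-skewed, one could plausibly substitute $\log x$ for $x$.

$$
P(\text{class} = k \mid x) = \frac{\exp(\alpha_k + \beta_k x)}{\sum_{j \in \{A, B, C, D, E\}} \exp(\alpha_j + \beta_j x)}
$$

When the model is fitted on a sample with representative class counts, the unequal class sizes are reflected in the fitted intercepts, so the resulting probabilities account for there being many more A's than B's.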

(Multinomial) logistic regression automatically outputs conditional class probabilities. For tree-based methods, you may need to specifically set a parameter. For instance, if you use the randomForest package in R, you need to call predict() on the fitted model with type="prob" (this dispatches to predict.randomForest); the default returns only the predicted class. Alternatively, use an implementation dedicated to probability estimation.
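To illustrate how multinomial logistic regression yields one probability per class, here is a minimal sketch in plain Python (chosen only to avoid dependencies; it is not the randomForest route mentioned above). Everything in it is an assumption made for illustration: three classes instead of five, synthetic log-normal data with made-up parameters and unequal class counts, log(x) as the single predictor, and a simple gradient-descent fit. In practice you would use an established implementation rather than this hand-rolled fit.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_multinomial_logit(zs, labels, n_classes, lr=0.2, epochs=500):
    """Fit one intercept a[k] and one slope b[k] per class by
    batch gradient descent on the multinomial log-loss."""
    a = [0.0] * n_classes
    b = [0.0] * n_classes
    n = len(zs)
    for _ in range(epochs):
        grad_a = [0.0] * n_classes
        grad_b = [0.0] * n_classes
        for z, y in zip(zs, labels):
            p = softmax([a[k] + b[k] * z for k in range(n_classes)])
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)
                grad_a[k] += err
                grad_b[k] += err * z
        for k in range(n_classes):
            a[k] -= lr * grad_a[k] / n
            b[k] -= lr * grad_b[k] / n
    return a, b

def class_probabilities(x, a, b):
    """P(class k | x) for each class, using log(x) as the predictor."""
    z = math.log(x)
    return softmax([a[k] + b[k] * z for k in range(len(a))])

# Synthetic, hypothetical data: right-skewed (log-normal) x per class,
# with many more A's than B's or C's.
rng = random.Random(0)
class_names = ["A", "B", "C"]
log_means = [0.0, 2.0, 4.0]   # made-up class means on the log scale
counts = [300, 100, 50]       # made-up, deliberately unequal class sizes

zs, labels = [], []
for k, (mu, n_k) in enumerate(zip(log_means, counts)):
    for _ in range(n_k):
        x = rng.lognormvariate(mu, 1.0)
        zs.append(math.log(x))  # log-transform the skewed variable
        labels.append(k)

a, b = fit_multinomial_logit(zs, labels, n_classes=3)

probs_low = class_probabilities(1.0, a, b)            # x near class A's centre
probs_high = class_probabilities(math.exp(5.0), a, b)  # x far in the upper tail

for name, p_lo, p_hi in zip(class_names, probs_low, probs_high):
    print(f"class {name}: P(class | x=1) = {p_lo:.3f}, "
          f"P(class | x=e^5) = {p_hi:.3f}")
```

Each call to `class_probabilities` returns the full vector of class probabilities for one value of x, which is the kind of output the question asks for. Because the training data contain unequal class counts, the fitted intercepts shift probability toward the more common classes.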

Stephan Kolassa