
I have a classification problem with a large dataset containing many categorical variables with multiple levels, and RF, XGBoost, and even deep learning cannot do better than 60%–70% accuracy.

The classes of the response are quite unbalanced (one class accounts for only 7%, while the other two each account for more than 40%). Even after I balance the classes with oversampling, the methods still perform poorly on the class with very few observations, so I suspect the problem may lie in the original features I used.

However, I haven't found a good way to do feature selection when so many of the predictors are categorical and only a few are continuous.

Can somebody offer advice on feature selection specific to categorical predictors (if such methods exist), or any advice on improving the accuracy? Thanks!

EmLp
    Why do you believe that an accuracy better than 60%-70% is possible for your problem? – Matthew Drury May 09 '18 at 16:20
  • Actually I'm not very sure about that. Maybe it's because people talk about the excellent performance of XGBoost and deep learning, while I got even worse accuracy from those two methods compared with RF (81% at most for RF and only 60%–70% for XGBoost and deep learning), which makes me think it may be due to the hyperparameter tuning of the two algorithms. I'm not an expert on the parameters of XGBoost and deep learning; I'm trying my best with different combinations, but still no obvious improvement in predictive performance... – EmLp May 09 '18 at 17:06
  • (Especially for the class with 7% of the observations, sensitivity is only around 30%, while for the class with 40% of the observations sensitivity can be 83%.) One post (https://shiring.github.io/machine_learning/2017/03/07/grid_search) suggests, (quote) "that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model." Thus, I think it may be due to the original features I used (or I haven't found the appropriate combination of parameters for the ML algorithms). – EmLp May 09 '18 at 17:06
  • When you talk about performance, are you talking test set or train set? Take train set performance as an upper bound for test set performance. Also, what are the dimensions of your data? – Jim May 09 '18 at 17:58
  • Hi, Jim, I mean test set performance. The response has 3 classes which are unbalanced, and there are around 20 predictor variables (3 continuous and the others are categorical). – EmLp May 09 '18 at 18:02
  • 1) Please edit and include the train set performance (aka in-sample performance). 2) Dimensions: How many cases (aka observations)? How many variables (aka features)? 3) Do you think the predictors (aka regressors) should reasonably be able to predict the outcome? 4) What is the test set performance *per outcome class*? This info will help a lot in answering your question. P.S. please include a tag (like @EmLp). It pings me that there is a reply. – Jim May 10 '18 at 11:41
  • I would review https://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless and consider whether it is possible to derive more informative features, or collect more/higher quality data, or use a model that is more appropriate for your specific task. – Sycorax Jul 03 '18 at 16:13

1 Answer


A good general benchmark for categorical variables is LightGBM, since it groups categories in an intelligent manner.
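As a minimal sketch (assuming a pandas DataFrame `X` holding the ~20 predictors and a 3-class response `y`; the column handling and hyperparameter values here are placeholders, not tuned settings), LightGBM's scikit-learn interface can consume the categorical columns directly once they are cast to the `category` dtype:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# X: DataFrame of predictors, y: 3-class response (names assumed for illustration)
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = X[cat_cols].astype("category")   # 'category' dtype columns are
                                               # treated as categorical by LightGBM

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

clf = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    class_weight="balanced",   # re-weights classes, an alternative to oversampling
)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))
```

`class_weight="balanced"` is only one way to handle the imbalance; if you prefer your oversampled training set, drop that argument and fit on the oversampled data instead.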

Regarding the unbalanced classes: you should probably not be looking at accuracy for the small classes, since their accuracy will almost always be near zero (for problems that are not simple to learn), because the predicted probability for a small class tends to be smaller than for the larger classes.

If you want to look at the performance of the small classes specifically, you can fix a threshold of, say, 1% (instead of 50%) for each small class, classify an observation as positive whenever its predicted probability exceeds that threshold, and compare the resulting accuracy with that of a uniformly random classifier; a rough sketch of this idea follows below.
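To make the thresholding concrete, here is a minimal one-vs-rest sketch in Python (my own illustration, not necessarily how you would implement it), assuming a fitted multiclass model `clf` with a `predict_proba` method, held-out data `X_test` and `y_test`, and `y_train` for estimating base rates; each class's base rate is used as its threshold, as discussed in the comments below.

```python
import numpy as np
from sklearn.metrics import recall_score

proba = clf.predict_proba(X_test)        # shape (n_obs, n_classes), rows sum to 1

for k, cls in enumerate(clf.classes_):
    threshold = np.mean(y_train == cls)              # class base rate, e.g. ~0.07
    y_true = (y_test == cls).astype(int)             # one-vs-rest view of this class
    y_pred = (proba[:, k] >= threshold).astype(int)  # positive if prob exceeds threshold
    print(f"{cls}: sensitivity at base-rate threshold = "
          f"{recall_score(y_true, y_pred):.2f}")
```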

LearnOPhile
  • Hi, thanks! I'll try LightGBM. But when predicting, we get the predicted probability of each observation belonging to each class (I have 3 classes in the response, and the 3 probabilities add up to 1 for each observation), and we assign the observation to the class with the highest probability. I can understand "fix a threshold at say 1%" for binary classification, but how should I interpret it for a multi-category response? Say class 2 is the small class and accounts for only 7%, and the predicted probabilities for one observation in the testing dataset are 0.49 (Class 1), 0.22 (C2), 0.29 (C3); how do I classify with a threshold? – EmLp May 09 '18 at 17:30
  • That's a subtle issue. You need to divide the probability space (the simplex of p1+p2+p3=1) into three sets. I don't know of any good literature on strategies to do this, but would love to hear of any. – Matthew Drury May 09 '18 at 18:55
  • The way I would do it is to look at the accuracy for each class separately – effectively considering each case as a binary classification task. And select the threshold as e.g. the base rate of that class. – LearnOPhile May 10 '18 at 06:43