Suggestions on binary classifiers for high dimensional categorical data set?

Question

I have a binary classification problem with 210 binary (0/1) variables, and I am wondering how to approach it: the algorithms I have tried (logistic regression, random forests) did very poorly, predicting every case as 0. The data set has 12 861 observations, with 2810 1s and 10 051 0s.

I also tried feature selection using the Boruta algorithm (based on random forest), Pearson correlation and Cramér's V. Narrowing the list down to the 20 most important variables didn't help either (still only 0s).

Do you have any suggestions on how to tackle this? Would deep learning be suitable for such a problem?

I have also done some visualization with t-SNE and PCA but nothing interesting there.

Many thanks

Any of the models you listed are suitable. Your observation that the model predicts all 0s is an artifact of how you decided to evaluate them; see https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models and many other related posts. — Sycorax, Jul 27 '20 at 15:22
Ok, so I have trained those models and they did much better on a balanced data set. But once I make predictions on the whole (unbalanced) data set using the model trained on the balanced data set, I get a lot of false positives, and the ratio of predicted 0s and 1s is roughly the same as in the balanced training subset (50/50). Any advice? — glest, Jul 29 '20 at 09:28

Match Maker EE · Answer 1 · 2020-07-28T10:01:02.150

First, a general remark. Some datasets contain discriminative features, others much less so. It may be that all your $210$ features have very little predictive power for the classification task you are investigating.

My advice for the next step is as follows. Draw at random $50\%$ of the cases from your $1$ category (the minority class), and draw the same number of cases at random from your $0$ category. This way, you end up with a balanced training set over the two categories, e.g. $2810/2 = 1405$ cases of category $1$ and $1405$ cases of category $0$. The remaining cases are your test set, for later evaluation. You now have a training set with a prior of $P(0)=\frac{1}{2}$ and $P(1)=\frac{1}{2}$. Note that cases should be picked from each of the two classes using a random generator.
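The sampling step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `balanced_split` is a hypothetical helper, the seed is arbitrary, and the labels are rebuilt from the class counts given in the question rather than taken from the real data.

```python
import random

def balanced_split(labels, n_per_class, seed=0):
    """Return (train_idx, test_idx): n_per_class random cases of each class
    go to the training set, every remaining case goes to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = []
    for idx in by_class.values():
        train.extend(rng.sample(idx, n_per_class))  # sampling without replacement
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

# Stand-in labels with the class counts from the question (assumption: order is irrelevant).
labels = [1] * 2810 + [0] * 10051
train_idx, test_idx = balanced_split(labels, n_per_class=1405)
n_train_pos = sum(labels[i] for i in train_idx)
n_test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), n_train_pos)  # 2810 1405
print(len(test_idx), n_test_pos)    # 10051 1405
```

Note that the held-out set then contains 1405 1s among 10 051 cases, so its share of 1s is about 14%, not the roughly 22% of the full data set. That is worth keeping in mind when choosing the prior for the test set.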

Try random forests and maybe C4.5 (decision trees). You can also try logistic regression and linear discriminant (that latter only if your features are continuous numbers). The two regression classifiers are more forgiving in a large feature space with many redundant inputs than say a deep learning neural network. The decision tree algorithms perform inherent feature selection, which is why they should be tried out.

Now look first at your accuracy/error rate on this balanced training set. If the results give you confidence, you can use the approach here to map the posterior probabilities of your classifiers to the skewed situation you have in your real domain. Apply the classifier to your test set, with the appropriate prior probability there.
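One standard way to do such a mapping is a prior-shift correction by Bayes' rule. The sketch below is not necessarily the approach referred to above. It assumes the resampling changed only the class proportions (cases were drawn at random within each class), and `adjust_posterior` is a hypothetical helper name.

```python
def adjust_posterior(p_balanced, train_prior, target_prior):
    """Rescale P(y=1 | x) from a model trained with class-1 prior `train_prior`
    to a population whose class-1 prior is `target_prior`."""
    w1 = target_prior / train_prior                  # reweighting of class 1
    w0 = (1.0 - target_prior) / (1.0 - train_prior)  # reweighting of class 0
    num = p_balanced * w1
    return num / (num + (1.0 - p_balanced) * w0)

# Share of 1s in the full data set from the question.
target = 2810 / 12861
for p in (0.3, 0.5, 0.7, 0.9):
    print(p, round(adjust_posterior(p, 0.5, target), 3))
```

Under these assumptions a balanced-model score of 0.5 maps back to the base rate (about 0.22), and a case needs a balanced-model score above roughly 0.78 before its corrected probability exceeds 0.5. Applying a 0.5 cut-off directly to the uncorrected scores would therefore be expected to produce many false positives on the unbalanced data.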

Classifiers tend to train better from balanced training sets, because the variances of their parameters become smaller.

It would be nice if you reported any progress in this forum, e.g. in a comment to your question.

"that latter only if your features are continuous numbers" Linear regression does not need continuous features. — Dave, Jul 27 '20 at 15:17
Thanks a lot! Will try balanced data set and let you guys know how it went. — glest, Jul 28 '20 at 10:48

score 0 · Answer 2 · answered Jul 28 '20 at 10:32

From the description of your problem, it's clear that you have:

A low n/p ratio, since the number of observations is small relative to the number of features.
Class imbalance, since the event rate is approx. 22% (2810 of 12 861).

Both are undesirable for any modelling procedure, and you can address them separately.

To increase n/p ratio: "Feature selection is an important scientific requirement for a classifier when p is large." — Page 658, The Elements of Statistical Learning.

To reduce redundancy in your dataset, you can look at the correlation among the numerical features and remove some of them where the correlation is high, or use wrapper methods that select features based on their contribution to a model when predicting the target variable (recursive feature elimination). You can also use projection methods like SVD or PCA to get a better representation of your data in a low-dimensional space. I think random forest should work well for feature selection. Try using it after resampling your data for class imbalance.
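For 0/1 features, the correlation filter can be sketched with the phi coefficient, which is the Pearson correlation of two binary variables (its absolute value equals Cramér's V for a 2x2 table). This is a minimal illustration on made-up columns: `drop_redundant`, the 0.9 threshold and the toy data are assumptions, not part of the question.

```python
import math

def phi(a, b):
    """Phi coefficient (Pearson correlation) between two equal-length 0/1 lists."""
    n = len(a)
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    a1, b1 = sum(a), sum(b)
    denom = math.sqrt(a1 * (n - a1) * b1 * (n - b1))
    if denom == 0:
        return 0.0  # a constant column has no measurable association
    return (n * n11 - a1 * b1) / denom

def drop_redundant(columns, threshold=0.9):
    """Greedy filter: keep a column only if |phi| with every column kept so far
    is below `threshold`. The result depends on column order."""
    kept = []
    for name, values in columns.items():
        if all(abs(phi(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): f2 duplicates f1, f3 is its complement, f4 is unrelated.
columns = {
    "f1": [0, 1, 0, 1, 1, 0, 1, 0],
    "f2": [0, 1, 0, 1, 1, 0, 1, 0],
    "f3": [1, 0, 1, 0, 0, 1, 0, 1],
    "f4": [0, 0, 1, 1, 0, 0, 1, 1],
}
print(drop_redundant(columns))  # ['f1', 'f4']
```

A filter like this only removes pairwise redundancy. It says nothing about whether the remaining features predict the target, which is what the wrapper methods address.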

To address class imbalance:

You can try an oversampling technique to improve the class balance.
Change your performance metric. This is very important: try the F1 score, which evaluates a model better than accuracy when your data is imbalanced.
Try SMOTE to generate synthetic examples.
Decision trees often perform well on imbalanced and high-dimensional categorical datasets. The splitting rules that look at the class variable in building trees can force both classes to be addressed.
Use class weights to give more weight to minority-class observations.
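To see why the choice of metric matters, here is a small plain-Python sketch that scores an "always predict 0" classifier on the class counts from the question, and also computes class weights with the common `n_samples / (n_classes * class_count)` heuristic. The helper name and the convention of returning 0.0 for an undefined precision are assumptions made for the illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for class 1 (undefined ratios are reported as 0.0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Class counts from the question; the classifier predicts 0 for every case.
y_true = [1] * 2810 + [0] * 10051
all_zero = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, all_zero)) / len(y_true)
print(round(accuracy, 3))                     # 0.782
print(precision_recall_f1(y_true, all_zero))  # (0.0, 0.0, 0.0)

# "Balanced" class weights: each class contributes the same total weight.
n = len(y_true)
weights = {c: n / (2 * y_true.count(c)) for c in (0, 1)}
print({c: round(w, 2) for c, w in weights.items()})  # {0: 0.64, 1: 2.29}
```

The useless classifier reaches about 78% accuracy but an F1 of 0, so accuracy alone hides the failure. With the weights above, each 1 counts about 3.6 times as much as each 0 in a weighted loss.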

There is no one method which will always work in this kind of problem. So, you may want to try different alternatives to address both the issues.

Thank you my friend, will use it and comment once I get some meaningful results; so far the balanced data set performs much better than the previous df. — glest, Jul 28 '20 at 13:20

Mark Ebden · Answer 3 · 2020-07-27T20:42:09.297

score -1

If interactions among the 210 variables are a possibility, then use C4.5 or C5.0 on a balanced dataset, as MatchMakerEE's answer suggests. Random forests could be tried next if you're unhappy with the results.
If you expect no interactions among the 210 variables, or only known interactions that you can specify, instead first try factorial logistic regression with a balanced dataset.
If you're an advanced analyst who wants to consider during the modelling process a probabilistic map of relationships among the variables, consider the BN2O model and/or IDA (Intervention when the Directed acyclic graph is Absent). See for example Sections 10.2.3 and 26.6.3 of Kevin Murphy’s 2012 book, Machine Learning. Equally you could try a probabilistic expert system; see Section 10.4 (ibid).
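A toy example of why interactions matter for the choice between a main-effects model and trees: in the synthetic data below (not the asker's data) the target is the XOR of two binary features. Each feature alone shows no association with the target, while their product does. The helper `rate_difference` is a hypothetical name introduced for the illustration.

```python
def rate_difference(feature, y):
    """Mean of y where feature == 1 minus mean of y where feature == 0."""
    on = [t for f, t in zip(feature, y) if f == 1]
    off = [t for f, t in zip(feature, y) if f == 0]
    return sum(on) / len(on) - sum(off) / len(off)

# Synthetic data: every combination of (x1, x2) appears 50 times, y = x1 XOR x2.
x1 = [0, 0, 1, 1] * 50
x2 = [0, 1, 0, 1] * 50
y = [a ^ b for a, b in zip(x1, x2)]
x1_and_x2 = [a * b for a, b in zip(x1, x2)]  # explicit interaction term

print(rate_difference(x1, y))                   # 0.0
print(rate_difference(x2, y))                   # 0.0
print(round(rate_difference(x1_and_x2, y), 3))  # -0.667
```

A logistic regression with main effects only cannot pick up this pattern unless the interaction term is supplied explicitly, whereas tree-based methods can represent it through successive splits.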

edited Jul 27 '20 at 20:42

answered Jul 27 '20 at 13:30


You can improve this answer to be of more practical use to the person who formulated the question. – Match Maker EE Jul 27 '20 at 13:58
I note in your answer that you ask the OP to try random forests, logistic regression, etc -- which they say they have already tried? – Mark Ebden Jul 27 '20 at 14:03
I understand your point here, but a balanced training set can make a big difference in performance. You get a real measure of the extent to which the two classes can be distinguished from each other, without any influence of a skewed prior. – Match Maker EE Jul 27 '20 at 14:08
True, thanks. I'll mention that above. Sorry it overlaps with your answer a bit now... I am new to this site and not wanting to step on toes. – Mark Ebden Jul 27 '20 at 20:29
I will look up map of relationships, as I have not heard about them (I am not an advanced analyst though!), they sound interesting. Thanks Mark! – glest Jul 28 '20 at 10:50
