improving prediction of the minority class for imbalanced data

Question

I am trying to classify a data set X of 2000 examples (rows) and 20 features (columns), following the example code given here: https://stackoverflow.com/questions/4976539/support-vector-machines-in-matlab/4980055#4980055. The class labels are 0 or 1. I am using MATLAB 2018.

I have applied cross-validation using the HoldOut method with a 60/40 split of the data X. The data set always contains more 0's (NO) than 1's (YES). I could swap the labels, making 1 the majority, but the imbalance would remain. My data set is naturally imbalanced: in the real-world scenario it represents bank transactions, so there are more non-fraud transactions (labelled 0) than fraud transactions (labelled 1). In my case, misclassifying observations of the class labelled 1 has more severe consequences than misclassifying observations of the other class.

Question: The following is the code:

[g gn] = grp2idx(label);      %# nominal class to numeric

%# split training/testing sets

[train, test] = crossvalind('holdOut',label,0.5);
cp = classperf(label);

%% Use the fitcsvm function to train an SVM classifier with a radial basis function (RBF) kernel.

svmStruct = fitcsvm(X(train,:),label(train),'KernelFunction','rbf');


%% Classify the test set using a support vector machine.
classes = predict(svmStruct,X(test,:));


%% Evaluate the performance of the classifier.
classperf(cp,classes,test);
cp.CorrectRate
cmat = confusionmat(label(test),classes);
acc = 100*sum(diag(cmat))./sum(cmat(:));
fprintf('SVM (1-against-1):\naccuracy = %.2f%%\n', acc);
fprintf('Confusion Matrix:\n'), disp(cmat)

I am getting all predicted classes as 1 and an accuracy of 98.1%. Even though the accuracy is so high, the predicted class labels are mostly incorrect.

I know that missing a positive is worse than a false positive, so how do I fix this problem for SVM? Is there any other option in SVM which can tackle an imbalanced problem?

score 3 · Accepted Answer · answered Jun 06 '18 at 21:51

It seems like you want to specify misclassification costs: the cost of a false negative should be higher than the cost of a false positive. I am not an expert on MATLAB or the fitcsvm function, but this seems to explain how to specify misclassification costs: https://www.mathworks.com/help/stats/fitcsvm.html#bt9w6j6-Cost
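
As a rough illustration of the cost option described above, the sketch below passes a cost matrix to fitcsvm. The synthetic data, the variable name costMatrix, and the 10:1 cost ratio are assumptions made for this example only; a sensible ratio depends on the real cost of a missed fraud case relative to a false alarm.

```matlab
% Sketch: cost-sensitive RBF SVM. The synthetic data and the 10:1 cost
% ratio are illustrative assumptions, not values from the question.
rng(1);                                   % reproducible synthetic data
X = [randn(980,20); randn(20,20) + 1.5];  % 980 'no fraud' rows, 20 'fraud' rows
label = [zeros(980,1); ones(20,1)];

% Cost(i,j) is the cost of predicting class j when the true class is i,
% with rows and columns ordered as in 'ClassNames'.
costMatrix = [ 0 1;    % true 0: predicting 1 (false positive) costs 1
              10 0];   % true 1: predicting 0 (false negative) costs 10

svmStruct = fitcsvm(X, label, ...
    'KernelFunction', 'rbf', ...
    'ClassNames', [0 1], ...
    'Cost', costMatrix);

% Confusion matrix on the training data, rows = true, columns = predicted.
classes = predict(svmStruct, X);
cmat = confusionmat(label, classes, 'Order', [0 1]);
disp(cmat)
```

Raising the false-negative cost would typically trade some specificity for higher sensitivity, so it may be worth trying several ratios and comparing the resulting confusion matrices on held-out data.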

Alternatively, you could undersample the majority class or oversample the minority class.
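
A minimal sketch of random undersampling of the majority class, assuming label 0 is the majority and a 1:1 target ratio. The synthetic data and the names Xbal and labelBal are hypothetical. In practice the resampling would presumably be applied to the training rows only, leaving the test set at its natural class ratio.

```matlab
% Sketch: random undersampling of the majority class (label 0).
% The data and the 1:1 target ratio are illustrative assumptions.
rng(1);                                   % reproducible sampling
label = [zeros(980,1); ones(20,1)];
X = randn(numel(label), 20);

idxMinor = find(label == 1);              % rows of the minority class
idxMajor = find(label == 0);              % rows of the majority class

% Keep as many majority rows as there are minority rows, chosen at random
% without replacement.
keepMajor = idxMajor(randperm(numel(idxMajor), numel(idxMinor)));
keep = sort([idxMinor; keepMajor]);

Xbal = X(keep, :);
labelBal = label(keep);
fprintf('kept %d rows: %d of class 0, %d of class 1\n', ...
    numel(labelBal), sum(labelBal == 0), sum(labelBal == 1));
```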

Marjolein Fokkema

Thank you for your answer. My dataset is supposed to be imbalanced quite like a fraud detection dataset where the number of fraud instances will be far less in comparison to non-fraud. So, for naturally imbalanced dataset would undersampling/oversampling apply? – Srishti M Jun 07 '18 at 15:28
Another question: should we be looking at increasing the TP? Say out of 900 test data points, 883 are known to be labelled as class 0 and 17 as class 1. Say the classification results from the diagonal of the confusion matrix are 97.9% TP for class 0 and 14.2% for class 1. How do I decide if this is good or bad? – Srishti M Jun 07 '18 at 16:31
I don't know if SVMs employ some kind of sampling strategy; if they do, undersampling the majority class could be effective. Under/oversampling would definitely apply for naturally imbalanced data. I think in general, undersampling the majority class is better than oversampling the minority class (e.g., Drummond & Holte, 2003, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II (Vol. 11, pp. 1-8). Washington DC.) – Marjolein Fokkema Jun 08 '18 at 10:27
Thank you for your comments, but I found them a bit confusing. The aim should be increasing the sensitivity or specificity of a particular class, probably the minority class, which is the `fraud` case in this example. Can you please clarify? – Srishti M Jun 08 '18 at 17:49
Indeed confusing, I removed the comment. Let's call a case flagged as 'fraud' a positive, a case flagged as 'no fraud' a negative. You mention true negative and true positive rates, but they depend on the base rate, whereas sens and spec do not. To catch all 'fraud' cases, sens (the proportion of 'fraud' cases flagged as 'fraud' by the SVM) should be as close to 1 as possible. Maximizing sens will reduce spec (the proportion of 'no fraud' cases flagged as 'no fraud'). The cost of FPs relative to FNs (i.e., the misclassification costs) more or less determines how much spec may be reduced to increase sens. – Marjolein Fokkema Jun 10 '18 at 10:50
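
To make the definitions of sens and spec concrete, here is a small sketch that computes both from a 2x2 confusion matrix laid out as confusionmat returns it (rows = true class, columns = predicted class). The row totals of 883 and 17 follow the test set described earlier in the thread, but the individual cell counts are hypothetical.

```matlab
% Hypothetical confusion matrix, classes ordered [0 1]
% (0 = 'no fraud', 1 = 'fraud'); rows = true class, columns = predicted.
cmat = [870 13;
          5 12];

TN = cmat(1,1);  FP = cmat(1,2);
FN = cmat(2,1);  TP = cmat(2,2);

sens = TP / (TP + FN);   % proportion of 'fraud' cases flagged as 'fraud'
spec = TN / (TN + FP);   % proportion of 'no fraud' cases flagged as 'no fraud'
fprintf('sensitivity = %.3f, specificity = %.3f\n', sens, spec);
```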
Thanks for the clarification. One last question: for binary classification, if the F-measure = NaN and overall accuracy is 98%, then the model is incorrect, right? Under what circumstances can this happen? – Srishti M Jun 10 '18 at 19:34
The F-measure is the harmonic mean of sensitivity and the proportion of TPs among the cases flagged as 'fraud' (abbreviated TP below). F can be NaN in 3 situations: 1) Sens or TP is exactly 0. 2) Sens is NaN. This can happen if the data in which F is assessed do not contain any 'fraud' cases. 3) TP is NaN. This can happen if the data in which F is assessed do not contain any cases flagged as 'fraud'. If you are using a separate test sample to calculate the F-measure, situations 2) or 3) could have occurred, but this does not necessarily mean the model performs badly. Situation 1) does indicate that the model performs badly. – Marjolein Fokkema Jun 10 '18 at 22:43
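
A short sketch of how a NaN F-measure can coexist with high accuracy, assuming a hypothetical classifier that flags none of the 900 test cases (883 'no fraud', 17 'fraud') as 'fraud'. Here prec stands for the proportion of flagged cases that are true positives, which the comment above abbreviates as TP.

```matlab
% Hypothetical case: nothing is flagged as 'fraud'.
% Rows = true class, columns = predicted class, classes ordered [0 1].
cmat = [883 0;
         17 0];

TP = cmat(2,2);  FP = cmat(1,2);  FN = cmat(2,1);

sens = TP / (TP + FN);                 % 0/17 = 0
prec = TP / (TP + FP);                 % 0/0, which MATLAB evaluates to NaN
F    = 2 * sens * prec / (sens + prec);
acc  = sum(diag(cmat)) / sum(cmat(:)); % 883/900, about 98.1%

fprintf('sens = %g, prec = %g, F = %g, accuracy = %.1f%%\n', ...
    sens, prec, F, 100 * acc);
```

In this constructed case the accuracy is high only because the majority class dominates, while not a single 'fraud' case is caught.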
Do not use accuracy to evaluate a classifier: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) The same arguments apply to the F1 score. In addition, unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jul 26 '18 at 06:43
