I am trying to classify a data set X of 2000 examples (rows) and 20 features (columns), following the example code given here: https://stackoverflow.com/questions/4976539/support-vector-machines-in-matlab/4980055#4980055
The class labels are 0 or 1. I am using MATLAB 2018.
I have applied cross-validation using the HoldOut method, with a 60/40 split of the data X. The data set always contains more 0's (NO) than 1's (YES). I can swap the labels (make 1 into 0), but there will still be an imbalance. My data set is naturally imbalanced: in a real-world scenario it represents bank transactions, so there will be more non-fraud transactions (labelled 0) than fraud transactions (labelled 1). In my case, misclassifying observations of the class labelled 1 has more severe consequences than misclassifying observations of the other class.
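The setup described above (a stratified 60/40 holdout plus a heavier penalty for missing class 1) could be sketched roughly as follows. This is only an illustration on synthetic data: the 2% positive rate, the shift applied to class 1, and the 10:1 cost ratio are all assumptions rather than values from my data set, and the variable names trainIdx, testIdx, and costMat are hypothetical. It relies on cvpartition and on the 'Cost' and 'ClassNames' options of fitcsvm from the Statistics and Machine Learning Toolbox.

```matlab
% Illustrative sketch on synthetic data; X, label, and the 10:1 cost are assumptions.
rng(0);                                   % reproducible random numbers
n = 2000;
label = double(rand(n,1) < 0.02);         % roughly 2% positives (hypothetical rate)
X = randn(n,20);
X(label==1,:) = X(label==1,:) + 1.5;      % shift class 1 so it is learnable

% Stratified 60/40 holdout: passing the labels keeps the class ratio in both parts.
c = cvpartition(label,'HoldOut',0.4);
trainIdx = training(c);
testIdx  = test(c);

% Cost(i,j) = cost of predicting class j when the true class is i,
% with rows and columns ordered as in ClassNames.
costMat = [0 1; 10 0];                    % missing a 1 costs 10x a false alarm (assumed)

mdl = fitcsvm(X(trainIdx,:),label(trainIdx), ...
    'KernelFunction','rbf','KernelScale','auto','Standardize',true, ...
    'ClassNames',[0 1],'Cost',costMat);

pred = predict(mdl,X(testIdx,:));
cmat = confusionmat(label(testIdx),pred,'Order',[0 1]);
disp(cmat)
```

The cost ratio would need tuning for a real problem; 10:1 here is just a placeholder to show where the asymmetry enters.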
Question: The following is the code:
[g gn] = grp2idx(label); %# nominal class to numeric
%# split training/testing sets
[train, test] = crossvalind('holdOut',label,0.5);
cp = classperf(label);
%% Use the fitcsvm function to train an SVM classifier with a radial basis function kernel.
svmStruct = fitcsvm(X(train,:),label(train),'KernelFunction','rbf');
%% Classify the test set using a support vector machine.
classes = predict(svmStruct,X(test,:));
%% Evaluate the performance of the classifier.
classperf(cp,classes,test);
cp.CorrectRate
cmat = confusionmat(label(test),classes);
acc = 100*sum(diag(cmat))./sum(cmat(:));
fprintf('SVM (1-against-1):\naccuracy = %.2f%%\n', acc);
fprintf('Confusion Matrix:\n'), disp(cmat)
I am getting all predicted classes as 1 and an accuracy of 98.1%. Even though the accuracy is so high, the predicted class labels are mostly incorrect.
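To show why a high overall accuracy can hide this kind of failure on imbalanced data, here is a minimal sketch that computes per-class recall and balanced accuracy from a confusion matrix. The counts are made up for illustration (a classifier that assigns every observation to the majority class) and are not my actual results; the layout assumes rows are true classes and columns are predicted classes, as returned by confusionmat.

```matlab
% Made-up confusion matrix (rows = true class, columns = predicted class,
% order [0 1]); these counts are illustrative, not measured results.
cmat = [980 0;    % true 0: 980 predicted 0, 0 predicted 1
         20 0];   % true 1:  20 predicted 0, 0 predicted 1

acc    = 100*sum(diag(cmat))/sum(cmat(:));   % overall accuracy in percent
recall = diag(cmat)./sum(cmat,2);            % per-class recall: [class 0; class 1]
balAcc = 100*mean(recall);                   % balanced accuracy in percent

fprintf('accuracy          = %.2f%%\n', acc);
fprintf('recall (class 0)  = %.2f\n', recall(1));
fprintf('recall (class 1)  = %.2f\n', recall(2));
fprintf('balanced accuracy = %.2f%%\n', balAcc);
```

With these counts the overall accuracy is 98% while the recall for class 1 is 0, so the minority class is never detected.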
I know that missing a positive is worse than raising a false positive, so how do I fix this problem for SVM? Is there any other option in SVM that can handle an imbalanced problem?